datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit (try to fully ingest Tableau Server) #8475

Closed: YuriyGavrilov closed this issue 1 year ago

YuriyGavrilov commented 1 year ago

Describe the bug

Hi, there is a strange error with Tableau ingestion. We can't run it properly on the latest CLI version, 0.10.5.1.

```
request method get, url: https://tableau-dev.group.s7/api/3.10/sites/336df529-0426-4f78-8eec-71ac07fdc5c9/workbooks
[08:11:25] Begin blocking request to https://tableau-dev.group.s7/api/3.10/sites/336df529-0426-4f78-8eec-71ac07fdc5c9/workbooks
[08:11:25] Call finished
[08:11:25] Request complete
[08:11:26] Async request returned: received <Response [200]>
Response status: <Response [200]>
Server response from https://tableau-dev.group.s7/api/3.10/sites/336df529-0426-4f78-8eec-71ac07fdc5c9/workbooks
2023-07-21 08:11:26.590688 [exec_id=54e902c0-2172-45e9-bea9-66f553c1dced] INFO: Caught exception EXECUTING task_id=54e902c0-2172-45e9-bea9-66f553c1dced, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 540, in readline
    line = await self.readuntil(sep)
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 635, in readuntil
    raise exceptions.LimitOverrunError(
asyncio.exceptions.LimitOverrunError: Separator is found, but chunk is longer than limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
    task_event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 147, in execute
    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 99, in _read_output_lines
    line_bytes = await ingest_process.stdout.readline()
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
    raise ValueError(e.args[0])
ValueError: Separator is found, but chunk is longer than limit
```

Execution finished with errors.
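For context, the `LimitOverrunError` above is generic asyncio behavior: stream readers buffer at most 64 KiB per line by default, and `readline()` fails when a single line of subprocess output exceeds that. A minimal sketch, independent of DataHub's actual code, of how passing a larger `limit` to `asyncio.create_subprocess_exec` avoids the error when reading long lines:

```python
import asyncio
import sys

async def read_long_line() -> int:
    # Spawn a child process that prints one very long line (200,000 bytes
    # plus a trailing newline). With asyncio's default 64 KiB stream limit,
    # readline() on this output would raise LimitOverrunError (surfaced as
    # ValueError, as in the traceback above).
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('x' * 200_000)",
        stdout=asyncio.subprocess.PIPE,
        limit=1024 * 1024,  # raise the per-line buffer to 1 MiB
    )
    line = await proc.stdout.readline()
    await proc.wait()
    return len(line)

print(asyncio.run(read_long_line()))  # 200001 (payload + newline)
```

This is only an illustration of the failure mode; the actual fix the maintainers point to below is upgrading datahub-actions, which reads the ingestion subprocess output.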

Full log attached: exec-urn li dataHubExecutionRequest 54e902c0-2172-45e9-bea9-66f553c1dced.log

To Reproduce

Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Properly ingest the full Tableau server (thousands of dashboards).


Additional context

We have been trying to fully ingest Tableau for about a year and are still hitting errors.

This yaml we used to ingest:

```yaml
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - 'urn:li:tag:product.TABLEAU_BIG_TEST'
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms-datahub-gms:8080'
    token: '${ingestions_by_yua}'
source:
  type: tableau
  config:
    ingest_owner: true
    platform_instance: TABLEAUDEV
    connect_uri: 'https://tableau-dev.group.s7'
    password: '${tableau_dev}'
    ingest_tables_external: true
    env: DEV
    ingest_tags: true
    username: datahub_tableau_dev
    page_size: 1
    stateful_ingestion:
      enabled: false
      remove_stale_metadata: true
```

mayurinehate commented 1 year ago

@YuriyGavrilov what is your datahub-actions version ? This issue was resolved in datahub actions version 0.0.8 and above.

YuriyGavrilov commented 1 year ago

> @YuriyGavrilov what is your datahub-actions version ? This issue was resolved in datahub actions version 0.0.8 and above.

Hm, @mayurinehate, it seems we are on an old version: acryl-datahub-actions @ file:///actions-src, acryl-executor==0.0.3.5

Will try to fix it and come back with results. Thanks @mayurinehate for giving me the idea.

YuriyGavrilov commented 1 year ago

@mayurinehate We have updated datahub-actions. The "asyncio.exceptions.LimitOverrunError: Separator" error is gone, but there is another one.

This is just a part of the errors:

```
com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)
	at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:191)
	at jdk.internal.reflect.GeneratedMethodAccessor53.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)
	at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)
	at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)
	at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:262)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232)
	at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)
	at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)
	at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:130)
	at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)
	at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106)
	at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)
	at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)
	at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)
	at com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at com.linkedin.restli.server.RestliHandlerServlet.service(RestliHandlerServlet.java:21)
	at com.linkedin.restli.server.RestliHandlerServlet.handleRequest(RestliHandlerServlet.java:26)
	at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)
	at com.datahub.auth.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:102)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?
	at io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:52)
	at io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:219)
	at io.ebeaninternal.server.transaction.TransactionManager.translate(TransactionManager.java:246)
	at io.ebeaninternal.server.transaction.JdbcTransaction.translate(JdbcTransaction.java:698)
	at io.ebeaninternal.server.transaction.JdbcTransaction.batchFlush(JdbcTransaction.java:680)
	at io.ebeaninternal.server.transaction.JdbcTransaction.internalBatchFlush(JdbcTransaction.java:796)
	at io.ebeaninternal.server.transaction.JdbcTransaction.flushCommitAndNotify(JdbcTransaction.java:1005)
	at io.ebeaninternal.server.transaction.JdbcTransaction.commit(JdbcTransaction.java:1057)
	at io.ebeaninternal.api.ScopeTrans.commitTransaction(ScopeTrans.java:136)
	at io.ebeaninternal.api.ScopedTransaction.commit(ScopedTransaction.java:110)
	at com.linkedin.metadata.entity.ebean.EbeanAspectDao.runInTransactionWithRetry(EbeanAspectDao.java:503)
	at com.linkedin.metadata.entity.EntityService.ingestAspectToLocalDB(EntityService.java:585)
	at com.linkedin.metadata.entity.EntityService.wrappedIngestAspectToLocalDB(EntityService.java:538)
	at com.linkedin.metadata.entity.EntityService.upsertAspect(EntityService.java:999)
	at com.linkedin.metadata.entity.EntityService.performUpsert(EntityService.java:948)
	at com.linkedin.metadata.entity.EntityService.ingestProposal(EntityService.java:896)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:198)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$3(AspectResource.java:198)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
	... 88 more
Caused by: java.sql.BatchUpdateException: Batch entry 0 update metadata_aspect_v2 set metadata='{"guid":"d8d62ce94b24e66f026e37dcdda2a9b0"}', createdOn='2023-07-21 11:50:43.813+00', createdBy='urn:li:corpuser:v.platonov', createdFor=NULL, systemmetadata='{"registryVersion":"0.0.0.0-dev","runId":"fb1ccde6-f851-48e0-91d9-3c7c0e109f71","registryName":"unknownRegistry","lastObserved":1689940243782}' where urn='urn:li:container:d8d62ce94b24e66f026e37dcdda2a9b0' and aspect='containerKey' and version=0 was aborted: ERROR: could not serialize access due to concurrent update  Call getNextException to see other errors in the batch.
	at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:165)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:559)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:887)
	at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:910)
	at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1649)
	at io.ebean.datasource.delegate.PreparedStatementDelegator.executeBatch(PreparedStatementDelegator.java:357)
	at io.ebeaninternal.server.persist.BatchedPstmt.executeAndCheckRowCounts(BatchedPstmt.java:130)
	at io.ebeaninternal.server.persist.BatchedPstmt.executeBatch(BatchedPstmt.java:97)
	at io.ebeaninternal.server.persist.BatchedPstmtHolder.flush(BatchedPstmtHolder.java:124)
	at io.ebeaninternal.server.persist.BatchControl.flushPstmtHolder(BatchControl.java:206)
	at io.ebeaninternal.server.persist.BatchControl.executeNow(BatchControl.java:220)
	at io.ebeaninternal.server.persist.BatchedBeanHolder.executeNow(BatchedBeanHolder.java:100)
	at io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:271)
	at io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:227)
	at io.ebeaninternal.server.transaction.JdbcTransaction.batchFlush(JdbcTransaction.java:678)
	... 103 more
Caused by: org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
	... 117 more

[2023-07-21 11:50:43,864] ERROR {datahub.ingestion.run.pipeline:68} - failed to write record with workunit urn:li:container:d8d62ce94b24e66f026e37dcdda2a9b0-subTypes with ('Unable to emit metadata to DataHub GMS: javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?', 'status': 500, 'id': 'urn:li:container:d8d62ce94b24e66f026e37dcdda2a9b0'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?', 'status': 500, 'id': 'urn:li:container:d8d62ce94b24e66f026e37dcdda2a9b0'}
```
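For context, the root cause at the bottom of the trace, `ERROR: could not serialize access due to concurrent update`, is PostgreSQL's serialization failure (SQLSTATE 40001), raised when concurrent transactions update the same row. The standard remedy is to retry the failed transaction with backoff. A generic sketch of that pattern (not DataHub GMS's actual code, which lives in `EbeanAspectDao.runInTransactionWithRetry`; the `sqlstate` attribute below is a hypothetical interface, since real drivers expose the code differently, e.g. psycopg2's `pgcode`):

```python
import random
import time

SERIALIZATION_FAILURE = "40001"  # PostgreSQL SQLSTATE for "could not serialize access"

def run_with_retry(txn, attempts=5):
    """Retry a transaction that may abort with a serialization failure.

    `txn` is any callable performing one transaction; it is assumed to raise
    an exception carrying a `sqlstate` attribute on failure (a hypothetical
    interface for illustration only).
    """
    for attempt in range(attempts):
        try:
            return txn()
        except Exception as exc:
            if getattr(exc, "sqlstate", None) != SERIALIZATION_FAILURE:
                raise  # not a retryable error; propagate
            # Exponential backoff with jitter before re-running the transaction.
            time.sleep((2 ** attempt) * 0.05 + random.random() * 0.05)
    raise RuntimeError("transaction kept failing with serialization errors")
```

Conflicts like this are expected when many ingestion workers upsert the same container aspects concurrently, which is why the retry lives server-side in GMS.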

YuriyGavrilov commented 1 year ago

This issue is solved by updating datahub-actions to the latest version

YuriyGavrilov commented 1 year ago

I caught this error again after updating datahub-actions and the entire datahub to version 10.5.

Tried to ingest just two Tableau projects, with this config:

```yaml
transformers:
  - type: simple_add_dataset_tags
    config:
      tag_urns:
        - 'urn:li:tag:product.BI_Экипажи'
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-gms-datahub-gms:8080'
    token: '${ingestions_by_yua}'
source:
  type: tableau
  config:
    ingest_owner: false
    platform_instance: TABLEAUDEV
    connect_uri: 'https://tableau-dev.group.s7'
    password: '${tableau_dev}'
    ingest_tables_external: true
    ingest_tags: true
    username: datahub_tableau_dev
    page_size: 2
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    project_pattern:
      allow:
```

The resulting log is:

exec-urn li dataHubExecutionRequest b75c4836-f696-4116-a91a-db7c3a60e3ae.log

hsheth2 commented 1 year ago

@YuriyGavrilov from the logs, it looks like you're not on the latest datahub-actions version. This error will be fixed by upgrading.

YuriyGavrilov commented 1 year ago

@hsheth2 Thanks for your reply, that’s correct. We solved it.