datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0
9.89k stars 2.93k forks source link

DuplicateKeyException when ingesting dbt metadata #4124

Closed LucasRoesler closed 2 years ago

LucasRoesler commented 2 years ago

Describe the bug When I try to ingest dbt metadata, I get an error that the ingestion failed with a backoff. Looking into the stack trace I see DuplicateKeyException

Oddly, I can get the ingestion to work in one of two ways

  1. do an initial ingestion of the source db, e.g. directly ingest the postgres db and then do the dbt ingestion
  2. When I disable the dbt node creation using disable_dbt_node_creation: True and then reingest the same dbt data with disable_dbt_node_creation: False

To Reproduce

Because the reproduction is a bit complex, I have created a complete reproduction repo here https://github.com/LucasRoesler/dbt-datahub-ingestion-bug

I have included the exception text here ``` 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: ' 'com.datahub.util.exception.RetryLimitReached: Failed to add after 3 retries\n' '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n' '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)\n' '\tat com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:178)\n' '\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n' '\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n' '\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n' '\tat java.lang.reflect.Method.invoke(Method.java:498)\n' '\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:172)\n' '\tat com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:326)\n' '\tat ' 'com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)\n' '\tat ' 'com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)\n' '\tat ' 'com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)\n' '\tat java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)\n' '\tat java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)\n' '\tat java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)\n' '\tat ' 'com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)\n' '\tat com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)\n' '\tat com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:218)\n' '\tat ' 'com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:242)\n' '\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:211)\n' '\tat com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:181)\n' '\tat com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:164)\n' '\tat com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:120)\n' '\tat com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)\n' '\tat ' 'com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)\n' '\tat com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)\n' '\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n' '\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n' '\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n' '\tat com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)\n' '\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n' '\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n' '\tat com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)\n' '\tat com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)\n' '\tat com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:72)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)\n' '\tat ' 'com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)\n' '\tat com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)\n' '\tat com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:96)\n' '\tat com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)\n' '\tat ' 'com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)\n' '\tat com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)\n' '\tat com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)\n' '\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n' '\tat ' 'com.linkedin.restli.server.spring.ParallelRestliHttpRequestHandler.handleRequest(ParallelRestliHttpRequestHandler.java:63)\n' '\tat ' 'org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)\n' '\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n' '\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:852)\n' '\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1604)\n' '\tat com.datahub.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:77)\n' '\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1591)\n' '\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:542)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n' '\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:536)\n' '\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n' '\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1581)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n' '\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1307)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n' '\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:482)\n' '\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1549)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n' '\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1204)\n' '\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n' '\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n' '\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n' '\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n' '\tat org.eclipse.jetty.server.Server.handle(Server.java:494)\n' '\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:374)\n' '\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:268)\n' '\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n' '\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n' '\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)\n' '\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n' '\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n' '\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n' '\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n' '\tat ' 'org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)\n' '\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)\n' '\tat org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:918)\n' '\tat java.lang.Thread.run(Thread.java:748)\n' 'Caused by: com.datahub.util.exception.RetryLimitReached: Failed to add after 3 retries\n' '\tat com.linkedin.metadata.entity.ebean.EbeanAspectDao.runInTransactionWithRetry(EbeanAspectDao.java:458)\n' '\tat ' 'com.linkedin.metadata.entity.ebean.EbeanEntityService.ingestAspectsToLocalDB(EbeanEntityService.java:326)\n' '\tat com.linkedin.metadata.entity.EntityService.wrappedIngestAspectsToLocalDB(EntityService.java:239)\n' '\tat com.linkedin.metadata.entity.EntityService.ingestAspects(EntityService.java:291)\n' '\tat com.linkedin.metadata.entity.EntityService.ingestSnapshotUnion(EntityService.java:703)\n' '\tat com.linkedin.metadata.entity.EntityService.ingestEntity(EntityService.java:606)\n' '\tat com.linkedin.metadata.resources.entity.EntityResource.lambda$ingest$4(EntityResource.java:179)\n' '\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)\n' '\t... 85 more\n' 'Caused by: io.ebean.DuplicateKeyException: Error when batch flush on sql: insert into metadata_aspect_v2 ' '(urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?,?,?,?,?)\n' '\tat io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:46)\n' '\tat io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:219)\n' '\tat io.ebeaninternal.server.transaction.TransactionManager.translate(TransactionManager.java:246)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.translate(JdbcTransaction.java:698)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.batchFlush(JdbcTransaction.java:680)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.internalBatchFlush(JdbcTransaction.java:796)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.flushCommitAndNotify(JdbcTransaction.java:1005)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.commit(JdbcTransaction.java:1057)\n' '\tat com.linkedin.metadata.entity.ebean.EbeanAspectDao.runInTransactionWithRetry(EbeanAspectDao.java:449)\n' '\t... 92 more\n' 'Caused by: java.sql.BatchUpdateException: Duplicate entry ' "'urn:li:dataset:(urn:li:dataPlatform:snowflake,citibike_tripdata.' for key 'PRIMARY'\n" '\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n' '\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n' '\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n' '\tat java.lang.reflect.Constructor.newInstance(Constructor.java:423)\n' '\tat com.mysql.cj.util.Util.handleNewInstance(Util.java:192)\n' '\tat com.mysql.cj.util.Util.getInstance(Util.java:167)\n' '\tat com.mysql.cj.util.Util.getInstance(Util.java:174)\n' '\tat com.mysql.cj.jdbc.exceptions.SQLError.createBatchUpdateException(SQLError.java:224)\n' '\tat com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchSerially(ClientPreparedStatement.java:853)\n' '\tat com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:435)\n' '\tat com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:796)\n' '\tat ' 'io.ebean.datasource.delegate.PreparedStatementDelegator.executeBatch(PreparedStatementDelegator.java:357)\n' '\tat io.ebeaninternal.server.persist.BatchedPstmt.executeAndCheckRowCounts(BatchedPstmt.java:130)\n' '\tat io.ebeaninternal.server.persist.BatchedPstmt.executeBatch(BatchedPstmt.java:97)\n' '\tat io.ebeaninternal.server.persist.BatchedPstmtHolder.flush(BatchedPstmtHolder.java:124)\n' '\tat io.ebeaninternal.server.persist.BatchControl.flushPstmtHolder(BatchControl.java:206)\n' '\tat io.ebeaninternal.server.persist.BatchControl.executeNow(BatchControl.java:220)\n' '\tat io.ebeaninternal.server.persist.BatchedBeanHolder.executeNow(BatchedBeanHolder.java:95)\n' '\tat io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:271)\n' '\tat io.ebeaninternal.server.persist.BatchControl.flush(BatchControl.java:227)\n' '\tat io.ebeaninternal.server.transaction.JdbcTransaction.batchFlush(JdbcTransaction.java:678)\n' '\t... 96 more\n' 'Caused by: java.sql.SQLIntegrityConstraintViolationException: Duplicate entry ' "'urn:li:dataset:(urn:li:dataPlatform:snowflake,citibike_tripdata.' for key 'PRIMARY'\n" '\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:117)\n' '\tat com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)\n' '\tat com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)\n' '\tat com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)\n' '\tat com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1092)\n' '\tat com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchSerially(ClientPreparedStatement.java:832)\n' '\t... 108 more\n' ```

Expected behavior The ingestion should correctly create both the postgres and the dbt nodes without any prerequisite steps

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

jjoyce0510 commented 2 years ago

Hi @LucasRoesler. If you've had your DataHub instance for some time, you may need to update the DB character encoding - Can you see if following this FAQ works for you? https://datahubproject.io/docs/debugging/#im-seeing-exceptions-in-datahub-gms-container-like-caused-by-javalangillegalstateexception-duplicate-key-comlinkedinmetadataentityebeanebeanaspectv2dd26e011-what-do-i-do

shirshanka commented 2 years ago

Thanks for reporting this with a reproducible setup. The issue has been fixed in acryl-datahub==0.8.26.8. Please re-open if it persists.