airbytehq / airbyte


Actor Catalog Fetch Event Duplicate Key exception #21208

Open michelgalle opened 1 year ago

michelgalle commented 1 year ago

Environment

Current Behavior

After setting up a connector (unfortunately I don't have the logs, nor do I remember which connector it was), I started getting the following error in the UI: [screenshot of the UI error dialog]

Expected Behavior

The UI behaves as expected.

Logs

The only logs I can get are the ones posted in a Slack thread.

2022-12-22 21:49:01 ERROR i.a.s.a.ApiHelper(execute):28 - Unexpected Exception
java.lang.IllegalStateException: Duplicate key b601e028-e003-4290-9acc-5fc72ca44776 (attempted merging values io.airbyte.config.ActorCatalogFetchEvent@50671700[id=<null>,actorId=b601e028-e003-4290-9acc-5fc72ca44776,actorCatalogId=29181719-0c50-4489-b5aa-03c0e10cb227,configHash=<null>,connectorVersion=<null>,createdAt=1671739270] and io.airbyte.config.ActorCatalogFetchEvent@69081582[id=<null>,actorId=b601e028-e003-4290-9acc-5fc72ca44776,actorCatalogId=d664d0dd-2ded-4a9e-9737-fba55e1e79eb,configHash=<null>,connectorVersion=<null>,createdAt=1671739270])

Steps to Reproduce

  1. Create a new ingestion.
  2. During schema discovery, Airbyte somehow inserted two rows with the same timestamp into the actor_catalog_fetch_event table in the metadata database. I am not sure why that happened.

I already did some troubleshooting here and know the reason for the error. Since there can be two rows with the same timestamp, this query should use row_number() instead of rank(). Since I am new to Airbyte, I can't say whether the code that inserts into the table should be fixed so that two rows with the same timestamp are never written for the same actor, or whether fixing the query mentioned above is enough.
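
To make the rank() vs row_number() difference concrete: for two rows sharing actor_id and created_at, rank() assigns 1 to both, so a keep-only-rank-1 filter still returns two rows for that actor, while row_number() numbers them 1 and 2 and keeps exactly one. A rough jOOQ-style sketch with plain string field names (illustrative only, not the actual Airbyte query):

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.rank;
import static org.jooq.impl.DSL.rowNumber;

import org.jooq.Field;

public class WindowFunctionSketch {
    // For two fetch events of the same actor with equal created_at:
    //   actor_id | created_at | rank() | row_number()
    //   a1       | t0         | 1      | 1
    //   a1       | t0         | 1      | 2   <- rank() ties, row_number() does not
    static final Field<Integer> BY_RANK = rank().over()
            .partitionBy(field("actor_id"))
            .orderBy(field("created_at").desc());

    static final Field<Integer> BY_ROW_NUMBER = rowNumber().over()
            .partitionBy(field("actor_id"))
            .orderBy(field("created_at").desc());

    public static void main(String[] args) {
        // Render both expressions; the generated SQL differs only in the function name,
        // but only row_number() guarantees a single "1" per actor.
        System.out.println(BY_RANK);
        System.out.println(BY_ROW_NUMBER);
    }
}

Filtering on BY_RANK = 1 can keep both tied rows and later blow up the per-actor map build; filtering on BY_ROW_NUMBER = 1 keeps exactly one row per actor.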

Are you willing to submit a PR?

No

hugozap commented 1 year ago

I can confirm that using row_number as @michelgalle suggested makes the problem go away, at least from a UI standpoint.

mfsiega-airbyte commented 1 year ago

Thanks for the report, and the investigation!

It seems like the two rows with the same timestamp are happening because the frontend is actually requesting schema discovery twice. This isn't intended, but it shouldn't cause the backend to fail like this. Indeed, as you note, the combination of actor_catalog_id, actor_id, and created_at isn't guaranteed to be unique.

Using row_number seems like a good option. We can also include the id column which is intended to be unique. I'll send a PR with those changes.
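
If I read the proposal correctly, the overall query shape would be roughly the sketch below: row_number() per actor, newest first, with the unique id column as the tie-breaker, then keep only row number 1. The string table and column names are taken from the error output above; this is a sketch of the idea, not the actual PR.

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.rowNumber;
import static org.jooq.impl.DSL.table;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.Table;
import org.jooq.impl.DSL;

public class MostRecentFetchEventSketch {
    public static void main(String[] args) {
        DSLContext ctx = DSL.using(SQLDialect.POSTGRES);

        // Inner select: number each actor's fetch events, newest first, with the
        // unique id column breaking ties between rows that share created_at.
        Table<?> ranked = ctx
                .select(field("actor_id"), field("actor_catalog_id"),
                        rowNumber().over()
                                .partitionBy(field("actor_id"))
                                .orderBy(field("created_at").desc(), field("id").desc())
                                .as("rn"))
                .from(table("actor_catalog_fetch_event"))
                .asTable("ranked");

        // Outer select: exactly one row per actor, even when timestamps collide.
        System.out.println(ctx
                .select(field("actor_id"), field("actor_catalog_id"))
                .from(ranked)
                .where(field("rn").eq(1))
                .getSQL());
    }
}

The point is that the tie-break lives entirely in the SQL, so the Java-side map build never sees two rows for the same actor.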

hugozap commented 1 year ago

@mfsiega-airbyte I found that both getMostRecentActorCatalogFetchEventForSource and getMostRecentActorCatalogFetchEventForSources need to be updated as they both fail when there are duplicate records.

For getMostRecentActorCatalogFetchEventForSource I had to include the ID as part of the groupBy (as stated in the error I got).

After both methods were patched, I could set up a connection successfully.

mfsiega-airbyte commented 1 year ago

@hugozap thanks for the pointer! I put up a PR for getMostRecentActorCatalogFetchEventForSources since that one was a bit clearer.

For getMostRecentActorCatalogFetchEventForSource I'm not totally following - I don't see in the code that it's doing a groupBy. Am I misunderstanding? (If you have your patch in a fork/branch somewhere, feel free to point me there instead?)

hugozap commented 1 year ago

@mfsiega-airbyte Sorry I was not clear.

With getMostRecentActorCatalogFetchEventForSource, the problem is that it will also fail if duplicate records were written. I know this because I first patched getMostRecentActorCatalogFetchEventForSources locally and the UI worked, but the moment I tried to set up a new connection I got an error again. The cause of the error is that limit(1) will fail because it won't know which record to choose.

I fixed it locally by adding ACTOR_CATALOG_FETCH_EVENT.id.desc() to the ORDER BY to break the tie, so limit(1) works again.
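
For anyone following along, a rough jOOQ sketch of that shape with plain string identifiers (not the actual local patch; the table and column names come from the error output above, and the source id is the one from the stack trace):

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class MostRecentForSourceSketch {
    public static void main(String[] args) {
        DSLContext ctx = DSL.using(SQLDialect.POSTGRES);
        String sourceId = "b601e028-e003-4290-9acc-5fc72ca44776";

        // Newest fetch event for a single source; id.desc() breaks created_at ties,
        // so limit(1) always resolves to the same single row.
        System.out.println(ctx
                .select(field("actor_catalog_id"), field("created_at"))
                .from(table("actor_catalog_fetch_event"))
                .where(field("actor_id").eq(sourceId))
                .orderBy(field("created_at").desc(), field("id").desc())
                .limit(1)
                .getSQL());
    }
}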

To test that it fails, you would have to simulate the current write bug by inserting a duplicate record (same actor id and created_at).
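
A sketch of such a simulation with jOOQ (the JDBC URL, credentials, and UUIDs are placeholders; if foreign keys are enforced, reuse existing actor_id / actor_catalog_id values instead of random ones):

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

import java.time.OffsetDateTime;
import java.util.UUID;

import org.jooq.DSLContext;
import org.jooq.impl.DSL;

public class DuplicateFetchEventSeed {
    public static void main(String[] args) {
        // Placeholder connection; point this at the Airbyte config database to reproduce.
        DSLContext ctx = DSL.using("jdbc:postgresql://localhost:5432/airbyte", "docker", "docker");

        UUID actorId = UUID.fromString("b601e028-e003-4290-9acc-5fc72ca44776");
        OffsetDateTime sameTimestamp = OffsetDateTime.now();

        // Two fetch events for the same actor with an identical created_at, mimicking the
        // double schema-discovery write described above. config_hash and connector_version
        // stay null, matching the rows shown in the error output.
        for (int i = 0; i < 2; i++) {
            ctx.insertInto(table("actor_catalog_fetch_event"),
                            field("id"), field("actor_id"), field("actor_catalog_id"), field("created_at"))
                    .values(UUID.randomUUID(), actorId, UUID.randomUUID(), sameTimestamp)
                    .execute();
        }
    }
}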