airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[source-intercom] Missing parts in conversation_parts sync #39463

Open everron opened 4 months ago

everron commented 4 months ago

Connector Name

source-intercom

Connector Version

since 0.6.6

What step the error happened?

None

Relevant information

We're using the Intercom source connector on Airbyte Cloud (now v0.6.7) to get conversations and their conversation_parts. Our sync config is the following: incremental, append + dedup, with updated_at as the cursor field and id as the primary key. [screenshot of sync configuration]

I recently identified that the number of parts synced differed from the number expected, by comparing the row count in conversation_parts for each conversation_id against the statistics.count_conversation_parts field in the conversations data. About 60% of the conversations extracted had fewer conversation parts than expected. We use Redshift as the destination database.
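For reference, here is roughly how we run that comparison (a sketch; the credentials are placeholders, and the flattened column name for statistics.count_conversation_parts is an assumption that depends on how your destination normalizes nested fields):

    # Sketch of the discrepancy check against Redshift. Table and column
    # names are assumptions: adjust them to however your destination
    # flattens the nested statistics object.
    import psycopg2

    QUERY = """
        SELECT c.id,
               c.statistics_count_conversation_parts AS expected_parts,
               COUNT(p.id) AS loaded_parts
        FROM intercom.conversations c
        LEFT JOIN intercom.conversation_parts p ON p.conversation_id = c.id
        GROUP BY 1, 2
        HAVING COUNT(p.id) < c.statistics_count_conversation_parts
    """

    with psycopg2.connect(host="...", dbname="...", user="...", password="...") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for conversation_id, expected, loaded in cur.fetchall():
                print(conversation_id, expected, loaded)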

Job history logs did not return any errors and all syncs are marked as successful. I can provide logs if needed.

Relevant log output

No response


bgdanterty commented 3 months ago

I have the same issue. A lot of the conversation_parts data is missing, but if I use a custom Python script that collects the JSON responses from Intercom API 2.10, I can see all of them.
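Roughly, the script does something like this (a minimal sketch; the token is a placeholder and pagination over the conversation list is omitted):

    # Minimal sketch: fetch a single conversation from the Intercom REST API
    # and compare the parts actually returned with the reported statistics.
    import requests

    HEADERS = {
        "Authorization": "Bearer ...",  # placeholder access token
        "Intercom-Version": "2.10",
        "Accept": "application/json",
    }

    def part_counts(conversation_id: str) -> tuple[int, int]:
        resp = requests.get(
            f"https://api.intercom.io/conversations/{conversation_id}",
            headers=HEADERS,
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        # Parts actually present vs. the count reported in statistics.
        actual = len(body["conversation_parts"]["conversation_parts"])
        reported = body["statistics"]["count_conversation_parts"]
        return actual, reported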

luctchak commented 1 month ago

👋 Any update on this issue? It's blocking me as well 😬

Alexandros-Gk commented 1 month ago

We have the same issue. Conversation parts are missing, and conversations are not properly updated either: there are mismatches even on the state field, I guess because the parts are missing!

It is a bit alarming, given that the impact is major and this can happen without warning. I imagine a lot of users may not be aware this is happening to their data! We've also opened a support ticket to help us understand why this happens.

aldogonzalez8 commented 1 month ago

I will look into this.

octavia-squidington-iii commented 1 month ago

Zendesk ticket #8072 has been linked to this issue.

aldogonzalez8 commented 1 month ago

@everron @luctchak @Alexandros-Gk In my investigation I identified that, in the responses from the Intercom API, conversation_parts.total_count is actually less than statistics.count_conversation_parts, and the difference is consistently 1.

@bgdanterty Are you seeing something different? As you mentioned, you get all the data with a Python script.

We have contacted Intercom on this matter, as I'm not sure whether the API just reports a wrong count or is sending fewer records in the conversation_parts array.

[screenshots of the API responses showing the two counts]

everron commented 1 month ago

@aldogonzalez8 There does indeed seem to be an issue with the conversation_parts counter in the statistics of the GET conversation API response. I also opened a ticket with Intercom about this 👍

However, I did see larger diffs (i.e. more than 1 part) between the row count of parts in the loaded table and the real number of parts (seen directly in the Intercom UI).

Alexandros-Gk commented 1 month ago

@aldogonzalez8 Apart from the discrepancy in the API, we have identified certain periods (where the load actually spiked) during which ingestion did not work at all and entire conversations are missing. Not sure if the issue is on Airbyte's or Intercom's side. [screenshot]

From a contact with their CS, they can only look back over a small time window.

aldogonzalez8 commented 1 month ago

Apart from the discrepancy in the API, we have identified certain periods (where the load actually spiked) during which ingestion did not work at all and entire conversations are missing. [...]

@Alexandros-Gk, I'm not sure about this. Did you see anything unusual in the sync logs for those days?

Alexandros-Gk commented 1 month ago

@aldogonzalez8 Apart from a noteworthy drop in sync time, there were no errors or warnings to dive deeper into.

aldogonzalez8 commented 1 month ago

@everron @luctchak @Alexandros-Gk @bgdanterty I got this from our Intercom contact:

For the discrepancy in count_conversation_parts and total_count in REST API v2.11:

It looks like count_conversation_parts includes the source object in the count, but total_count does not include it.

So this is why count_conversation_parts will be one more than total_count.
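In other words, per their explanation, the relationship should be as follows (a sketch based on the response fields discussed in this thread):

    # Per Intercom's explanation: count_conversation_parts includes the
    # source (initial) message, total_count does not, so a difference of
    # exactly 1 is expected and is not by itself a data-loss signal.
    def counts_consistent(conversation: dict) -> bool:
        reported = conversation["statistics"]["count_conversation_parts"]
        total = conversation["conversation_parts"]["total_count"]
        return reported - total == 1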

@everron Did you get something different about this statement?

However, I did see larger diffs (i.e. more than 1 part) between the row count of parts in the loaded table and the real number of parts (seen directly in the Intercom UI).

everron commented 1 month ago

@aldogonzalez8

I got a similar answer from Intercom but pushed for further investigation, since the 1-part difference was misleading.

Did you get something different about this statement?

However, I did see larger diffs (i.e. more than 1 part) between the row count of parts in the loaded table and the real number of parts (seen directly in the Intercom UI).

I was not able to reproduce the large diffs in conversation_parts by simply calling their API. I've only seen them in the data loaded by Airbyte.

aldogonzalez8 commented 1 month ago

Can you expand on the large diffs in conversation_parts? I'm not sure what we are discussing here.

I was not able to reproduce the large diffs in conversation_parts by simply calling their API. I've only seen them in the data loaded by Airbyte.

everron commented 1 month ago

Here is an example.

Calling the Intercom API from Postman: GET https://api.eu.intercom.io/conversations/<conversation_id>

I get

    ...
    "statistics": {
        ...
        "count_conversation_parts": 56
    },
    ...
    "conversation_parts": {
        "type": "conversation_part.list",
        "conversation_parts": [
            ...
        ],
        "total_count": 55
    },
    ...
This is the '1 Diff' issue we were talking about.

Now looking at the number of rows in the conversation_parts table loaded by Airbyte with this sync config: [screenshot of sync configuration]

Only 52 parts were loaded. [screenshot of row count]

For that particular conversation the gap is only 3 parts at most, but for others we're sometimes missing more than half of the conversation. We have no errors in the Airbyte logs.
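For completeness, here is the same three-way check as a short script (the token, credentials, and conversation id are placeholders; table names follow our schema):

    # Compare Intercom's two counts with the number of rows Airbyte loaded
    # into Redshift for a single conversation.
    import psycopg2
    import requests

    CONVERSATION_ID = "..."  # placeholder

    api = requests.get(
        f"https://api.eu.intercom.io/conversations/{CONVERSATION_ID}",
        headers={"Authorization": "Bearer ...", "Accept": "application/json"},
        timeout=30,
    ).json()

    reported = api["statistics"]["count_conversation_parts"]  # 56 in the example above
    total = api["conversation_parts"]["total_count"]          # 55 in the example above

    with psycopg2.connect(host="...", dbname="...", user="...", password="...") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT COUNT(*) FROM intercom.conversation_parts WHERE conversation_id = %s",
                (CONVERSATION_ID,),
            )
            loaded = cur.fetchone()[0]                        # 52 in the example above

    print(f"reported={reported} total={total} loaded={loaded}")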

aldogonzalez8 commented 1 month ago

I was not able to reproduce the large diffs in conversation_parts by simply calling their API. I've only seen them in the data loaded by Airbyte.

That's the conversation_parts table count(*); what about the data saved in conversations.statistics? What is the count_conversation_parts value there?

everron commented 1 month ago

what about data saved in conversations.statistics? What is the count_conversation_parts value?

Same value as in the API response: "count_conversation_parts": 56

aaronsteers commented 1 month ago

@aldogonzalez8 and @pnilan - I'm just coming onto the on-call rotation and checking in on this issue. My first-glance read is that this might be due to faulty cursor logic, but more research will be needed to confirm or disconfirm.

@everron - Can you say when you first noticed this as an issue, or when you believe the issue started?

bgdanterty commented 1 month ago

@bgdanterty Are you seeing something different? As you mentioned, you get all the data using a Python script.

There is always a difference of one record between the statistics and the actual quantity in the API response; this is true.

In both my script and the Airbyte sync, if I compare the number of parts in my long conversations, there is a larger discrepancy between the statistics and the actual number of parts. I think there may be several bugs: something is wrong with the Intercom API, but the parts counts from my custom script and from Airbyte differ as well.

Here is an example:

[screenshot]

We have also begun migrating from Intercom to another platform, but this will not happen soon, and it is still important for us to receive full volumes of data for analysis and load forecasting.

everron commented 1 month ago

@aaronsteers

Can you say when you first noticed this as an issue, or when you believe the issue started?

Hard to tell because of the statistics count issue, but I would say between May 10 and May 13 this year, when the percentage of conversations with mismatched part counts jumped from ~10% to 60%.

@bgdanterty

There's also a hard limit of 500 parts in the conversation API
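A quick way to check whether you're affected (a sketch; it trusts the statistics counter apart from the off-by-one discussed above):

    # Flag conversations whose reported part count exceeds the conversation
    # endpoint's 500-part cap; parts beyond the cap cannot be retrieved
    # from GET /conversations/{id}.
    PART_LIMIT = 500

    def capped_conversations(conversations: list[dict]) -> list[str]:
        return [
            c["id"]
            for c in conversations
            if c["statistics"]["count_conversation_parts"] > PART_LIMIT
        ]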

aaronsteers commented 1 month ago

@everron - re:

There's also a hard limit of 500 parts in the conversation API

Yikes! Is there any workaround that you are aware of? (cc @bgdanterty with the same Q)

If not, we probably should document this as a limitation (if it isn't there already) in our connector docs.

Re:

I think there may be several bugs.

This ☝️ is increasingly my prevailing theory as well.

One possible root cause

On the Airbyte side, I only have one prevailing theory as of now, which is that the connector might be leaking state cursors across the conversations and conversation_parts streams. If this were the case, certain edge cases could lose data: for example, the parent stream (conversations) succeeding while the child stream (conversation_parts) was disabled, failed, or otherwise skipped. If cursor information leaks from one stream to another, then any parts added to a conversation in that window would be missed in the sync and then skipped on subsequent syncs.

The best practice for sources with parent-child streams such as this one is to keep stream state cursors distinct, for the reasons above, even though the child stream doesn't have its own incremental key and must rely on the parent record being marked as changed. If we can confirm that the Intercom source is not behaving this way, then it would be a bug in the connector and we would try to prioritize it for resolution in the coming cycle. (Unfortunately, I've done a rough scoping and this could be a week or two of engineering time to resolve. It is not a quick fix, and we'd likely want to first confirm the root cause and then move to prioritize a refactor once confirmed.)
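To illustrate the theory (pseudocode only, not the connector's actual implementation):

    # Illustration of the leaky-cursor theory only; not the connector's code.
    # Correct behavior: each stream checkpoints and reads its own cursor.
    state = {
        "conversations": {"updated_at": 1715601600},       # parent stream cursor
        "conversation_parts": {"updated_at": 1715428800},  # child keeps its own
    }

    def should_emit(stream: str, record_updated_at: int) -> bool:
        # If the child stream instead compared records against the parent's
        # newer cursor, any part updated between the child's last successful
        # sync and the parent's cursor would be skipped silently, which
        # matches the symptom reported in this issue.
        return record_updated_at > state[stream]["updated_at"]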

Short-term Mitigation

If the above is the root cause (and even if not), a remediation step in this case would be to run a full sync of the child stream (conversation_parts), with the goal of re-syncing any conversations that were skipped in the recurring incremental syncs.
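If it helps, that re-sync can be scripted; here's a sketch against the Airbyte API (the endpoint path and jobType value are my assumptions here, so please verify against your deployment's API reference):

    # Sketch: trigger a reset/full re-sync for the connection via the Airbyte
    # API. Endpoint and payload shape are assumptions; verify them against
    # your Airbyte version's API documentation before use.
    import requests

    AIRBYTE_API = "https://api.airbyte.com/v1"
    TOKEN = "..."          # placeholder API token
    CONNECTION_ID = "..."  # placeholder connection id

    resp = requests.post(
        f"{AIRBYTE_API}/jobs",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"connectionId": CONNECTION_ID, "jobType": "reset"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())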

@everron, @bgdanterty - would you be interested in attempting this mitigation and reporting back?

everron commented 1 month ago

Thanks @aaronsteers

The 500 parts limit didn't bother us too much, as we had fewer than 10 such conversations (out of thousands in 2024). We will live with it.

I'll try the full sync of the conversation_parts streams and will get back to you

Alexandros-Gk commented 1 month ago

[Quoting @aaronsteers above:] If the above is the root cause (and even if not), a remediation step would be to run a full sync on the child stream (conversation_parts), with the goal of re-syncing any conversations that were skipped in the recurring incremental syncs.

  • This mitigation doesn't solve the root cause or prevent it in the future. It would, however, recover any missing records from past syncs.
  • It also doesn't solve what appears to be faulty or inconsistent logic in the Intercom API's statistics counts.
  • It likewise wouldn't solve the 500-part limit that @everron calls out.

@everron, @bgdanterty - would you be interested in attempting this mitigation and reporting back?

Thanks, @aaronsteers, for bringing this up! It seems likely that there may be an issue with the cursor logic. After doing a full refresh on a separate dataset, we noticed that entire conversations were missing. If it helps with your debugging, the first period when we encountered a spike in missing data was around mid-July, but this could also be related to increased traffic in the tool at the time.

Root Cause Validation:

Is there anything we can do to help validate the root cause (perhaps through logs)?

Short-Term Mitigation:

  1. We tried the full sync, and while it worked perfectly, as you mentioned, it's not a permanent fix since it could happen again.
  2. As time goes on, the full sync will take longer and longer to complete—it took four days in our case.
  3. The 500 parts limit and the known discrepancy on Intercom's side for statistics counts are not issues for us.

Question:

Once the root cause is validated, what would be the ETA for a fix and deployment?

geo909 commented 1 month ago

@aaronsteers Thank you too for looking into this.

Just wanted to add to what my coworker @Alexandros-Gk mentioned, after talking to an Intercom representative:

Just in case it helps to exclude some scenarios while troubleshooting.

everron commented 1 week ago

Hello @aaronsteers, do you have an update on this?

We tried the full sync on a staging environment; even with sampled data it took several days, so we did not consider it for production. We still see 60 to 70% of conversations extracted every day with missing parts.