airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source Intercom: Support high volume syncs #11595

Open sherifnada opened 2 years ago

sherifnada commented 2 years ago

Tell us about the problem you're trying to solve

A user was trying to sync a high volume instance of Intercom (logs below). The connector spent more than 50 hours syncing data from the contacts stream. This is a bad user experience because it prevents them from using the product and the data quickly. logs-97415.txt

Note that this issue is not only about Intercom; it should also serve as a learning opportunity for how this can be done at the CDK level, as described in airbytehq/airbyte-internal-issues#504

Describe the solution you’d like

I would like us to find a way to speed up intercom syncs coming from high volume instances such as this one. Ideally a sync takes no longer than a couple of hours in 99% of cases.

lazebnyi commented 2 years ago

@sherifnada do you think the performance issue is a source issue?

As I understand it, @alafanechere talks about that here: https://github.com/airbytehq/airbyte/issues/12671#issuecomment-1119632804

sherifnada commented 2 years ago

@lazebnyi this should not block certifications atm

misteryeo commented 2 years ago

Team, let's pick this back up alongside an investigation of: https://github.com/airbytehq/oncall/issues/274. Please reach out to @sherifnada when you dig in to gain access to the impacted workspace.

bazarnov commented 2 years ago

@sherifnada Can I have the creds to this high volume data account to proceed with tests?

marcosmarxm commented 2 years ago

Another complaint in Discourse: https://airbyte7538.zendesk.com/agent/tickets/1459

And from another Intercom issue on GitHub, https://github.com/airbytehq/airbyte/issues/12506, it looks like the contacts stream took 15 hours to finish. In this case that stream holds the majority of the data (about 1 million records):

2022-05-02 00:53:35 source > Read 1002750 records from contacts stream
2022-05-02 00:53:35 source > Finished syncing contacts
2022-05-02 00:53:35 source > SourceIntercom runtimes:
Syncing stream admins 0:00:02.460130
Syncing stream contacts 15:01:43.301241
2022-05-02 00:53:35 source > Syncing stream: tags 
2022-05-02 00:53:37 INFO i.a.w.DefaultReplicationWorker(lambda$getReplicationRunnable$5):301 - Records read: 1004000 (1 GB)
2022-05-02 00:53:37 source > Read 1268 records from tags stream
2022-05-02 00:53:37 source > Finished syncing tags
2022-05-02 00:53:37 source > SourceIntercom runtimes:
Syncing stream admins 0:00:02.460130
Syncing stream contacts 15:01:43.301241
Syncing stream tags 0:00:01.593165
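
For scale, the throughput implied by the log excerpt above can be worked out directly. This is only arithmetic on the two figures printed in the log (records read and contacts runtime); it says nothing about where the time is spent.

```python
from datetime import timedelta

# Figures copied from the log excerpt above.
records = 1_002_750
runtime = timedelta(hours=15, minutes=1, seconds=43.301241)

# Average throughput of the contacts stream over the whole sync.
records_per_second = records / runtime.total_seconds()
print(f"{records_per_second:.1f} records/s")  # prints "18.5 records/s"
```
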
marcosmarxm commented 2 years ago

Zendesk ticket #1459 has been linked to this issue.

marcosmarxm commented 2 years ago

Comment made from Zendesk by Marcos Marx on 2022-07-05 at 12:28:

Hello Alelxis, there is an issue on GitHub, https://github.com/airbytehq/airbyte/issues/11595, about improving Intercom speed. I looked at the code and this stream doesn't have any special handling compared to other streams (companies, tags, segments). In any case I'll get back to you when the issue is resolved.

IzioDev commented 2 years ago

We disabled incremental for the contacts stream, swapped from /contacts/search (POST) to /contacts (GET) and this solves the request throttle.
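
A minimal sketch of what a full-refresh read over GET /contacts could look like. This is not the actual patch described above. It assumes cursor pagination where each response carries the next cursor under `pages.next.starting_after`; `fetch_page` is a hypothetical helper standing in for the HTTP call, replaced here by a fake so the sketch runs offline.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_contacts(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every contact by following the cursor until it runs out.

    fetch_page(cursor) is a hypothetical helper that would perform
    GET /contacts?starting_after=<cursor> and return the parsed JSON body.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("data", [])
        # Assumed response shape:
        # {"data": [...], "pages": {"next": {"starting_after": "..."}}}
        next_page = (page.get("pages") or {}).get("next") or {}
        cursor = next_page.get("starting_after")
        if not cursor:
            break


# Stand-in for the HTTP call (made-up data, for illustration only).
_FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "pages": {"next": {"starting_after": "abc"}}},
    "abc": {"data": [{"id": "3"}], "pages": {}},
}

contacts = list(iter_contacts(lambda cursor: _FAKE_PAGES[cursor]))
```

The trade-off is that every sync re-reads the whole stream, since incremental is disabled.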

bazarnov commented 2 years ago

> We disabled incremental for the contacts stream, swapped from /contacts/search (POST) to /contacts (GET) and this solves the request throttle.

How many records do you have for contacts stream?

IzioDev commented 2 years ago

> > We disabled incremental for the contacts stream, swapped from /contacts/search (POST) to /contacts (GET) and this solves the request throttle.
>
> How many records do you have for contacts stream?

More than 9 GB according to the logs

sherifnada commented 2 years ago

One potentially promising direction here is to use the export functionality of the Intercom API. More information here: https://developers.intercom.com/intercom-api-reference/reference/export-job-model

mrhallak commented 2 years ago

@sherifnada We are currently facing this issue with company_segments taking at least 12 hours

bazarnov commented 1 year ago

@sherifnada The link https://developers.intercom.com/intercom-api-reference/reference/export-job-model is not available. Instead, this one works fine: https://developers.intercom.com/intercom-api-reference/reference/the-export-job-model

Export jobs are available only for the Messages stream, and only with the Unstable API version. More context here: https://github.com/airbytehq/airbyte/issues/9188#issuecomment-1422553673

Unfortunately, we cannot use it for all available streams for now.

@mrhallak As for the company_segments stream, it is slow by nature, since it depends on the Companies stream. Neither of them allows filtering records on the API side, so we have to fetch all of the data from both and then filter for the latest records. There is no workaround for this right now.
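
To illustrate why this is slow, here is a minimal sketch of client-side incremental filtering. It is not the connector's actual code. The `updated_at` cursor field and the sample records are assumptions for illustration. Every record still has to be downloaded before the filter can drop it.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_newer_than(
    records: Iterable[Dict[str, Any]],
    cursor_field: str,
    state_value: int,
) -> Iterator[Dict[str, Any]]:
    """Keep only records whose cursor value is newer than the saved state."""
    for record in records:
        # Records without the cursor field are treated as old and skipped.
        if record.get(cursor_field, 0) > state_value:
            yield record


# Made-up records; in a real sync this would be the full API response.
all_segments = [
    {"id": "a", "updated_at": 100},
    {"id": "b", "updated_at": 250},
    {"id": "c", "updated_at": 300},
]

latest = list(filter_newer_than(all_segments, "updated_at", state_value=200))
```
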

The general speed of the connector has already been tuned to its maximum, considering rate limits and the caching strategy. The other option is to make dependent streams call their endpoints in async mode (in theory, of course).
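
A rough sketch of the async idea, using plain asyncio rather than anything from the CDK. The names `fetch_children` and `fake_fetch_segments` are hypothetical, and the fake fetcher stands in for a real HTTP call. The semaphore caps the number of requests in flight, which is one simple way to stay within rate limits.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Record = Dict[str, Any]


async def fetch_children(
    parent_ids: List[str],
    fetch_one: Callable[[str], Awaitable[List[Record]]],
    max_concurrency: int = 5,
) -> List[List[Record]]:
    """Fetch child records for every parent, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(parent_id: str) -> List[Record]:
        async with semaphore:
            return await fetch_one(parent_id)

    # gather() returns results in the same order as parent_ids.
    return await asyncio.gather(*(bounded(pid) for pid in parent_ids))


async def fake_fetch_segments(company_id: str) -> List[Record]:
    """Stand-in for an HTTP request that returns one company's segments."""
    await asyncio.sleep(0.01)
    return [{"company_id": company_id, "segment": "s1"}]


results = asyncio.run(
    fetch_children(["c1", "c2", "c3"], fake_fetch_segments, max_concurrency=2)
)
```

Concurrency only helps while the API's rate limit is not already the bottleneck, so the cap would need tuning against the real limits.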