airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

PoC concurrent salesforce #30950

Closed girarda closed 1 year ago

girarda commented 1 year ago

What

The concurrent CDK is being developed with source-stripe as its first connector. Stripe is a good first use case because it is straightforward.

We know we also want to speed up the salesforce connector, which is significantly more complicated.

I hypothesize that the structure of the concurrent CDK will allow the salesforce connector to leverage it, but we should de-risk this with a PoC.

How

We don't need to implement custom partitions for the salesforce connector. It should be enough to use the legacy adapter and wrap the streams with `StreamFacade.create_from_legacy_stream`.
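To illustrate the adapter idea, here is a minimal sketch: a facade wraps each synchronous legacy stream so that a worker pool can read several streams concurrently. All class and function names below are hypothetical stand-ins; the real `StreamFacade.create_from_legacy_stream` in the CDK takes more arguments (source, logger, etc.).

```python
from concurrent.futures import ThreadPoolExecutor

class LegacyStream:
    """Stand-in for a synchronous legacy stream (hypothetical, not the CDK class)."""
    def __init__(self, name, records):
        self.name = name
        self._records = records

    def read_records(self):
        yield from self._records

class StreamFacadeSketch:
    """Illustrative facade: wraps a legacy stream so a worker pool can drive it."""
    def __init__(self, legacy_stream):
        self._legacy = legacy_stream

    @classmethod
    def create_from_legacy_stream(cls, stream):
        # The real CDK factory takes additional arguments; this sketch
        # keeps only the wrapped stream.
        return cls(stream)

    @property
    def name(self):
        return self._legacy.name

    def read(self):
        return list(self._legacy.read_records())

def read_all_concurrently(streams, max_workers=4):
    """Wrap each legacy stream in a facade and read them all in parallel."""
    facades = [StreamFacadeSketch.create_from_legacy_stream(s) for s in streams]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda f: (f.name, f.read()), facades))

print(read_all_concurrently([LegacyStream("Account", [1, 2]), LegacyStream("Asset", [3])]))
# {'Account': [1, 2], 'Asset': [3]}
```

The point of the facade is that the connector's existing stream classes stay untouched; only the orchestration layer changes.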

Acceptance criteria

Either:

girarda commented 1 year ago

grooming notes:

maxi297 commented 1 year ago

PoC Results

https://www.loom.com/share/4d8dd2388e4d4a30a9f912a2a57a16df

Concerns

Notes:

```
tmp% grep " records from " salesforce_before.jsonl
{"type": "LOG", "log": {"level": "INFO", "message": "Read 42 records from Account stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 1210 records from ActiveFeatureLicenseMetric stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 3254 records from ActivePermSetLicenseMetric stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 4536 records from ActiveProfileMetric stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 19 records from AppDefinition stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 25 records from Asset stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 399 records from FormulaFunctionAllowedType stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 1924 records from ObjectPermissions stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 5281 records from PermissionSetTabSetting stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 3 records from LeadHistory stream"}}
```
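The per-stream counts in log lines like these can be tallied with a few lines of Python; the parsing below assumes the exact `Read N records from X stream` message format shown above, and the sample holds just the first two lines for brevity.

```python
import json
import re

# Two of the log lines from the grep output above.
sample_lines = [
    '{"type": "LOG", "log": {"level": "INFO", "message": "Read 42 records from Account stream"}}',
    '{"type": "LOG", "log": {"level": "INFO", "message": "Read 1210 records from ActiveFeatureLicenseMetric stream"}}',
]

def count_records(lines):
    """Map stream name -> record count from Airbyte JSONL log lines."""
    counts = {}
    for line in lines:
        message = json.loads(line)["log"]["message"]
        match = re.match(r"Read (\d+) records from (\S+) stream", message)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

print(count_records(sample_lines))
# {'Account': 42, 'ActiveFeatureLicenseMetric': 1210}
```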


* Streams relying on bulk jobs sleep while polling. We could still end up with all of our workers sleeping, so there would still be room for improvement, but I don't think it should be a focus for us
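The "all workers sleeping" concern can be reproduced with a small simulation: when every worker in a fixed-size pool is blocked in a sleep (standing in for polling a bulk job's status), even a trivial task has to wait. The helper name below is hypothetical, not connector code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def poll_bulk_job(job_id, poll_interval=0.05, polls=3):
    """Hypothetical stand-in for a stream waiting on a Salesforce bulk job:
    the worker thread sleeps between status checks, doing no useful work."""
    for _ in range(polls):
        time.sleep(poll_interval)
    return job_id

# With only 2 workers and 2 sleeping jobs, even a trivial third task
# must wait until one of the sleeping workers frees up.
with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.monotonic()
    slow = [pool.submit(poll_bulk_job, i) for i in range(2)]
    fast = pool.submit(lambda: "fast")
    fast.result()
    elapsed = time.monotonic() - start

print(f"trivial task finished after {elapsed:.2f}s")
```

This is why an async wait (releasing the worker between polls) would help, even if it isn't a priority for the PoC.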
girarda commented 1 year ago

Additional acceptance criteria that came up during grooming today:

maxi297 commented 1 year ago

Note that all the streams that are not queryable seem to be instantiated the same way (see SourceSalesforce.generate_streams). The streams in our test environment cover only 3 of the 4 possible stream types for Salesforce, i.e. IncrementalRestSalesforceStream, BulkSalesforceStream, and BulkIncrementalSalesforceStream. RestSalesforceStream is not covered by either config.json or config_bulk.json, so we may lack some visibility here.
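The coverage gap can be stated as simple set arithmetic over the four stream class names mentioned above:

```python
# The four Salesforce stream types, versus the three exercised by
# config.json / config_bulk.json in our test environment.
all_stream_types = {
    "RestSalesforceStream",
    "IncrementalRestSalesforceStream",
    "BulkSalesforceStream",
    "BulkIncrementalSalesforceStream",
}
covered_in_tests = {
    "IncrementalRestSalesforceStream",
    "BulkSalesforceStream",
    "BulkIncrementalSalesforceStream",
}
print(all_stream_types - covered_in_tests)
# {'RestSalesforceStream'}
```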

As for rate limiting, Salesforce uses a rolling 24-hour window with a base number of allowed requests that depends on whether it is a Developer Edition or a Salesforce Edition. On top of that, Salesforce increases the number of requests for the Salesforce Edition based on how many licenses you have and what type of licenses they are. For example, one "Customer Community Plus" license and one "External Identity 25,000 SKU" together allow you to perform 70,200 requests on top of the 100,000 from the Salesforce Edition. source.
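As a worked example of that arithmetic, using only the figures quoted above (these are the numbers from the comment, not authoritative Salesforce documentation):

```python
# Request allowance per rolling 24-hour window, per the figures quoted above.
base_salesforce_edition = 100_000  # base requests for the Salesforce Edition
license_based_increase = 70_200    # e.g. one "Customer Community Plus" plus
                                   # one "External Identity 25,000 SKU"

total_allowance = base_salesforce_edition + license_based_increase
print(total_allowance)  # 170200
```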

Random thoughts on rate limiting: