airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[source-klaviyo] Excessive disk use due to request cache #40646

Open Steffen911 opened 4 months ago

Steffen911 commented 4 months ago

Connector Name

source-klaviyo

Connector Version

2.7.2

At what step did the error happen?

During the sync

Relevant information

When running the connector for the events table, we see it filling up an events.sqlite file in the /tmp directory of the source container. After running for approximately 3 days, the file had grown to 41 GB. This causes DiskPressure on our infrastructure nodes.

This appears to be caused by the requests-cache library running with no eviction setting, i.e. all records ever received are stored on disk.

We would like a way to evict the cache at runtime, or an option in the connector configuration to disable the cache entirely.

I tried to delete the events.sqlite file, delete from the table, and clear the cache using the requests-cache library in a python script. In all cases, the import failed with an error due to issues with the database.
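To make the requested eviction behaviour concrete, here is a minimal sketch using only Python's standard library. It is not the connector's code and not requests-cache's implementation: the `TinySqliteCache` class, its `responses` table and the `demo` function are hypothetical names made up for illustration. It shows a SQLite-backed cache whose entries carry an expiry time, and a purge step that deletes expired rows and then runs `VACUUM`, since deleting rows alone generally does not shrink a SQLite file on disk. requests-cache itself documents an `expire_after` setting for sessions; whether the connector could expose it is an open question in this thread.

```python
import os
import sqlite3
import tempfile
import time


class TinySqliteCache:
    """Hypothetical stand-in for an on-disk HTTP response cache."""

    def __init__(self, path, ttl_seconds):
        self.path = path
        self.ttl_seconds = ttl_seconds
        # isolation_level=None selects autocommit mode; VACUUM cannot
        # run inside an open transaction.
        self.conn = sqlite3.connect(path, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "key TEXT PRIMARY KEY, body BLOB, expires_at REAL)"
        )

    def put(self, key, body, now=None):
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)",
            (key, body, now + self.ttl_seconds),
        )

    def get(self, key, now=None):
        # Expired entries are treated as cache misses.
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT body FROM responses WHERE key = ? AND expires_at > ?",
            (key, now),
        ).fetchone()
        return row[0] if row else None

    def evict_expired(self, now=None):
        # Delete expired rows, then VACUUM so the file actually shrinks.
        now = time.time() if now is None else now
        cur = self.conn.execute(
            "DELETE FROM responses WHERE expires_at <= ?", (now,)
        )
        deleted = cur.rowcount
        self.conn.execute("VACUUM")
        return deleted

    def size_bytes(self):
        return os.path.getsize(self.path)


def demo():
    """Fill the cache, let everything expire, purge, and compare file sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = TinySqliteCache(os.path.join(tmp, "events.sqlite"), ttl_seconds=60)
        payload = os.urandom(4096)
        for i in range(500):
            cache.put(f"page-{i}", payload, now=0)
        size_full = cache.size_bytes()
        deleted = cache.evict_expired(now=61)  # 61 s later, all entries are stale
        size_after = cache.size_bytes()
        miss = cache.get("page-0", now=61)
        cache.conn.close()
        return size_full, deleted, size_after, miss


if __name__ == "__main__":
    print(demo())
```

A periodic call to something like `evict_expired` would bound the file size by the amount of data received within one TTL window rather than by the total length of the sync. Running the purge from inside the process that owns the cache also avoids modifying the database file from a separate script while the sync is using it.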

Relevant log output

No response


marcosmarxm commented 4 months ago

Thank you for reporting the issue, @Steffen911. I have added it to the connector backlog for future resolution.

bflammers commented 2 months ago

We have also encountered this issue in connector version 2.7.2 and did a roll-back.

I see the connector is now at version 2.9.1. Has this issue been fixed in the meantime, or is it still present in the latest version?

Steffen911 commented 2 months ago

@lazebnyi I see that https://github.com/airbytehq/airbyte/pull/40608 touches on performance. Do you happen to know whether it also addresses the issue here?

MoralesPablo commented 2 months ago

@bflammers @Steffen911 did version 2.9.1 work for you? Is there a performance improvement?

We run daily data extractions, and they are taking more than 70 hours to fetch the data. It's impossible to work with Klaviyo data when extractions are this slow.

bflammers commented 2 months ago

@MoralesPablo No, unfortunately it's still there in 2.9.4.

To give an indication: incremental syncs (about 40k records daily) used to take 10 minutes. After the upgrade they take 90 (!) minutes.