airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Source CockroachDB: fails at >10M records #13421

Open eembees opened 2 years ago

eembees commented 2 years ago

Environment

Current Behavior

When importing an 18.1 GB (57M-record) table from CRDB using the Cockroach connector, the sync job fails. Before failing, the source fills all available RAM until the container shuts down (presumably an OOM).

This does not happen on smaller tables with about 1M records, with the same source and destination:

[Screenshot 2022-06-02 at 15 24 02]

Expected Behavior

Airbyte should

  1. not throw an OOM when transferring the data
  2. allow for manual setting of batch size if dynamic batch size fails
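To illustrate why point 2 above would bound memory: a manual batch size of N turns a single unbounded read into ceil(totalRows / N) bounded queries, each holding at most N rows in memory at once. A minimal sketch of that arithmetic (class and method names are made up for illustration, not Airbyte code):

```java
// Sketch: how a manual batch size bounds the number of rows held in memory.
// With batches, the 57M-row table becomes many small queries instead of one
// giant result set. All names here are hypothetical.
public final class BatchPlan {

  // Number of bounded queries needed to cover totalRows at batchSize rows each
  // (ceiling division).
  public static long chunksNeeded(long totalRows, long batchSize) {
    return (totalRows + batchSize - 1) / batchSize;
  }

  public static void main(String[] args) {
    // The failing table is ~57M rows; a 10k batch size would mean 5,700
    // round trips, but never more than 10k rows buffered at a time.
    System.out.println(chunksNeeded(57_000_000L, 10_000L));
  }
}
```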

Logs


logs-5.txt

Steps to Reproduce

  1. Spin up Airbyte, Postgres, and CRDB (open-source containers) as described in the respective docs
  2. Populate CRDB
  3. Connect the CRDB cluster to Airbyte
  4. Attempt to transfer a large table (~20 GB or 60M records)
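For step 2, a table of the right scale can be generated synthetically; CockroachDB supports `generate_series`, so the population step can be driven by statements like the ones this sketch prints (table name, columns, and batch boundaries are made up, and the row count here is far smaller than the real 57M):

```java
// Prints SQL that would populate a synthetic CockroachDB table in batches.
// Table name, column names, and batch sizes are hypothetical examples.
public final class PopulateSql {

  // One INSERT covering ids [start, end], with a random payload per row.
  public static String insertBatch(String table, long start, long end) {
    return "INSERT INTO " + table + " (id, payload) "
        + "SELECT g, md5(random()::STRING) FROM generate_series("
        + start + ", " + end + ") AS g;";
  }

  public static void main(String[] args) {
    System.out.println(
        "CREATE TABLE big_table (id INT PRIMARY KEY, payload STRING);");
    // Insert in 1M-row batches so each statement stays manageable.
    for (long start = 1; start <= 3_000_000; start += 1_000_000) {
      System.out.println(insertBatch("big_table", start, start + 999_999));
    }
  }
}
```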

Are you willing to submit a PR?

I have no experience with Java, but I can work with any Python / Go code.

marcosmarxm commented 2 years ago

Discussion in Discourse here: https://discuss.airbyte.io/t/cockroachdb-source-connector-failures-at-10m-records/1261

eembees commented 11 months ago

I have found this issue, which seems to have the same problem: https://github.com/airbytehq/airbyte/issues/25276. I expect both issues could be fixed by adding a LIMIT clause, as suggested by @lukeasrodgers here: https://github.com/airbytehq/airbyte/blob/f23b3ad88ed56c605ea1ccad857e287b7a38dbe0/airbyte-integrations/connectors/source-jdbc/src/main/java/io/airbyte/integrations/source/jdbc/AbstractJdbcSource.java#L113
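The linked line builds the full-table SELECT for a sync; the LIMIT suggestion would amount to something like keyset pagination, where the connector re-issues a bounded query after each chunk, starting from the last-seen cursor value. The sketch below is only illustrative (method names and the cursor column are hypothetical, not the actual Airbyte patch), and assumes an ordered, unique cursor column such as a primary key:

```java
// Sketch of keyset pagination for a JDBC source: each query reads at most
// batchSize rows past the last-seen cursor value, so memory stays bounded
// regardless of total table size. All names are hypothetical.
public final class ChunkedQueryBuilder {

  // Builds one bounded query; lastCursor == null means "first chunk".
  public static String buildChunkQuery(String table, String cursorColumn,
                                       Long lastCursor, int batchSize) {
    String where = (lastCursor == null)
        ? ""
        : " WHERE " + cursorColumn + " > " + lastCursor;
    return "SELECT * FROM " + table + where
        + " ORDER BY " + cursorColumn
        + " LIMIT " + batchSize;
  }

  public static void main(String[] args) {
    // First chunk, then the follow-up chunk after cursor value 10000.
    System.out.println(buildChunkQuery("big_table", "id", null, 10000));
    System.out.println(buildChunkQuery("big_table", "id", 10000L, 10000));
  }
}
```

The driver would loop: run the query, record the cursor value of the last row, and repeat with that value until a chunk comes back smaller than batchSize.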

tomerpeled commented 3 months ago

Any updates on this issue?