Open avirajsingh7 opened 6 days ago
Recent changes introduce Arrow Dataset support to the Airbyte codebase. This involves adding methods to convert datasets into Arrow format in multiple classes, including `BaseCache` and `SQLDataset`, which allow chunk-based processing for efficient data handling. These additions align with the feature request to natively support Apache Arrow, reducing dependence on Pandas conversions.
| File | Change Summary |
|---|---|
| `airbyte/caches/base.py` | Introduced `get_arrow_dataset` method in the `BaseCache` class to return Arrow Datasets by converting Pandas chunks to Arrow Tables. |
| `airbyte/datasets/_base.py` | Added `to_arrow` method to the base class, to be implemented by subclasses for Arrow Dataset representation; raises `NotImplementedError` by default. |
| `airbyte/datasets/_sql.py` | Added `to_arrow` method in the `SQLDataset` class to return an Arrow Dataset using the specified chunk size, enhancing data processing capabilities with `pyarrow`. |
| `pyproject.toml` | Added the `pyarrow` dependency (version `^16.1.0`) to support the new Arrow Dataset functionality. |
| `tests/integration_tests/test_all_cache_types.py` | Updated `test_faker_read()` to include tests for the `to_arrow` method, performing row-count and batch-iteration checks on the Arrow dataset. |
```mermaid
sequenceDiagram
    participant User
    participant BaseCache
    participant SQLDataset
    participant ArrowDataset
    User->>BaseCache: get_arrow_dataset(stream_name, chunksize)
    BaseCache->>SQLDataset: to_arrow(chunksize)
    SQLDataset->>ArrowDataset: create from chunks
    ArrowDataset-->>User: return Arrow Dataset
```
| Objective | Addressed | Explanation |
|---|---|---|
| Add `to_arrow()` on Dataset class (#204) | ✅ | |
> [!TIP]
> **Early access features**
> - OpenAI `gpt-4o` model for reviews and chat.
>
> Note:
> - You can disable early access features from the CodeRabbit UI or by setting `early_access: false` in the CodeRabbit configuration file.
> - Please join our [Discord Community](https://discord.com/invite/GsXnASn26c) to provide feedback and report issues.
> - OSS projects are always opted into early access features.
@coderabbitai review
@avirajsingh7 - Auto-fix applied some lint fixing. And the other fixes you applied looked great. Lmk if this is ready for final review or if you are still applying additional changes.
@aaronsteers it is ready for review
/test-pr
❌ Tests failed.
@avirajsingh7 - Looks like we have a warning here regarding the `numpy` implementation:

https://github.com/airbytehq/PyAirbyte/actions/runs/9726180729/job/26844327030?pr=281

PyAirbyte treats warnings as fatal in our test suite, so we'd need to resolve the condition creating this warning, or else add it to the list of ignored warnings in `pyproject.toml`.

I noticed that NumPy 2.0 just came out, and re-locking dependencies (or adding PyArrow) has bumped us to the 2.0 version in `poetry.lock`. We could try adding a numpy constraint of `"<=2.0"` if that is the source of the issue, but ignoring can also be fine. This SO answer seems to suggest the issue is fairly innocuous, while also giving some tips that might help resolve it.
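The two options mentioned above could look roughly like this in `pyproject.toml`. This is a hedged sketch: the warning filter shown is a generic placeholder, not the actual warning text from the linked CI run.

```toml
# Option 1: pin numpy below 2.0 (under the project's Poetry dependencies)
# [tool.poetry.dependencies]
# numpy = "<2.0"

# Option 2: keep warnings fatal but ignore the one category
# (the filter below is illustrative, not the actual warning from CI)
[tool.pytest.ini_options]
filterwarnings = [
    "error",
    "ignore::DeprecationWarning:numpy.*",
]
```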
@aaronsteers Whatever you recommend, I can go with that. We can try adding the constraint; if that solves it, well and good. Otherwise, since it's innocuous, we can also ignore it. (GPT response)
Resolves #204
As discussed in the issue, this adds support for a PyArrow Dataset (instead of a PyArrow Table) to handle large datasets.
Supporting docs: `pyarrow.dataset`, `pyarrow.Table.from_pandas`
Still need input on the optimal chunksize...
Summary by CodeRabbit

- **New Features**
  - Added support for converting datasets to Arrow format with chunk-based processing (`get_arrow_dataset`, `to_arrow`).
- **Dependencies**
  - Added the `pyarrow` dependency (version `^16.1.0`).