Feature Request - make cache compatible with RAW Json Airbyte format

airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.

https://docs.airbyte.com/pyairbyte

Other

178 stars 20 forks source link

Feature Request - make cache compatible with RAW Json Airbyte format #167

Open maver1ck opened 3 months ago

maver1ck commented 3 months ago

Hi, Current format of Cache is in line with Airbyte normalized v2 format. This has some drawback like lack of compatibility between PyAirbyte and Airbyte. Also schema evolution is not supported.

Adding Airbyte raw json format support may fix those issues.

aaronsteers commented 3 months ago

@maver1ck - Thanks for logging this suggestion. 🙏

We're actually launching an update today or tomorrow which will add a couple new features:

Missing columns will be auto-added if they are detected missing. (Very basic schema evolution.)
PyAirbyte will add support for the following 3 _airbyte_* columns: _airbyte_raw_id, _airbyte_extracted_at, and _airbyte_meta. This will make PyAirbyte closer to the Dv2 table conventions.

New docs page here show a reference for those not-yet-released features:

https://airbytehq.github.io/PyAirbyte/airbyte/records.html

This is also in the context of all Airbyte destinations currently on their way to the "v2" standard.

What do you think of this path forward? Any concerns?

aaronsteers commented 3 months ago

Decided to expand our docs to better cover this topic:

https://github.com/airbytehq/PyAirbyte/pull/170

aaronsteers commented 3 months ago

v0.9.0 is now live 🚀 : https://github.com/airbytehq/PyAirbyte/releases/tag/v0.9.0

maver1ck commented 3 months ago

@aaronsteers It may help but I really want to get access to _airbyte_data column as we currently are doing normalization by ourselves using raw data.

aaronsteers commented 3 months ago

@maver1ck - I've been thinking about this more, and I'm not sure how we'd implement it. And as noted above, this could be hard to prioritize because destinations are moving to V2 style more broadly across Airbyte.

Can you help me with a few follow-up questions?

Which destination(s) are you using today and looking to match with?
Do you have a preference for the all-data-in-one-column approach, or is it mostly just for compatibility reasons that you are requesting?
Do you use append-only mode when loading, or do you use deduping as well?

And to add a bit more tech detail regarding my question point 3 above: we likely wouldn't be able to support deduping/merge if data was only in the 'raw' column. Merge and update functions will generally require the Primary key columns to be declared as top-level columns (at least, in order to perform well), and similarly with the incremental "cursor" columns - the database can optimize filtering more effectively if those are also defined as top-level columns.

maver1ck commented 2 months ago

Hi @aaronsteers We have multiple reasons for all-data-in-one-column approach. 1) Compatibility reasons. Changing source schema doesn't require changing db schema. 2) Datatypes. We're using dbt to do json to proper table mapping and we can set up proper MySQL types (for example varchar(255) instead of text. 3) We're using append only. Deduplication is done by dbt.

aaronsteers commented 2 months ago

@maver1ck - Which destination connector(s) are you using today in Airbyte Cloud/OSS? And do you have the ability to run docker in your python runtimes?

We have a potential path forward, which would be to add support for destinations:

https://github.com/airbytehq/PyAirbyte/issues/197

Combined with a community contribution WIP:

aaronsteers commented 2 months ago

This came up in slack in conversation with contributor @SebastienN15 :

https://airbytehq.slack.com/archives/C06FZ238P8W/p1715112018711929?thread_ts=1714742768.802949&cid=C06FZ238P8W

And I raised in that slack thread some inherent limitations with the raw json format:

There are some other data engineering challenges that this raises:

Primary keys and cursor keys aren't indexible or optimized for querying and partition elimination.

Because of #1, deduping (merge upserts) is impossible or least less efficient.

Compression is worse because columns aren't typed, leading to more storage usage and worse read/write performance overall.

maver1ck commented 2 months ago

Hi @aaronsteers I'm using MySQL as destination. I will look into this thread and I will come back to you.

maver1ck commented 2 months ago

Hi @aaronsteers I rechecked V2 destinations as they are new addition to MySQL sink.

From what I understand there is still possibility to use raw data stored in airbyte_internal schema. (and we even can Disable Final Tables.)

So my proposal is to mimic this behaviour in cache. Having both:

raw tables
final tables

airbytehq / PyAirbyte

Feature Request - make cache compatible with RAW Json Airbyte format #167

209