maver1ck opened this issue 3 months ago
@maver1ck - Thanks for logging this suggestion. 🙏
We're actually launching an update today or tomorrow which will add a couple of new features: the `_airbyte_*` columns (`_airbyte_raw_id`, `_airbyte_extracted_at`, and `_airbyte_meta`). This will bring PyAirbyte closer to the DV2 table conventions. The new docs page here shows a reference for those not-yet-released features:
This is also in the context of all Airbyte destinations currently on their way to the "v2" standard.
What do you think of this path forward? Any concerns?
Decided to expand our docs to better cover this topic:
v0.9.0 is now live 🚀 : https://github.com/airbytehq/PyAirbyte/releases/tag/v0.9.0
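For anyone who wants to check the new metadata columns after upgrading, here is a minimal sketch following the PyAirbyte quickstart pattern; the `source-faker` config and the `users` stream are illustrative, and exactly which `_airbyte_*` columns appear depends on the version you run (v0.9.0+ per the release above):

```python
import airbyte as ab

# Illustrative source: the faker connector with a small record count.
source = ab.get_source(
    "source-faker",
    config={"count": 100},
    install_if_missing=True,
)
source.check()
source.select_all_streams()

# Read into the default (DuckDB) cache and pull one stream as a DataFrame.
result = source.read()
df = result["users"].to_pandas()

# With the DV2-style conventions, rows should carry _airbyte_raw_id,
# _airbyte_extracted_at, and _airbyte_meta alongside the record's own columns.
print([c for c in df.columns if c.startswith("_airbyte_")])
```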
@aaronsteers It may help, but I really want to get access to the `_airbyte_data` column, as we are currently doing normalization ourselves using the raw data.
@maver1ck - I've been thinking about this more, and I'm not sure how we'd implement it. And as noted above, this could be hard to prioritize because destinations are moving to V2 style more broadly across Airbyte.
Can you help me with a few follow-up questions?
And to add a bit more technical detail regarding point 3 of my questions above: we likely wouldn't be able to support deduping/merging if the data were only in the 'raw' column. Merge and update operations generally require the primary key columns to be declared as top-level columns (at least in order to perform well), and similarly for the incremental "cursor" columns: the database can optimize filtering more effectively if those are also defined as top-level columns.
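To make that concrete, here is a minimal sketch using DuckDB (PyAirbyte's default cache engine); the table names, column names, and the `users` shape are hypothetical, and the SQL is only meant to contrast the two layouts, not to show PyAirbyte's actual implementation:

```python
import duckdb

con = duckdb.connect()

# Hypothetical raw-style table: the record lives only in a JSON column.
con.execute("""
    CREATE TABLE users_raw (
        _airbyte_raw_id VARCHAR,
        _airbyte_extracted_at TIMESTAMP,
        _airbyte_data JSON
    )
""")

# Hypothetical typed table: the primary key is a real, top-level column.
con.execute("""
    CREATE TABLE users_final (
        id INTEGER,
        name VARCHAR,
        _airbyte_extracted_at TIMESTAMP
    )
""")

# Dedupe on the typed table: the engine works directly on the `id` column.
con.execute("""
    SELECT * FROM users_final
    QUALIFY row_number() OVER (
        PARTITION BY id ORDER BY _airbyte_extracted_at DESC
    ) = 1
""")

# Dedupe on the raw table: the primary key has to be parsed out of the JSON
# for every row before rows can be compared, so the engine cannot lean on
# column statistics, indexes, or partition pruning for that key.
con.execute("""
    SELECT * FROM users_raw
    QUALIFY row_number() OVER (
        PARTITION BY json_extract_string(_airbyte_data, '$.id')
        ORDER BY _airbyte_extracted_at DESC
    ) = 1
""")
```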
Hi @aaronsteers
We have multiple reasons for the all-data-in-one-column approach:
1) Compatibility. Changing the source schema doesn't require changing the database schema.
2) Datatypes. We're using dbt to map the JSON onto a proper table, and we can set up proper MySQL types (for example `varchar(255)` instead of `text`); see the sketch after this list.
3) We're using append-only. Deduplication is done by dbt.
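A minimal illustration of that kind of mapping (this is not the author's actual dbt model; the table, fields, and types are made up, and the SQL uses DuckDB syntax just to keep the sketch self-contained and runnable, whereas a real model targeting MySQL would use its own JSON_EXTRACT/CAST equivalents):

```python
import duckdb

con = duckdb.connect()

# Hypothetical raw table: every record arrives as one JSON blob, so new
# source fields never require an ALTER TABLE here.
con.execute("""
    CREATE TABLE orders_raw (
        _airbyte_raw_id VARCHAR,
        _airbyte_extracted_at TIMESTAMP,
        _airbyte_data JSON
    )
""")
con.execute("""
    INSERT INTO orders_raw VALUES
    ('r1', TIMESTAMP '2024-01-01 00:00:00',
     '{"id": 1, "customer": "acme", "amount": "19.99"}')
""")

# Roughly what a downstream dbt model does: pick fields out of the JSON
# and cast them to explicit, storage-friendly types.
typed = con.execute("""
    SELECT
        CAST(json_extract_string(_airbyte_data, '$.id') AS INTEGER)           AS id,
        CAST(json_extract_string(_airbyte_data, '$.customer') AS VARCHAR)     AS customer,
        CAST(json_extract_string(_airbyte_data, '$.amount') AS DECIMAL(10,2)) AS amount,
        _airbyte_extracted_at
    FROM orders_raw
""").fetchall()
print(typed)
```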
@maver1ck - Which destination connector(s) are you using today in Airbyte Cloud/OSS? And do you have the ability to run Docker in your Python runtimes?
We have a potential path forward, which would be to add support for destinations:
Combined with a community contribution WIP:
This came up in Slack in a conversation with contributor @SebastienN15:
And I raised in that Slack thread some inherent limitations of the raw JSON format:
There are some other data engineering challenges that this raises:
1. Primary keys and cursor keys aren't indexable or optimized for querying and partition elimination.
2. Because of #1, deduping (merge upserts) is impossible, or at least less efficient.
3. Compression is worse because columns aren't typed, leading to more storage usage and worse read/write performance overall.
Hi @aaronsteers I'm using MySQL as the destination. I will look into this thread and come back to you.
Hi @aaronsteers I rechecked the V2 destinations, as they are a new addition to the MySQL sink.
From what I understand, there is still the possibility of using the raw data stored in the `airbyte_internal` schema (and we can even use the "Disable Final Tables" option).
So my proposal is to mimic this behaviour in the cache, having both:
Hi, the current format of the cache is in line with the Airbyte normalized v2 format. This has some drawbacks, such as the lack of compatibility between PyAirbyte and Airbyte; schema evolution is also not supported.
Adding support for the Airbyte raw JSON format may fix those issues.
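For illustration, here is a minimal sketch of what a raw-format cache table could look like, mirroring the raw tables that DV2 destinations write to the `airbyte_internal` schema; the exact column set and the DuckDB `CREATE TABLE` below are an assumption based on that convention, not an existing PyAirbyte feature:

```python
import duckdb

con = duckdb.connect()  # in-memory cache, just for the sketch

# Mirror the airbyte_internal schema that Airbyte destinations use.
con.execute("CREATE SCHEMA IF NOT EXISTS airbyte_internal")

# Assumed shape of a DV2-style raw table. The column names follow the
# Airbyte raw-table convention; PyAirbyte does not create this today.
con.execute("""
    CREATE TABLE IF NOT EXISTS airbyte_internal.users_raw (
        _airbyte_raw_id VARCHAR,
        _airbyte_extracted_at TIMESTAMP,
        _airbyte_loaded_at TIMESTAMP,
        _airbyte_data JSON,
        _airbyte_meta JSON
    )
""")

# Because all record fields live inside _airbyte_data, a new field added by
# the source lands here without any ALTER TABLE (schema evolution), and
# downstream tooling (e.g. dbt) decides how to type and flatten it.
con.execute("""
    INSERT INTO airbyte_internal.users_raw VALUES (
        'r1',
        TIMESTAMP '2024-01-01 00:00:00',
        NULL,
        '{"id": 1, "name": "Ada", "new_field": "arrives without a migration"}',
        '{"changes": []}'
    )
""")
print(con.execute("SELECT * FROM airbyte_internal.users_raw").fetchall())
```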