airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte

Feat: Align internal Airbyte columns with Dv2; Feat: Auto-add missing columns to Cache tables. #144

Closed aaronsteers closed 3 months ago

aaronsteers commented 3 months ago

Resolves: #14 Resolves: #118

This update does a couple things:

  1. Adds these internal columns to all PyAirbyte datasets and cache tables:
    • _airbyte_emitted_at - The timestamp corresponding to emitted_at in the Airbyte Record message.
    • ~~_airbyte_loaded_at - The timestamp representing utcnow() at the time the record is received by PyAirbyte.~~ - removed per comments below
    • _airbyte_meta - For now, defaults to an empty dictionary.
    • _airbyte_raw_id - A unique ID per record, assigned as each record is received from the source. (Unlike Dv2, there is no raw table, but I kept the same column name for consistency.)
  2. Auto-adds any missing columns to cache tables if they are declared in the schema, including the above-named internal columns. (A rough sketch of this behavior follows below.)
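
For illustration only, here is a minimal sketch of what the column auto-add step could look like with SQLAlchemy. The constant and helper names (`INTERNAL_COLUMNS`, `add_missing_columns`) are hypothetical and chosen to mirror the description above; this is not the actual PyAirbyte implementation. The timestamp column uses the name from the list above (later in this thread it is referred to as `_airbyte_extracted_at`).

```python
import sqlalchemy
from sqlalchemy import types as sqltypes

# Hypothetical constant mirroring the internal columns described in this PR
# (not the actual PyAirbyte source).
INTERNAL_COLUMNS: dict[str, sqltypes.TypeEngine] = {
    "_airbyte_raw_id": sqltypes.String(36),
    "_airbyte_emitted_at": sqltypes.TIMESTAMP(timezone=True),
    "_airbyte_meta": sqltypes.JSON(),
}


def add_missing_columns(
    engine: sqlalchemy.engine.Engine,
    table_name: str,
    declared_columns: dict[str, sqltypes.TypeEngine],
) -> None:
    """Add any declared or internal columns that are missing from a cache table."""
    inspector = sqlalchemy.inspect(engine)
    existing = {col["name"] for col in inspector.get_columns(table_name)}
    wanted = {**declared_columns, **INTERNAL_COLUMNS}
    with engine.begin() as conn:
        for name, col_type in wanted.items():
            if name not in existing:
                rendered_type = col_type.compile(dialect=engine.dialect)
                conn.execute(
                    sqlalchemy.text(
                        f'ALTER TABLE {table_name} ADD COLUMN "{name}" {rendered_type}'
                    )
                )
```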

As of this iteration, PyAirbyte will not...

Tests

aaronsteers commented 3 months ago

/test-pr

Job started... Check job output.

✅ Changes applied successfully.

🟦 Job completed successfully (no changes).

aaronsteers commented 3 months ago

/test-pr

Job started... Check job output.

🟦 Job completed successfully (no changes).

Job started... Check job output.

🟦 Job completed successfully (no changes).

aaronsteers commented 3 months ago

/test-pr

Job started... Check job output.

🟦 Job completed successfully (no changes).

aaronsteers commented 3 months ago

/test-pr

PR test job started... Check job output.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

PR test job started... Check job output.

❌ Tests failed.

aaronsteers commented 3 months ago

/test-pr

PR test job started... Check job output.

❌ Tests failed.

aaronsteers commented 3 months ago

/fix-pr

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

aaronsteers commented 3 months ago

@jbfbell, @evantahler - Cc'ing you here for visibility and to confirm the direction...

This PR gets us closer to compatibility with the Dv2 specs, which I've documented / linked to in the related issue:

A couple callouts/questions:

  1. Should we just forget about _airbyte_loaded_at? We don't use a long-running raw table, and Dv2 doesn't include it in its final tables. Initially I wanted to include it, but I don't know that it adds enough value to warrant a split.
    • I'd love to see Dv3 add _airbyte_loaded_at into the final table as well, and (as discussed) consider dropping the long-running raw table, but again, it's probably not a high enough priority, and it's easy for us to just ignore this column for the time being.
  2. We are lower-casing all columns and table names as of now, but according to Notion, I think you are preserving case for all except Snowflake - is that correct? (Ignoring DuckDB, the others we have covered here are BigQuery and Postgres.)

An important caveat:


Update: I just decided to go ahead and drop the loaded-at column, and I've added an implementation of _airbyte_raw_id, which can be used as a unique identifier per row/record.
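
As a rough sketch of the idea (not the actual implementation, which may use a different ID scheme such as ULIDs), the per-record ID can be stamped on each record as it arrives:

```python
import uuid


def make_raw_id() -> str:
    """Illustrative only: generate a unique per-record ID (a UUID4 here)."""
    return str(uuid.uuid4())


# Stamp the ID on a record as it is received from the source.
record = {"id": 1, "name": "example"}
record["_airbyte_raw_id"] = make_raw_id()
```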

evantahler commented 3 months ago

Nice work AJ!

Should we just forget about _airbyte_loaded_at? We don't use a long-running raw table, and Dv2 doesn't include it in its final tables. Initially I wanted to include it, but I don't know that it adds enough value to warrant a split.

We made a funny choice with Dv2 that tried to balance "give the users lots of data about the sync" with "users seem to be mad when we make too many new columns". That's how we ended up showing extracted_at and not loaded_at in the final table. extracted_at is the more logically useful of the two, since the source-side time matters far more for analysis - it's probably what you'll actually use. Users can still get to loaded_at via the join to the raw table, but we figured that was rarer.
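
To make "the join to the raw table" concrete, the lookup looks roughly like the sketch below. The schema and table names are illustrative only; Dv2 naming varies by destination and configuration, and the connection URL is a placeholder.

```python
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@host/db")  # placeholder URL

# Illustrative Dv2-style layout: a typed "final" table plus a raw table in an
# internal schema, joined on _airbyte_raw_id to recover _airbyte_loaded_at.
query = sqlalchemy.text(
    """
    SELECT
        f.*,
        r._airbyte_loaded_at
    FROM public.users AS f
    JOIN airbyte_internal.public_raw__stream_users AS r
        ON r._airbyte_raw_id = f._airbyte_raw_id
    """
)

with engine.connect() as conn:
    rows = conn.execute(query).fetchall()
```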

I'd love to see Dv3 add _airbyte_loaded_at into the final table as well, and (as discussed) consider dropping the long-running raw table, but again, it's probably not a high enough priority, and it's easy for us to just ignore this column for the time being.

So the long-running raw tables are getting even more important now for the refresh work, especially as we are merging records across generations and powering truncating refreshes without data downtime. I think they are with us for the long haul now. If none of those words made sense (because we made them up recently), check out this draft of the updated user-facing doc and join #1-point-0-refreshes.

We are lower-casing all columns and table names as of now, but according to Notion, I think you are preserving case for all except Snowflake - is that correct? (Ignoring DuckDB, the others we have covered here are BigQuery and Postgres.)

Case-sensitivity is kind of per-destination now. I'll let @jbfbell take that one, as I don't have the latest info anymore.

aaronsteers commented 3 months ago

@evantahler - Thanks for the review and for this context.

I'll drop _airbyte_loaded_at from scope.

This PR doesn't touch capitalization rules, so I think we're OK making any needed changes to those in a subsequent pass.

@jbfbell - Let me know if you have any other questions/concerns here. I think we're good to go with the addition of _airbyte_extracted_at, _airbyte_meta (always {} for now) and _airbyte_raw_id.
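
For anyone following along, a quick way to eyeball the final column set, assuming the documented get_source / read API (the source-faker config keys shown here may differ):

```python
import airbyte as ab

source = ab.get_source(
    "source-faker",
    config={"count": 100},  # illustrative config; check the connector docs
    install_if_missing=True,
)
source.select_all_streams()
result = source.read()

df = result["users"].to_pandas()
print([col for col in df.columns if col.startswith("_airbyte_")])
# Expected per this PR: _airbyte_raw_id, _airbyte_extracted_at, _airbyte_meta
```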

This PR doesn't dive into what to do with other special characters, but if you can point me at source code or a Notion/Docs page, I'll take on any other name normalization rules in a subsequent PR.

Thanks, both!

aaronsteers commented 3 months ago

/fix-pr

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

aaronsteers commented 3 months ago

@jbfbell and @bindipankhudi - If one or both of you are available to review/approve tomorrow, I'll go ahead and merge when ready.

All tests are passing now. ✅ 😄