Closed aaronsteers closed 3 months ago
/test-pr
Job started... Check job output.
PR test job started... Check job output.
PR auto-fix job started... Check job output.
Changes applied successfully.
Job completed successfully (no changes).
Tests failed.
@jbfbell, @evantahler - Cc'ing you here for visibility and to confirm the direction...
This PR gets us closer to compatibility with the Dv2 specs, which I've documented / linked to in the related issue:
A couple callouts/questions:
- Should we just forget about `_airbyte_loaded_at`? We don't use a long-running raw table and Dv2 doesn't include it in its final tables. Initially I wanted to include, but I don't know that it adds enough value to warrant a split.
- I'd love to see Dv3 adding `_airbyte_loaded_at` into the final table also, and (as discussed) consider dropping the long-running raw table, but again, probably not a high enough priority, and it's easy for us to just ignore this column for the time being.

An important caveat:

Update: I just decided to go ahead and drop the loaded-at column and I've added an implementation of `_airbyte_raw_id`, which can be used as a unique identifier per row/record.
Nice work AJ!
> Should we just forget about `_airbyte_loaded_at`? We don't use a long-running raw table and Dv2 doesn't include it in its final tables. Initially I wanted to include, but I don't know that it adds enough value to warrant a split.
We made a funny choice with Dv2 which tried to balance "give the users lots of data about the sync" with "users seem to be mad when we make too many new columns". That's how we ended up showing `extracted_at` and not `loaded_at` in the final table. `extracted_at` is more logically useful, as the source time matters far more and is probably what you are going to use in your analysis. Users can still get to `loaded_at` via the join to the raw table, but we figured that was rarer.
> I'd love to see Dv3 adding `_airbyte_loaded_at` into the final table also, and (as discussed) consider dropping the long-running raw table, but again, probably not a high enough priority, and it's easy for us to just ignore this column for the time being.
So the long-running raw tables are getting even more important now for the refresh work, especially as we are merging records across generations, and to power truncating refreshes without data downtime. I think they are with us for the long haul now. If none of those words made sense (because we made them up recently), check out this draft of the updated user-facing doc and join #1-point-0-refrehes.
> We are lower-casing all columns and table names as of now, but according to Notion, I think you are preserving case for all except Snowflake - is that correct? (Ignoring DuckDB, the others we have covered here are BigQuery and Postgres.)
Case-sensitivity is kind of per-destination now. I'll let @jbfbell take that one, as I don't have the latest info any more.
@evantahler - Thanks for the review and for this context.
I'll drop `_airbyte_loaded_at` from scope.
This PR doesn't touch capitalization rules so I think we're ok taking any needed changes to those in a subsequent pass.
@jbfbell - Let me know if you have any other questions/concerns here. I think we're good to go with the addition of `_airbyte_extracted_at`, `_airbyte_meta` (always `{}` for now) and `_airbyte_raw_id`.
This PR doesn't dive into what to do with other special characters, but if you can point me at source code or a Notion/Docs page, I'll take on any other name normalization rules in a subsequent PR.
Thanks, both!
@jbfbell and @bindipankhudi - If one or both of you is available to review/approve tomorrow, I'll go ahead and merge when ready.
All tests are passing now.
Resolves: #14 Resolves: #118
This update does a couple things:
- `_airbyte_emitted_at` - The timestamp corresponding to `emitted_at` in the Airbyte Record message.
- ~~`_airbyte_loaded_at` - The timestamp representing `utcnow()` at the time the record is received by PyAirbyte.~~ (Removed per comments below.)
- `_airbyte_meta` - For now, defaults to an empty dictionary.
- `_airbyte_raw_id` - A unique ID per record, assigned as it is received from source. (Unlike Dv2, there is no raw table but I kept the same column name for consistency.)

As of this iteration, PyAirbyte will not...
- ~~...will not attempt to align `_airbyte_loaded_at` with the batch load time. As noted above, the time is when we first start processing the record. In the future, we can try to set this during load time, but doing so is fragile and may not be possible in a universal/generic manner.~~ (Removed, so no longer relevant.)
- ...will not populate `_airbyte_meta` with any data.

Tests
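For readers who want a concrete picture of the columns listed above, here is a minimal sketch of how they could be attached to a record and checked. It is not the code from this PR: the helper name `add_airbyte_columns` is hypothetical, the timestamp column uses the `_airbyte_extracted_at` name from the later comments, and `uuid4` is only an assumed stand-in for whatever ID scheme PyAirbyte actually uses. The one fact relied on is that `emitted_at` in an Airbyte Record message is epoch milliseconds.

```python
import uuid
from datetime import datetime, timezone

# Column names taken from the discussion above.
AB_RAW_ID = "_airbyte_raw_id"
AB_EXTRACTED_AT = "_airbyte_extracted_at"
AB_META = "_airbyte_meta"


def add_airbyte_columns(record_data: dict, emitted_at_ms: int) -> dict:
    """Return a copy of record_data with the three internal columns added.

    Hypothetical helper for illustration only; not PyAirbyte's implementation.
    """
    return {
        **record_data,
        # Assumption: a random UUID as the per-record unique ID.
        AB_RAW_ID: str(uuid.uuid4()),
        # emitted_at arrives as epoch milliseconds; store it as a UTC datetime.
        AB_EXTRACTED_AT: datetime.fromtimestamp(
            emitted_at_ms / 1000, tz=timezone.utc
        ),
        # Always an empty dictionary for now, per the description above.
        AB_META: {},
    }


row = add_airbyte_columns({"id": 1, "name": "example"}, 1_700_000_000_000)
assert row[AB_META] == {}
assert row[AB_EXTRACTED_AT].tzinfo is timezone.utc
```

The source fields are copied through unchanged, so a check like this only needs to look at the three added keys.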