
[EPIC] Validate all output record schemas in the worker before sending to the destination #11279

Closed: sherifnada closed this issue 2 years ago

sherifnada commented 2 years ago

Tell us about the problem you're trying to solve

If a source incorrectly declares its schema (e.g., it says the "ID" column is a number when it is really a string), we only find out when the destination fails upon encountering such a record. This has two problems:

  1. It incurs unnecessary cost on the destination system. For example, if you sync 1 billion records just fine and the 1,000,000,001st record exhibits the malformed schema, you will have paid the bill for 1 billion records only for the job to fail
  2. It makes it difficult to understand where the error came from, which hurts connector health visibility

Potentially related to https://github.com/airbytehq/airbyte-internal-issues/issues/2507

Describe the solution you’d like

I would like the Airbyte worker to validate all record schemas before passing them to the destination. If a record does not match the schema, fail the sync and attribute the failure to the source.
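
A minimal sketch of what this worker-side check could look like, using Python and the `jsonschema` library (the names `SourceSchemaError`, `make_validator`, and `validate_record` are hypothetical, not the actual worker code):

```python
from jsonschema import Draft7Validator


class SourceSchemaError(Exception):
    """Raised so the platform can attribute the failure to the source."""


def make_validator(stream_schema: dict) -> Draft7Validator:
    # Build one validator per stream up front; re-parsing the schema on
    # every record would be wasted work.
    return Draft7Validator(stream_schema)


def validate_record(validator: Draft7Validator, record: dict) -> None:
    # Fail the sync on the first record that mismatches the declared schema.
    errors = list(validator.iter_errors(record))
    if errors:
        raise SourceSchemaError(errors[0].message)


# Example: the source declared "id" as a number but emitted a string.
validator = make_validator(
    {"type": "object", "properties": {"id": {"type": "number"}}}
)
validate_record(validator, {"id": 42})      # passes
validate_record(validator, {"id": "oops"})  # raises SourceSchemaError
```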

Describe the alternative you’ve considered or used

  1. Checkpointing writes more aggressively, so that you don't need to rewrite 1 billion records the next time the sync runs
  2. Validating the schema in the worker, taking note of the mismatch, and surfacing it in error logs, but not actually force-failing the sync (sketched below). The only upside of this approach is when the destination doesn't care about schema (e.g., a schemaless DB like Mongo)
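
A sketch of alternative 2 under the same assumptions (`validate_record_soft` is a hypothetical name): log and count mismatches, but never fail the sync.

```python
import logging

from jsonschema import Draft7Validator

logger = logging.getLogger("schema-validation")


def validate_record_soft(validator: Draft7Validator, record: dict) -> int:
    """Log each schema mismatch but let the sync continue; return the count."""
    count = 0
    for error in validator.iter_errors(record):
        count += 1
        logger.warning("schema mismatch (sync continues): %s", error.message)
    return count
```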

Steps

cgardens commented 2 years ago

@sherifnada what are the cases we want to fail on? Are the lists below right?

fail on these:

don't fail on these:

cgardens commented 2 years ago

@sherifnada is SAT already testing for this? Is our main goal to detect drift, or just to catch bugs?

sherifnada commented 2 years ago

@cgardens SAT makes a best effort to test this. However, in cases where sandbox accounts aren't comprehensive, it's not possible to test this for all fields in all streams. So this is to catch both bugs and drift from the API.

Your above list is correct in that it should only fail if a field is present and doesn’t match its declared type. No fields should be considered required.

If this validation fails the failure should be attributed to the source.
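
In JSON Schema terms, "no fields required" means dropping the `required` keyword before validating, since `type` checks already apply only to fields that are actually present. A hedged sketch of that transformation (the `relax_schema` helper is hypothetical):

```python
from jsonschema import Draft7Validator


def relax_schema(schema):
    """Recursively drop `required` so absent fields never fail validation."""
    if isinstance(schema, dict):
        out = {}
        for key, value in schema.items():
            if key == "required" and isinstance(value, list):
                continue  # a missing field is not an error
            if key == "properties" and isinstance(value, dict):
                # Property names are data, not keywords: recurse into each
                # field's schema without dropping a field that happens to be
                # named "required".
                out[key] = {name: relax_schema(sub) for name, sub in value.items()}
            else:
                out[key] = relax_schema(value)
        return out
    if isinstance(schema, list):
        return [relax_schema(item) for item in schema]
    return schema


# A present-but-mistyped field still fails; an absent field does not.
validator = Draft7Validator(relax_schema({
    "type": "object",
    "required": ["id"],
    "properties": {"id": {"type": "number"}},
}))
assert not list(validator.iter_errors({}))         # "id" absent: OK
assert list(validator.iter_errors({"id": "abc"}))  # "id" wrong type: error
```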

cgardens commented 2 years ago

perfect. thanks!

sherifnada commented 2 years ago

It would also be incredibly helpful for debugging if, upon failure, we log (see the sketch after this list):

  1. The stream name and field name
  2. The value which caused the failure
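
A sketch of that logging, again assuming `jsonschema`: a `ValidationError` carries the offending value in `.instance` and the path to the field in `.absolute_path` (`log_validation_errors` is a hypothetical helper).

```python
import logging

from jsonschema import Draft7Validator

logger = logging.getLogger("schema-validation")


def log_validation_errors(stream_name: str, validator: Draft7Validator,
                          record: dict) -> bool:
    """Return True if the record is valid; otherwise log each mismatch with
    the stream name, the field path, and the value that caused the failure."""
    valid = True
    for error in validator.iter_errors(record):
        valid = False
        field = ".".join(str(p) for p in error.absolute_path) or "<record root>"
        logger.error(
            "stream=%s field=%s offending_value=%r detail=%s",
            stream_name, field, error.instance, error.message,
        )
    return valid
```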

pmossman commented 2 years ago

Grooming notes:

Are there implementation details that need to be spec'd out, as well as performance considerations?

cgardens commented 2 years ago

@sherifnada can you give us context on how much customer pain and on-call time this is causing? For context, we are trying to reason about what performance tradeoffs we are willing to accept in this implementation: the more pain it is causing, the more tolerant we are of a performance hit.
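
For illustration only, one possible way to bound that performance hit (an assumption about the design space, not a decided approach) is to validate every Nth record per stream, trading detection latency for throughput:

```python
import itertools

from jsonschema import Draft7Validator


class SampledValidator:
    """Validate only every `every_n`-th record to cap per-record overhead."""

    def __init__(self, validator: Draft7Validator, every_n: int = 100):
        self._validator = validator
        self._counter = itertools.count()
        self._every_n = every_n

    def check(self, record: dict) -> list:
        if next(self._counter) % self._every_n == 0:
            return list(self._validator.iter_errors(record))
        return []  # skipped for performance; assumed valid
```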

lmossman commented 2 years ago

Note from most recent backlog grooming:

sherifnada commented 2 years ago

@lmossman that sounds like a great path forward

bleonard commented 2 years ago

Example bug I saw: https://github.com/airbytehq/airbyte/issues/9775