
[EPIC] Validate all output record schemas in the worker before sending to the destination #11279

Closed: sherifnada closed this issue 2 years ago

sherifnada commented 2 years ago

Tell us about the problem you're trying to solve

If a source incorrectly declares its schema (e.g., it says the "ID" column is a number when it is really a string), we only find out when the destination fails upon encountering such a record. This has two problems:

  1. It incurs unnecessary cost on the destination system. For example, if you sync 1 billion records just fine and the 1,000,000,001st record exhibits the malformed schema, you will have paid the bill for 1 billion records only for the job to fail
  2. It makes it difficult to understand where the error came from, which hurts connector health visibility

Potentially related to https://github.com/airbytehq/airbyte-internal-issues/issues/2507

Describe the solution you’d like

I would like the Airbyte worker to validate all record schemas before passing them to the destination. If a record does not match the schema, fail the sync and attribute the failure to the source.
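
A minimal sketch of what this worker-side check could look like, using Python and the `jsonschema` library (the names `SourceSchemaError`, `make_validator`, and `validate_record` are hypothetical, not the actual worker code):

```python
from jsonschema import Draft7Validator


class SourceSchemaError(Exception):
    """Raised so the platform can attribute the failure to the source."""


def make_validator(stream_schema: dict) -> Draft7Validator:
    # Build one validator per stream up front; re-parsing the schema on
    # every record would be wasted work.
    return Draft7Validator(stream_schema)


def validate_record(validator: Draft7Validator, record: dict) -> None:
    # Fail the sync on the first record that mismatches the declared schema.
    errors = list(validator.iter_errors(record))
    if errors:
        raise SourceSchemaError(errors[0].message)


# Example: the source declared "id" as a number but emitted a string.
validator = make_validator(
    {"type": "object", "properties": {"id": {"type": "number"}}}
)
validate_record(validator, {"id": 42})      # passes
validate_record(validator, {"id": "oops"})  # raises SourceSchemaError
```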

Describe the alternative you’ve considered or used

  1. Checkpointing writes more aggressively, so that you don't need to rewrite 1 billion records the next time the sync runs
  2. Validating the schema in the worker, taking note of the mismatch, and surfacing it in error logs, but not actually force-failing the sync (sketched below). The only upside of this approach is when the destination doesn't care about schema (e.g., a schemaless DB like Mongo)
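
A sketch of alternative 2 under the same assumptions (`validate_record_soft` is a hypothetical name): log and count mismatches, but never fail the sync.

```python
import logging

from jsonschema import Draft7Validator

logger = logging.getLogger("schema-validation")


def validate_record_soft(validator: Draft7Validator, record: dict) -> int:
    """Log each schema mismatch but let the sync continue; return the count."""
    count = 0
    for error in validator.iter_errors(record):
        count += 1
        logger.warning("schema mismatch (sync continues): %s", error.message)
    return count
```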

Steps

cgardens commented 2 years ago

@sherifnada what are the cases we want to fail on? Are the lists below right?

fail on these:

don't fail on these:

cgardens commented 2 years ago

@sherifnada is SAT already testing for this? Is our main goal to detect drift, or just to catch bugs?

sherifnada commented 2 years ago

@cgardens SAT makes a best effort to test this. However, in cases where sandbox accounts aren't comprehensive, it's not possible to test this for all fields in all streams. So this is to catch both bugs and drift from the API.

Your above list is correct in that it should only fail if a field is present and doesn’t match its declared type. No fields should be considered required.

If this validation fails the failure should be attributed to the source.
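
In JSON Schema terms, "no fields required" means dropping the `required` keyword before validating, since `type` checks already apply only to fields that are actually present. A hedged sketch of that transformation (the `relax_schema` helper is hypothetical):

```python
from jsonschema import Draft7Validator


def relax_schema(schema):
    """Recursively drop `required` so absent fields never fail validation."""
    if isinstance(schema, dict):
        out = {}
        for key, value in schema.items():
            if key == "required" and isinstance(value, list):
                continue  # a missing field is not an error
            if key == "properties" and isinstance(value, dict):
                # Property names are data, not keywords: recurse into each
                # field's schema without dropping a field that happens to be
                # named "required".
                out[key] = {name: relax_schema(sub) for name, sub in value.items()}
            else:
                out[key] = relax_schema(value)
        return out
    if isinstance(schema, list):
        return [relax_schema(item) for item in schema]
    return schema


# A present-but-mistyped field still fails; an absent field does not.
validator = Draft7Validator(relax_schema({
    "type": "object",
    "required": ["id"],
    "properties": {"id": {"type": "number"}},
}))
assert not list(validator.iter_errors({}))         # "id" absent: OK
assert list(validator.iter_errors({"id": "abc"}))  # "id" wrong type: error
```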

cgardens commented 2 years ago

perfect. thanks!

sherifnada commented 2 years ago

It would also be incredibly helpful for debugging if, upon failure, we log (see the sketch after this list):

  1. The stream name and field name
  2. The value which caused the failure
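
A sketch of that logging, again assuming `jsonschema`: a `ValidationError` carries the offending value in `.instance` and the path to the field in `.absolute_path` (`log_validation_errors` is a hypothetical helper).

```python
import logging

from jsonschema import Draft7Validator

logger = logging.getLogger("schema-validation")


def log_validation_errors(stream_name: str, validator: Draft7Validator,
                          record: dict) -> bool:
    """Return True if the record is valid; otherwise log each mismatch with
    the stream name, the field path, and the value that caused the failure."""
    valid = True
    for error in validator.iter_errors(record):
        valid = False
        field = ".".join(str(p) for p in error.absolute_path) or "<record root>"
        logger.error(
            "stream=%s field=%s offending_value=%r detail=%s",
            stream_name, field, error.instance, error.message,
        )
    return valid
```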

pmossman commented 2 years ago

Grooming notes:

Are there implementation details that need to be spec'd out, as well as performance considerations?

cgardens commented 2 years ago

@sherifnada can you give us context on how much customer pain and on-call time this is causing? For context, we are trying to reason about what performance tradeoffs we are willing to accept in this implementation: the more pain it is causing, the more tolerant we are of a performance hit.
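
For illustration only, one possible way to bound that performance hit (an assumption about the design space, not a decided approach) is to validate every Nth record per stream, trading detection latency for throughput:

```python
import itertools

from jsonschema import Draft7Validator


class SampledValidator:
    """Validate only every `every_n`-th record to cap per-record overhead."""

    def __init__(self, validator: Draft7Validator, every_n: int = 100):
        self._validator = validator
        self._counter = itertools.count()
        self._every_n = every_n

    def check(self, record: dict) -> list:
        if next(self._counter) % self._every_n == 0:
            return list(self._validator.iter_errors(record))
        return []  # skipped for performance; assumed valid
```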

lmossman commented 2 years ago

Note from most recent backlog grooming:

sherifnada commented 2 years ago

@lmossman that sounds like a great path forward

bleonard commented 2 years ago

Example bug I saw: https://github.com/airbytehq/airbyte/issues/9775