airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Selection of (nested) subset of fields of streams to sync (e.g. avoid PII) #1815

Open ns-admetrics opened 3 years ago

ns-admetrics commented 3 years ago

Apologies if this is a dupe or already possible; if so, feel free to close!

Tell us about the problem you're trying to solve

Some sources have APIs that (by default) include PII in their responses, which one might want to avoid pulling at all, or at least minimize. Either just to keep the data pull minimal (this goes against EL-T, I know I know :sweat_smile:, but PII and legal are different...), or to respect data processor agreements that may be in place.

The API responses can be quite nested, for instance a "customer" object may have an array of "addresses". One might want a numerical "customer ID", but not "email", and the "country" in each address but not "street", etc. etc.

Related but orthogonal: https://github.com/airbytehq/airbyte/issues/1758 (pull sensitive fields, but transform them in-flight, so they do not land on disk -- even though hashed PII can still be PII, data processor agreements can make exceptions for such cases if they are unavoidable).

Describe the solution you’d like

If I understand correctly, this can already be configured e.g. in Singer tap catalogs; the task would be to visualize these settings and sync them.
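For reference, this is roughly how per-field selection looks in a Singer catalog, if I recall the format correctly: each stream carries `metadata` entries whose `breadcrumb` points at a field and whose `selected` flag toggles it. The stream and field names below are made up; shown as a Python literal for brevity.

```python
# Hedged sketch of a Singer catalog with per-field selection; the stream and
# field names are invented, only the breadcrumb/selected structure matters.
catalog = {
    "streams": [
        {
            "tap_stream_id": "customers",
            "stream": "customers",
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "email": {"type": "string"},
                    "addresses": {"type": "array"},
                },
            },
            "metadata": [
                # Stream-level entry: sync this stream at all.
                {"breadcrumb": [], "metadata": {"selected": True}},
                # Field-level entries: keep the numeric id, drop the email.
                {"breadcrumb": ["properties", "id"], "metadata": {"selected": True}},
                {"breadcrumb": ["properties", "email"], "metadata": {"selected": False}},
            ],
        }
    ]
}
```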

GraphQL APIs would likely allow full configuration, while REST APIs may allow selecting which fields to return only in some cases.

Potentially, connectors themselves could classify each field as "is/contains PII" (yes/no); this would allow a global "deselect all PII" tick box.
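To sketch that idea (purely hypothetical: `x_pii` is a made-up annotation, not an existing Airbyte or JSON-schema keyword), a connector could flag fields in its schema and the UI could derive the "deselect all PII" list from it:

```python
# Hypothetical sketch: connectors annotate each field with a made-up "x_pii"
# flag; a UI could then offer a single "deselect all PII" tick box.
stream_schema = {
    "type": "object",
    "properties": {
        "id":    {"type": "integer", "x_pii": False},
        "email": {"type": "string",  "x_pii": True},
        "name":  {"type": "string",  "x_pii": True},
    },
}

# Fields that the "deselect all PII" tick box would exclude from the sync.
pii_fields = [name for name, spec in stream_schema["properties"].items()
              if spec.get("x_pii")]
print(pii_fields)  # -> ['email', 'name']
```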

Your enterprise edition might go beyond selection and add enforcement (so only specific users can change settings).

Describe the alternative you’ve considered or used


ChristopheDuong commented 3 years ago

Yes, this might be a duplicate of https://github.com/airbytehq/airbyte/issues/886, as we don't handle nested streams very well yet.

ns-admetrics commented 3 years ago

I'd disagree this is a dupe. While that ticket is about post-processing (e.g., likely about generically extracting nested data into separate tables), this ticket is about not fetching the unwanted fields in the first place.

That is, not: fetch > normalize > filter, but fetch-subset directly.

It might make sense to focus this ticket on the underlying PII processing issue, as for non-sensitive data, filtering normalized data is likely easier and preferable.

ChristopheDuong commented 3 years ago

Yes, we have multiple issues at the moment; this hasn't been fixed yet because it requires changes in multiple places in Airbyte...

The frontend truncates the nesting in the catalog, so the correct catalog is not persisted / sent to the worker.

The catalog is the metadata fetched by the source; it is what would let the UI configure the fetch-subset you describe.

Once the data is filtered into subsets and sent by the source as a raw JSON blob, which the destination persists in raw tables, we'll also have to adapt the post-processing normalization that extracts it into separate tables.

ChristopheDuong commented 3 years ago

This is also reported in #1315. For the moment, I've been trying to link all these issues to a common issue, #886, to gather the different use cases.

ns-admetrics commented 3 years ago

Right, https://github.com/airbytehq/airbyte/issues/1315 could be seen as a general / parent issue of this one, which would then focus on the PII question.

michel-tricot commented 3 years ago

100% on board with this issue.

The guarantee that we want to offer is that unsafe data NEVER makes it to the destination.

There are two changes that will make it possible:

  1. #1758 (in case you want to keep a pseudo-anonymized version of the value)

  2. Allow the selection of fields in the connection

The transformation piece would likely happen at the worker level, while the selection piece should be at the source level (using the worker as a safety net).

WDYT?

ns-admetrics commented 3 years ago

Yeah. The worker may need to be more than a safety net though, as a source may not let you pull the data of interest without also adding data you want to avoid. As a super-specific example, in the Shopify REST API you can choose which top-level fields to send or not, but nothing deeper. So e.g. to get order.customer.id, you need to pull all of order.customer, which also contains order.customer.email.

In other words, the worker may need to pull more than necessary, and filter before persisting to disk.

Perhaps this is an edgy corner case, if there are on-disk caches :)
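To illustrate the kind of worker-side filtering I mean, here's a minimal sketch (not an existing Airbyte component; the allowed paths just mirror the Shopify example above): pull the full nested record, then keep only an allow-listed set of paths before anything is persisted.

```python
# Minimal sketch of worker-side field filtering, assuming the source cannot
# exclude nested fields itself. Paths and field names are illustrative only.

# Nested paths that are allowed to reach the destination; everything else
# (e.g. ("customer", "email")) is dropped.
ALLOWED_PATHS = {
    ("id",),
    ("customer", "id"),
    ("addresses", "country"),
}


def filter_record(record: dict, allowed: set, prefix: tuple = ()) -> dict:
    """Keep only leaves whose path is in `allowed`; lists are filtered element-wise."""
    out = {}
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            nested = filter_record(value, allowed, path)
            if nested:
                out[key] = nested
        elif isinstance(value, list):
            kept = []
            for item in value:
                if isinstance(item, dict):
                    nested = filter_record(item, allowed, path)
                    if nested:
                        kept.append(nested)
                elif path in allowed:
                    kept.append(item)
            if kept:
                out[key] = kept
        elif path in allowed:
            out[key] = value
    return out


raw = {
    "id": 42,
    "customer": {"id": 7, "email": "person@example.com"},
    "addresses": [{"country": "NL", "street": "Main St 1"}],
}
print(filter_record(raw, ALLOWED_PATHS))
# -> {'id': 42, 'customer': {'id': 7}, 'addresses': [{'country': 'NL'}]}
```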

misteryeo commented 2 years ago

Issue was linked to Harvestr Discovery: Hashing PII fields