DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

PFB schema is inconsistent for non-schema AnVIL tables #6678

Open nadove-ucsc opened 4 days ago

nadove-ucsc commented 4 days ago

We derive the PFB schema from the AnVIL schema where possible. For non-schema tables, we fall back to our old approach of building the schema dynamically from the observed shape of the replicas' contents. These dynamic schemas may differ from one manifest to the next if the replicas in one manifest exhibit shapes not observed in the other.

Since the replica shapes are constrained by the BigQuery table schema, I expect that the only place this will be observable is with nullable columns. A BigQuery column with type NULLABLE STRING may manifest in the PFB schema with the type null, [null, string], or string, depending on whether all, some, or none of the values for that column, respectively, are NULL within a given PFB manifest.
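
To make the inconsistency concrete, here is a minimal sketch of shape-based inference for a single NULLABLE STRING column. It is not Azul's actual code: `infer_avro_type` and `declared_avro_type` are hypothetical names, and the sketch assumes each manifest contributes at least one value for the column. The second function is included only for contrast, to show that a type taken from the declared column mode would not depend on which replicas a manifest happens to contain.

```python
from typing import Any, Iterable, List, Union

AvroType = Union[str, List[str]]


def infer_avro_type(values: Iterable[Any]) -> AvroType:
    """
    Hypothetical shape-based inference: derive the Avro type of a string
    column purely from the values observed in one manifest.
    Assumes `values` is non-empty.
    """
    seen = {'null' if value is None else 'string' for value in values}
    if seen == {'null'}:
        return 'null'  # every observed value was NULL
    elif seen == {'string'}:
        return 'string'  # no observed value was NULL
    else:
        return ['null', 'string']  # a mix of NULL and non-NULL values


def declared_avro_type(mode: str) -> AvroType:
    """
    Hypothetical contrast: derive the type from the declared BigQuery column
    mode alone, so the result is the same for every manifest.
    """
    return ['null', 'string'] if mode == 'NULLABLE' else 'string'


# Three manifests drawing on the same NULLABLE STRING column
manifest_a = [None, None]
manifest_b = ['x', None]
manifest_c = ['x', 'y']

inferred = [infer_avro_type(m) for m in (manifest_a, manifest_b, manifest_c)]
declared = [declared_avro_type('NULLABLE') for _ in range(3)]
```

Under these assumptions, `inferred` holds three different types for the same column, while `declared` holds the same union type three times.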

nadove-ucsc commented 22 hours ago

[edit, @hannes-ucsc, moved to description]

nadove-ucsc commented 22 hours ago

Related, not a dupe: https://github.com/DataBiosphere/azul/issues/6270