DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Generate Avro schema from HCA schemas #6270

Open nadove-ucsc opened 4 months ago

nadove-ucsc commented 4 months ago

As with AnVIL, we could create the Avro schema for HCA verbatim PFB manifests using the published entity schemas, e.g. https://schema.humancellatlas.org/type/project/14.0.0/project

hannes-ucsc commented 4 months ago

This would be complicated by the fact that we're indexing a diverse set of schema versions in a single catalog. We would have to compile all versions of all schemas during indexing, aggregate the schemas, and generate the PFB schema from that aggregate. The schema aggregation would have to consider every property and every type of every property. Renamed properties would show up under both their old and their new names, and removed properties would have to be retained. As an invariant, this semi-static process would need to produce the same PFB schema as the current dynamic schema generation does for a manifest without filters.
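
To make the aggregation step described above more concrete, here is a minimal sketch in Python. It is not Azul's implementation. The function names, the type mapping, and the example properties (`shortname`, `short_name`, `legacy_count`, `size`) are all hypothetical, and it assumes that only top-level properties with primitive JSON Schema types matter. Nested objects, arrays, and `$ref` resolution are left out.

```python
from collections import defaultdict

# Assumed mapping from primitive JSON Schema types to Avro types
# (hypothetical; the real PFB mapping may differ).
JSON_TO_AVRO = {
    'string': 'string',
    'integer': 'long',
    'number': 'double',
    'boolean': 'boolean',
}


def aggregate_properties(schema_versions):
    """Union the properties of all schema versions.

    Returns a dict mapping each property name to the set of primitive
    JSON types it was declared with in any version. Properties that
    were renamed or removed in later versions are retained.
    """
    aggregate = defaultdict(set)
    for schema in schema_versions:
        for name, spec in schema.get('properties', {}).items():
            types = spec.get('type', [])
            if isinstance(types, str):
                types = [types]
            # Only primitive types are handled in this sketch
            primitive = [t for t in types if t in JSON_TO_AVRO]
            if primitive:
                aggregate[name].update(primitive)
    return dict(aggregate)


def avro_record(entity_name, aggregate):
    """Build an Avro record schema in which every field is nullable."""
    fields = []
    for name in sorted(aggregate):
        avro_types = sorted({JSON_TO_AVRO[t] for t in aggregate[name]})
        fields.append({
            'name': name,
            'type': ['null', *avro_types],
            'default': None,
        })
    return {'type': 'record', 'name': entity_name, 'fields': fields}


# Two made-up versions of one entity schema: v2 renames `shortname`,
# drops `legacy_count`, and changes the type of `size`.
v1 = {'properties': {'shortname': {'type': 'string'},
                     'legacy_count': {'type': 'integer'},
                     'size': {'type': 'integer'}}}
v2 = {'properties': {'short_name': {'type': 'string'},
                     'size': {'type': 'number'}}}

project_schema = avro_record('project', aggregate_properties([v1, v2]))
```

In the resulting record, the renamed property appears under both names, the removed property is still present, and `size` becomes a union of every type it has had. Making every field nullable is a design choice of this sketch, so that documents from any schema version can be serialized against the single aggregated schema.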