DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Can't generate verbatim manifest without filters in AnVIL #6108

Open achave11-ucsc opened 7 months ago

achave11-ucsc commented 7 months ago

https://github.com/DataBiosphere/azul/issues/6108#issuecomment-2035944695

dsotirho-ucsc commented 6 months ago

Assignee to consider next steps.

hannes-ucsc commented 6 months ago

The generation simply times out.

Our standard approach to this problem is to partition the manifest, as we do for the compact manifest. The verbatim manifest formats complicate this because we have to ensure that each replica is written only once. The AvroPFB verbatim manifest complicates it further because of the forward-only reference constraint that Terra imposes. We are currently working around that constraint by not exposing relations/links in the AvroPFB schema we generate.

The naive approach to ensuring uniqueness of replicas in the generated manifest is to use a set of already emitted replicas. Since each partition in a partitioned manifest runs in a different Lambda invocation and potentially a different execution context, we would have to persist that set between invocations. My original design already describes optimizations to reduce the size of the set (not tracking replicas with just one hub, tracking hubs instead of replicas). A smaller set is obviously faster to read and write.
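To make the naive approach concrete, here is a minimal Python sketch, not Azul's actual code. The names `Replica`, `emit_partition`, `save_seen` and `load_seen` are hypothetical, the JSON string stands in for whatever storage would carry the set between Lambda invocations, and it assumes that a replica with a single hub can only be encountered once and so need not be tracked.

```python
import json
from typing import Iterable, Iterator, NamedTuple, Set, Tuple


class Replica(NamedTuple):
    # Hypothetical stand-in for a replica document
    replica_id: str
    hub_ids: Tuple[str, ...]
    content: dict


def emit_partition(replicas: Iterable[Replica], seen: Set[str]) -> Iterator[Replica]:
    """Yield each replica of one partition at most once across partitions.

    `seen` holds the IDs of multi-hub replicas emitted by earlier partitions
    and is updated in place as this partition emits more.
    """
    for replica in replicas:
        if len(replica.hub_ids) <= 1:
            # Assumption: a single-hub replica is reachable only once,
            # so it is emitted without being added to the set.
            yield replica
        elif replica.replica_id not in seen:
            seen.add(replica.replica_id)
            yield replica


def save_seen(seen: Set[str]) -> str:
    # Serialize the set so the next invocation can pick it up
    return json.dumps(sorted(seen))


def load_seen(state: str) -> Set[str]:
    return set(json.loads(state))
```

Each invocation would call `load_seen` on the state left by the previous partition, run `emit_partition`, then store the result of `save_seen`. The cost of reading and writing that state grows with the number of multi-hub replicas, which is why shrinking the set matters.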

I don't have a good solution at this time. Interestingly, the verbatim JSONL manifest for HCA does not time out (that is without links; we don't know whether adding links would break that). I don't think there is much of a use case for an all-inclusive, unfiltered AvroPFB manifest. The purpose of that manifest is exporting it to Terra. The resulting Terra workspace would be huge for both AnVIL and HCA.