DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
Some contributions by analysis bundles are redundant #3362

Open hannes-ucsc opened 3 years ago

hannes-ucsc commented 3 years ago

Continued from https://github.com/DataBiosphere/azul/issues/2909#issuecomment-904083511

When indexing an analysis subgraph, only make contributions that are genuinely different from those made when a stitched-on input subgraph is indexed separately.

Since we don't want to read contributions before making any and can't easily constrain the indexing order of analysis subgraphs in relation to input subgraphs, we can't achieve this goal by naively comparing contributions.

Instead, we'll stop contributing inner entities from stitched subgraphs to outer entities from stitched subgraphs. If a contribution ends up empty as a result, we simply skip writing it.

While implementing this, reconsider any restrictions on graph traversal and entity enumeration that are based on the is_stitched property. Generally, it is easier not to restrict traversal/enumeration by is_stitched and to let the above filtering take care of any redundancies.
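
To illustrate the proposed filtering, here is a minimal sketch. It is not Azul's actual indexer code: the `Entity` and `Contribution` classes and the `make_contributions` function are hypothetical stand-ins, and the sketch assumes, for simplicity, that every entity in a subgraph is an inner entity of every other entity in that subgraph. Only `is_stitched` is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    # Hypothetical stand-in for a metadata entity in a subgraph
    entity_id: str
    entity_type: str
    # True if the entity was pulled in from a stitched-on input subgraph
    is_stitched: bool


@dataclass
class Contribution:
    # Hypothetical stand-in: the outer entity receiving the contribution
    # and the inner entities contributed to it
    outer: Entity
    inner: list[Entity] = field(default_factory=list)


def make_contributions(entities: list[Entity]) -> list[Contribution]:
    contributions = []
    # Enumerate every entity as an outer entity, stitched or not;
    # enumeration itself is not restricted by is_stitched.
    for outer in entities:
        inner = [
            e for e in entities
            # Drop only stitched-to-stitched pairs. Those are expected to be
            # contributed when the input subgraph is indexed separately.
            if not (outer.is_stitched and e.is_stitched)
        ]
        # Skip writing a contribution that ended up empty.
        if inner:
            contributions.append(Contribution(outer, inner))
    return contributions
```

In this sketch, a stitched outer entity still receives the analysis subgraph's own entities, and a non-stitched outer entity receives everything, so only the redundant stitched-to-stitched part is left out. The decision needs no reads of existing contributions and does not depend on the order in which subgraphs are indexed.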

hannes-ucsc commented 5 months ago

The need for stitching went away since BI isn't doing large-scale, harmonizing analysis for HCA anymore.