DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Integration test does not assert that replica bundles are indexed #6647

Open nadove-ucsc opened 1 month ago

nadove-ucsc commented 2 weeks ago

Spike for design

nadove-ucsc commented 1 week ago

Currently, AnVIL replica bundles aren't recorded anywhere in Elasticsearch. They leave no contributions, because replica bundles don't emit any. They also leave no replicas of their own, because we don't emit replicas for AnVIL bundles (they're synthetic), and the replicas we do emit carry no information about which bundle(s) they originated from.

nadove-ucsc commented 1 week ago

During the most recent reindex for anvilprod, replica bundles accounted for 12.5% of the total number of bundles:

**CloudWatch Logs Insights**      
region: us-east-1      
log-group-names: /aws/lambda/azul-indexer-anvilprod-contribute      
start-time: 2024-11-12T08:00:00.000Z      
end-time: 2024-11-17T07:59:59.000Z      
query-string:

    fields @message
    | parse @message ' is a * bundle' as bundle_type
    | stats count(*) by bundle_type

---
| bundle_type | count(*) |
| --- | --- |
|  | 52963921 |
| replica | 50022 |
| supplementary | 21671 |
| primary | 327020 |
| DUOS | 256 |
---

nadove-ucsc commented 1 week ago

The current integration test already covers indexing failures for replica bundles, because it fails if there are any messages in the fail queue. The test is also not an effective check of whether indexing was truly "complete", since it has no means of verifying that all expected non-bundle entities are present. So, as currently written, I'd argue that the omission of replica bundles from _assert_catalog_complete does not represent a meaningful lack of coverage.

nadove-ucsc commented 1 week ago

The easiest way to include replica bundles in the catalog_complete subtest would be to just emit contributions for them. These bundles would have no inner entities besides the dataset, and ideally, we would also suppress the dataset. This would result in a 14.3% increase in the number of bundle contributions and aggregates. It would also violate our current tenet that replica bundles never emit contributions, and it would have no benefit other than facilitating this change to the IT.
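
For reference, both the 12.5% and 14.3% figures can be reproduced from the CloudWatch counts above. This is only a sanity check. It assumes that the row with an empty bundle_type holds log lines that didn't match the parse pattern, so it is excluded, and that every non-replica bundle currently yields one bundle contribution.

```python
# Counts copied from the CloudWatch Logs Insights table in this issue.
# Assumption: the row with an empty bundle_type is excluded because those
# log lines did not match the ' is a * bundle' pattern.
bundle_counts = {
    'replica': 50022,
    'supplementary': 21671,
    'primary': 327020,
    'DUOS': 256,
}

total = sum(bundle_counts.values())
replica = bundle_counts['replica']
non_replica = total - replica

# Share of replica bundles among all classified bundles
share_of_total = replica / total

# Relative growth in bundle contributions if replica bundles emitted them,
# assuming one bundle contribution per non-replica bundle today
increase = replica / non_replica

print(f'replica share of all bundles: {share_of_total:.1%}')
print(f'increase in bundle contributions: {increase:.1%}')
```

The two percentages differ only in the denominator: all classified bundles for the first, non-replica bundles for the second.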

nadove-ucsc commented 1 week ago

If we continue to adhere to the tenet that replica bundles never emit contributions, then they'll need to record their existence via replicas instead. We could add a bundle_fqid field to replicas, but that would add a lot of complexity, because replicas can be emitted by multiple bundles and we'd need to resolve conflicts via a scripted update like we currently do for hub IDs. Note that every replica is emitted by exactly one replica bundle and zero or more non-replica bundles.
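
To make the conflict concrete, here is a rough sketch of the merge semantics such a scripted update would need. It is plain Python standing in for the update script, not Azul code: `merge_replica`, the plural `bundle_fqids` field and the document shape are all made up for illustration.

```python
from typing import Any, Optional

Replica = dict[str, Any]


def merge_replica(existing: Optional[Replica], incoming: Replica) -> Replica:
    """
    Hypothetical stand-in for a scripted update: combine the bundle FQIDs of
    an incoming replica with those already stored for the same replica.
    """
    # Start from the stored document if there is one, otherwise from the
    # incoming document
    merged = dict(incoming if existing is None else existing)
    old_fqids = set() if existing is None else set(existing['bundle_fqids'])
    # A set union makes the merge insensitive to write order and to retries
    merged['bundle_fqids'] = sorted(old_fqids | set(incoming['bundle_fqids']))
    return merged


# The same replica emitted by a replica bundle and by a primary bundle
from_replica_bundle = {'entity_id': 'file-1', 'bundle_fqids': ['replica-bundle/v1']}
from_primary_bundle = {'entity_id': 'file-1', 'bundle_fqids': ['primary-bundle/v1']}

one_order = merge_replica(merge_replica(None, from_replica_bundle), from_primary_bundle)
other_order = merge_replica(merge_replica(None, from_primary_bundle), from_replica_bundle)
```

The point of the union is that both write orders, and any redelivered write, converge on the same stored document. That is the property a last-writer-wins field would lack.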

nadove-ucsc commented 1 week ago

Perhaps a better idea would be to emit special "stub" replicas for bundles, with no content or hub IDs, that would only be read during the IT. The changes to the indexer would be smaller and more localized in this case, and I would expect the performance impact to be smaller as well. But these replicas would serve no purpose once the IT is finished.
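
A minimal sketch of how the IT could consume such stubs, assuming a made-up document shape. The `replica_type` value `bundle_stub`, the field names and both helper functions are hypothetical, not what the indexer emits today.

```python
from typing import Any, Iterable

Replica = dict[str, Any]


def stub_replica(bundle_fqid: str) -> Replica:
    """
    Hypothetical stub replica: it records that a bundle was indexed and
    carries no content and no hub IDs.
    """
    return {
        'replica_type': 'bundle_stub',
        'entity_id': bundle_fqid,
        'contents': {},
        'hub_ids': [],
    }


def missing_bundles(expected: Iterable[str],
                    contributing: Iterable[str],
                    replicas: Iterable[Replica]) -> set[str]:
    """
    Return the expected bundle FQIDs that are evidenced neither by a bundle
    contribution nor by a stub replica.
    """
    stubbed = {
        replica['entity_id']
        for replica in replicas
        if replica['replica_type'] == 'bundle_stub'
    }
    return set(expected) - set(contributing) - stubbed


expected = ['primary-1', 'replica-1', 'replica-2']
contributing = ['primary-1']
replicas = [
    stub_replica('replica-1'),
    # An ordinary replica must not count as evidence for any bundle
    {'replica_type': 'file', 'entity_id': 'replica-2', 'contents': {}, 'hub_ids': []},
]

missing = missing_bundles(expected, contributing, replicas)
```

Here `missing` contains only `replica-2`, because the ordinary replica that shares its ID is ignored. The completeness check would stay a set difference like the one above, with stubs covering the bundles that emit no contributions.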