Open nadove-ucsc opened 1 month ago
Currently, AnVIL replica bundles aren't recorded anywhere in Elasticsearch. There are no contributions because replica bundles don't emit any, and there are no replicas because we don't emit replicas for AnVIL bundles (since they're synthetic), and other replicas don't include any information on which bundle(s) they originated from.
During the most recent reindex for anvilprod, replica bundles accounted for 12.5% of the total number of bundles:
**CloudWatch Logs Insights**
region: us-east-1
log-group-names: /aws/lambda/azul-indexer-anvilprod-contribute
start-time: 2024-11-12T08:00:00.000Z
end-time: 2024-11-17T07:59:59.000Z
query-string:
```
fields @message
| parse @message ' is a * bundle' as bundle_type
| stats count(*) by bundle_type
```
---
| bundle_type | count(*) |
| --- | --- |
| | 52963921 |
| replica | 50022 |
| supplementary | 21671 |
| primary | 327020 |
| DUOS | 256 |
---
The current integration test already covers the possibility of indexing failures for replica bundles because it will fail if there are any messages in the fail queue. The current test is also not effective at testing whether the indexing was truly "complete", as it has no means of verifying that all expected non-bundle entities are present. So, as currently written, I'd argue that the omission of replica bundles from `_assert_catalog_complete` does not represent a meaningful lack of coverage.
The easiest way to include replica bundles in the `catalog_complete` subtest would be to just emit contributions for them. These bundles would have no inner entities besides the dataset and, ideally, we would suppress the dataset as well. This would result in a 14.3% increase in the number of bundle contributions and aggregates, and it would violate our current tenet that replica bundles never emit contributions. It would have no benefit other than facilitating this change to the IT.
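Both percentages quoted so far can be checked against the table above. The snippet below is only a quick arithmetic check. It assumes that the row with an empty `bundle_type` consists of log lines that did not match the `parse` pattern, so it is excluded, and that the 14.3% figure is measured relative to the non-replica bundles that emit contributions today.

```python
# Counts copied from the CloudWatch Logs Insights result above. The row with
# an empty bundle_type is excluded (assumed to be unrelated log lines).
bundle_counts = {
    'replica': 50022,
    'supplementary': 21671,
    'primary': 327020,
    'DUOS': 256,
}

total = sum(bundle_counts.values())
replica = bundle_counts['replica']
non_replica = total - replica

# Replica bundles as a share of all bundles
share_of_all_bundles = replica / total
# Relative growth in bundle contributions if replica bundles emitted them too
increase_over_non_replica = replica / non_replica

print(f'{share_of_all_bundles:.1%}')       # 12.5%
print(f'{increase_over_non_replica:.1%}')  # 14.3%
```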
If we continue to adhere to the tenet that replica bundles never emit contributions, then they'll need to record their existence via replicas instead. We could add a `bundle_fqid` field to replicas, but that would add a lot of complexity because replicas can be emitted by multiple bundles, and we'd need to resolve conflicts via a scripted update like we currently do for hub IDs. Note that every replica is emitted by exactly one replica bundle and zero or more non-replica bundles.
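To illustrate why a `bundle_fqid` field would need conflict resolution, here is a minimal sketch in plain Python of the merge semantics such an update would have to implement. It is not Azul's actual code: `Replica`, `bundle_fqids` and `merge_replica` are hypothetical names, the dict stands in for the replica index, and the real indexer would presumably do this inside an Elasticsearch scripted update rather than in Python.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # Hypothetical, simplified stand-in for a replica document
    replica_id: str
    content: dict
    bundle_fqids: set[str] = field(default_factory=set)


def merge_replica(index: dict[str, Replica], incoming: Replica) -> None:
    """
    Insert `incoming` or, if a replica with the same ID already exists,
    union the bundle FQIDs instead of overwriting the document.
    """
    existing = index.get(incoming.replica_id)
    if existing is None:
        index[incoming.replica_id] = incoming
    else:
        existing.bundle_fqids |= incoming.bundle_fqids


index: dict[str, Replica] = {}
# The same replica, emitted once by a replica bundle and once by a primary bundle
merge_replica(index, Replica('file-1', {'name': 'a.bam'}, {'replica-bundle-1'}))
merge_replica(index, Replica('file-1', {'name': 'a.bam'}, {'primary-bundle-7'}))
print(sorted(index['file-1'].bundle_fqids))
```

A plain overwrite would keep only the last writer's FQID, which is why the union (and hence a scripted update) would be needed.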
Perhaps a better idea would be to emit special "stub" replicas for bundles, with no content or hub IDs, that would only be read during the IT. The changes to the indexer would be smaller and more localized in this case, and I would expect the performance impact to be smaller as well. But these replicas would serve no purpose once the IT is finished.
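A rough sketch of what the stub approach could look like, again in plain Python and under stated assumptions: the document shape, the `bundle_stub` replica type, and the helper names `make_stub_replica` and `missing_bundles` are all hypothetical and only meant to show how the IT could compare expected bundles against the stubs it finds.

```python
def make_stub_replica(bundle_fqid: str) -> dict:
    # Hypothetical document shape: no content and no hub IDs, just enough
    # to record that the bundle was indexed
    return {
        'replica_type': 'bundle_stub',
        'entity_id': bundle_fqid,
        'contents': {},
        'hub_ids': [],
    }


def missing_bundles(expected_fqids: set[str], replicas: list[dict]) -> set[str]:
    # Bundles the IT expected but for which no stub replica was found
    indexed = {
        replica['entity_id']
        for replica in replicas
        if replica['replica_type'] == 'bundle_stub'
    }
    return expected_fqids - indexed


expected = {'bundle-a', 'bundle-b', 'bundle-c'}
replicas = [
    make_stub_replica('bundle-a'),
    make_stub_replica('bundle-c'),
    # An ordinary replica, ignored by the completeness check
    {'replica_type': 'file', 'entity_id': 'file-1', 'contents': {}, 'hub_ids': ['f1']},
]
print(missing_bundles(expected, replicas))
```

Here the check reports `bundle-b` as missing, and the ordinary file replica does not affect the result.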
Spike for design