hannes-ucsc opened 2 hours ago
The way the distinction between bulk and non-bulk leaks into the `to_index` method is annoying. The patch above shows that. The raw bulk API is quite clean, but the Python ES client erases that by merging body and parameters using proprietary keyword arguments whose names start with an underscore. The `bulk` parameter to `to_index` should be removed and that method should return the keyword arguments for the client method. Another method should then translate `to_index`'s return value to a bulk action, if and when required.
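A minimal sketch of that split, assuming hypothetical document fields and function names (this is not the actual Azul code): `to_index` returns plain client kwargs, and a separate translation step produces the underscore-prefixed bulk action format that the Python ES client's bulk helpers expect.

```python
# Sketch of the proposed refactoring. Names and document fields below are
# hypothetical placeholders, not the actual Azul implementation.

def to_index(document: dict) -> dict:
    """Return the keyword arguments for a non-bulk client call
    such as Elasticsearch.index(). No bulk flag needed here."""
    return {
        'index': document['coordinates']['index'],
        'id': document['coordinates']['id'],
        'body': document['body'],
    }

def to_bulk_action(document: dict, op_type: str = 'index') -> dict:
    """Translate to_index()'s return value into a bulk action, using the
    underscore-prefixed keys that the Python ES client's bulk helpers
    merge into the action metadata line."""
    kwargs = to_index(document)
    return {
        '_op_type': op_type,
        '_index': kwargs['index'],
        '_id': kwargs['id'],
        '_source': kwargs['body'],
    }
```

With this split, only the code path that actually performs bulk indexing needs to know about the proprietary action format; everything else works with ordinary client keyword arguments.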
I suspect that one partition of a bundle redundantly emits replicas also emitted by all other partitions. That's a suspicion that would need to be verified.
Currently reindexing `dcp43` with this hot-deployed temporary hotfix:
```diff
diff --git a/deployments/prod/environment.py b/deployments/prod/environment.py
--- a/deployments/prod/environment.py (revision 0f4f04cf1d7f6f85a84329f733fdd6df2d8618d3)
+++ b/deployments/prod/environment.py (revision cebc97550db9409849f32456f9c51eef9347d164)
@@ -1280,5 +1280,5 @@
     'channel_id': 'C04JWDFCPFZ'  # #team-boardwalk-prod
 }),
-'AZUL_ENABLE_REPLICAS': '1',
+'AZUL_ENABLE_REPLICAS': '0',
 }
```
Assignee to file PR against `prod` with the above temporary hotfix. The PR should only reindex `lm7` and `lm8`. `dcp43` is currently being reindexed with this hotfix hot-deployed. I hope to get approval for `dcp43` this week so that we can skip reindexing `dcp42` and file the deletion of `dcp42` as a permanent hotfix as well. We would then need to backport three hotfixes: this one, the `lm8` fix (#6444) and the `dcp42` removal.
The solution to #6582 fixed the omission of replicas for several types of low-cardinality, i.e., frequently referenced entities, such as `donor_organism`, `sequencing_protocol` and `library_preparation_protocol`. This led to increased contention on the respective documents in the replica index:

```
[WARNING] 2024-10-23T14:36:17.525Z 6e6b10f5-9e45-5b8b-9b5a-4f6695def641 azul.indexer.index_service There was a conflict with document ReplicaCoordinates(entity=EntityReference(entity_type='sequencing_protocol', entity_id='571cc0c7-4dc2-443b-93f4-0ce4af08cf6d'), content_hash='a3a9f3a538a2a649690aa0974f4d1c070f6fc910'): ConflictError(409, 'version_conflict_engine_exception', {'error': {'root_cause': [{'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}], 'type': 'version_conflict_engine_exception', 'reason': '[sequencing_protocol_571cc0c7-4dc2-443b-93f4-0ce4af08cf6d_a3a9f3a538a2a649690aa0974f4d1c070f6fc910]: version conflict, required seqNo [55840], primary term [1]. current document has seqNo [55843] and primary term [1]', 'index_uuid': 't2PQs1SoTMueXWOCuGtvzQ', 'shard': '16', 'index': 'azul_v2_prod_dcp43_replica'}, 'status': 409}). Total # of errors: 1, retrying.
```
During an attempted reindex of catalog `dcp32`, 2 million replica writes resulted in such a 409 response from ES, compared to one thousand on a previous reindex without the fix for #6582. The retry queue filled up with notifications and overall progress was slow. I tried pushing the retries to ES by setting `retry_on_conflict`, but this did not alleviate the issue much. While there were fewer conflicts, the ES client timeout of 1 min kicked in more often.
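For context, `retry_on_conflict` is attached per bulk *update* action; the sketch below shows the action shape accepted by the Python ES client's bulk helpers. The function name, index name and document body are illustrative, not the actual Azul code:

```python
# A bulk update action in the dict format accepted by elasticsearch.helpers.bulk.
# retry_on_conflict tells ES to retry the optimistic-concurrency update
# server-side instead of immediately returning a 409 version conflict.
# All names and values here are illustrative placeholders.
def replica_upsert_action(index: str, doc_id: str, replica_body: dict,
                          retries: int = 3) -> dict:
    return {
        '_op_type': 'update',
        '_index': index,
        '_id': doc_id,
        'retry_on_conflict': retries,
        # doc_as_upsert: create the document if absent, otherwise apply
        # the partial update to the existing one
        'doc': replica_body,
        'doc_as_upsert': True,
    }
```

Note that each server-side retry re-reads and re-applies the update, so under heavy contention the request can still take long enough to trip the client timeout, which is consistent with the behavior observed above.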