Closed hannes-ucsc closed 6 years ago
Thank you very much for filing this issue, @hannes-ucsc. It would be really appreciated if Blue could prioritize fixing this when you have some bandwidth, since this is a blocker for getting the dcp-integration test to pass on the V5 metadata schema.
I wouldn't call this a blocker. It occurs too infrequently.
@Bento007 this would be good if we could fix this in the next couple of days. I think we should change the indexer to repeatedly poll for the blob to appear until a timeout of, say, 1min occurs, and then fail with a bang (aka 'exception') without indexing anything.
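A minimal sketch of the polling behaviour proposed above, assuming the blob read is wrapped in a callable. `wait_for_blob`, `fetch`, and the local `BlobNotFoundError` class are hypothetical stand-ins for illustration, not the indexer's actual code:

```python
import time


class BlobNotFoundError(Exception):
    """Hypothetical stand-in for the storage error seen in the log message."""


def wait_for_blob(fetch, timeout=60.0, interval=2.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it succeeds or `timeout` seconds have elapsed.

    On timeout the last BlobNotFoundError is re-raised, so the caller
    fails loudly and indexes nothing instead of indexing a partial bundle.
    """
    deadline = clock() + timeout
    while True:
        try:
            return fetch()
        except BlobNotFoundError:
            # Give up if another sleep would take us past the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in a test without actually waiting.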
This is a perfect example of how being liberal in what you expect (proceeding to index with a blob missing) ends up kicking the correctness can down the road: it doesn't fix the issue, it merely makes it harder to diagnose.
Note: besides creating an index document that is inconsistent with storage (once all blobs eventually show up in storage), the current behavior has another negative side-effect: Because of the missing file and the associated schema, the document yields a shape descriptor that lacks the entry for the missing schema and therefore gets indexed into a separate index. It will likely remain the only document in that index until this EC issue happens again for the same file on another bundle.
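To illustrate the side effect, here is a rough sketch that assumes the index name is derived from the set of schema keys present in the document. The functions `shape_descriptor` and `index_name`, the hashing scheme, and the `other_json` key are made up for illustration; the real indexer may compute this differently:

```python
import hashlib


def shape_descriptor(schema_keys):
    """Order-independent description of which schemas a document contains."""
    return ",".join(sorted(schema_keys))


def index_name(schema_keys, prefix="bundles"):
    """Hypothetical mapping from a shape descriptor to an index name."""
    digest = hashlib.sha1(shape_descriptor(schema_keys).encode()).hexdigest()[:8]
    return f"{prefix}-{digest}"


complete = ["biomaterials_json", "other_json"]
missing_one = ["other_json"]  # the blob for biomaterials.json was not found

# The two documents describe the same bundle but land in different indices.
print(index_name(complete) != index_name(missing_one))
```

Under that assumption, a document with a missing file always gets a different descriptor than its complete counterpart, which is why it ends up alone in its own index.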
@kislyuk mentioned that this could also have been caused by an ordering problem in `dss-sync`. If it copies the manifest before the blob, this issue could also occur. cc @ttung
Summarizing an in-person discussion of this:
@kislyuk thinks the long-term fix is to use step-functions to orchestrate sync'ing of entire bundles, thereby enforcing that a bundle's files are sync'ed before the bundle's manifest.
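The ordering constraint itself can be sketched in plain Python, standing in for whatever the step-function state machine would do. `sync_bundle` and the `copy` callable are hypothetical names, not part of `dss-sync`:

```python
def sync_bundle(file_keys, manifest_key, copy):
    """Copy every file of a bundle, and only then the bundle manifest.

    `copy` is expected to raise on failure, so the manifest is never
    copied unless all of the bundle's files were copied first.
    """
    for key in file_keys:
        copy(key)
    copy(manifest_key)
```

With this ordering, a reader that sees the manifest on the destination replica can rely on the referenced blobs already being there, at least as far as the sync itself is concerned.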
In the meantime, the temporary fix is to retry indexing the bundle within the indexer invocation, even though that is limited to a maximum running time of 5min.
@amork proposed using his reaper/retry ~WIP~ feature for the temporary retry fix but @kislyuk feels that that is too complicated for a temporary fix and would like to save the effort for actually fixing this permanently as outlined above.
@Bento007 I think you can go ahead and implement the temporary fix of retrying until the file appears, but limited within the indexer invocation lambda.
Originally reported by @rexwangcc on Slack. When the indexer tried to index the `gcp` copy of a bundle submitted to the `aws` replica earlier, it reported one of the blobs, the one for `biomaterials.json`, missing: https://logs.dev.data.humancellatlas.org/_plugin/kibana/app/kibana#/doc/*/cwl-2018-04-05/fromFirehose?id=49583086244926168104730357671478770103613383151955476482.0&_g=()
[You might have to copy that link, then click it and then paste it into the browser's address bar.]
The key part is `… this file will not be indexed. Exception: BlobNotFoundError …`. When I checked an hour later, the blob was present on the `gcp` replica.
As a result, the indexer didn't index that file which caused a GreenBox subscription not to be notified because its query relies on fields from `biomaterial.json`.
The following shows that the ES index documents differ between the two replicas:
The document for the `aws` replica contains the `biomaterials_json` key:
which gives
The same command for `gcp` gives nothing.
This is the `integration` deployment, BTW.