Out-of-order sync of blobs causes bundles to be partially indexed

hannes-ucsc commented 6 years ago

Originally reported by @rexwangcc on Slack. When the indexer tried to index the gcp copy of a bundle submitted to the aws replica earlier, it reported one of the blobs, the one for biomaterials.json, missing:

https://logs.dev.data.humancellatlas.org/_plugin/kibana/app/kibana#/doc/*/cwl-2018-04-05/fromFirehose?id=49583086244926168104730357671478770103613383151955476482.0&_g=()

[You might have to copy that link, then click it and then paste it into the browser's address bar.]

{
  "_index": "cwl-2018-04-05",
  "_type": "fromFirehose",
  "_id": "49583086244926168104730357671478770103613383151955476482.0",
  "_score": 1,
  "_source": {
    "@message": "[WARNING]\t2018-04-05T20:05:36.671Z\tb421931b-390c-11e8-b44c-1726d06a9efd\tIn bundle 26518a61-12b2-4ce8-ae03-bafd139b80eb.2018-04-05T200527.642039Z the file 'biomaterial.json' is marked for indexing yet could not be accessed. This file will not be indexed. Exception: BlobNotFoundError, file blob key: blobs/2e898cee332d8352ee55b7cc2031d738e141ff97075807891553e04c746134a0.64f83fddbbff486018bd13be5ee0cf362d88b65a.2fefd90d0e7d237337191e7c6fb8052b.4f2e2543\n",
    "@id": "33963114734376066906980778872193634681259338252515016746",
    "@timestamp": "2018-04-05T20:05:36.671Z",
    "@owner": "861229788715",
    "@log_group": "/aws/lambda/dss-index-integration",
    "@log_stream": "2018/04/05/[$LATEST]3f736ddaa55944418c37efa4d4e5b825"
  },
  "fields": {
    "@timestamp": [
      1522958736671
    ]
  }
}

The key part is … this file will not be indexed. Exception: BlobNotFoundError ….

When I checked an hour later, the blob was present on the gcp replica.

As a result, the indexer didn't index that file which caused a GreenBox subscription not to be notified because its query relies on fields from biomaterial.json.

The following shows that the ES index documents differ between the two replicas:

The document for the aws replica contains the biomaterials_json key:

ca dss post-search --replica aws --es-query='
{
  "query": {
    "term": {
      "_id": {
        "value": "26518a61-12b2-4ce8-ae03-bafd139b80eb.2018-04-05T200527.642039Z"
      }
    }
  }
}' --output-format=raw | jq '..|.biomaterial_json?//empty'

which gives

{
  "biomaterials": [
    {
      "content": {
        "biomaterial_core": {
          "biomaterial_id": "Q4_DEMO-sample_SAMN02797092",
          "biomaterial_name": "Q4_DEMO-Single cell mRNA-seq_MGH30_A01",
          "has_input_biomaterial": "Q4_DEMO-donor_MGH30",
          "ncbi_taxon_id": [
            9606
          ],
          "supplementary_files": [
            "Q4_DEMO-protocol"
          ]
        },
        "describedBy": "https://schema.humancellatlas.org/type/biomaterial/5.1.0/specimen_from_organism",
        "genus_species": [
          {
            "ontology": "NCBITaxon:9606",
            "text": "Homo sapiens"
          }
        ],
        "organ": {
          "ontology": "astrocyte",
          "text": "brain"
        },
        "organ_part": {
          "text": "glioblastoma"
        },
        "schema_type": "biomaterial"
      },
      "hca_ingest": {
        "accession": "",
        "document_id": "1f723151-fda4-4cec-8d0e-e6759abe7037",
        "submissionDate": "2018-04-05T20:00:30.528Z",
        "updateDate": "2018-04-05T20:00:38.613Z"
      }
    },
    {
      "content": {
        "biological_sex": "unknown",
        "biomaterial_core": {
          "biomaterial_id": "Q4_DEMO-donor_MGH30",
          "biomaterial_name": "Q4 DEMO donor MGH30",
          "ncbi_taxon_id": [
            9606
          ]
        },
        "describedBy": "https://schema.humancellatlas.org/type/biomaterial/5.1.0/donor_organism",
        "genus_species": [
          {
            "ontology": "NCBITaxon:9606",
            "text": "Homo sapiens"
          }
        ],
        "is_living": true,
        "schema_type": "biomaterial"
      },
      "hca_ingest": {
        "accession": "",
        "document_id": "2955c6ef-85f1-4c0e-b6cb-e3bd17fc6443",
        "submissionDate": "2018-04-05T20:00:30.551Z",
        "updateDate": "2018-04-05T20:00:38.235Z"
      }
    }
  ],
  "describedBy": "https://schema.humancellatlas.org/bundle/5.1.0/biomaterial",
  "schema_type": "biomaterial_bundle",
  "schema_version": "5.1.0"
}

The same command for gcp

hca dss post-search --replica gcp --es-query='
{
  "query": {
    "term": {
      "_id": {
        "value": "26518a61-12b2-4ce8-ae03-bafd139b80eb.2018-04-05T200527.642039Z"
      }
    }
  }
}' --output-format=raw | jq '..|.biomaterial_json?//empty'

gives nothing.

This is the integration deployment, BTW.

rexwangcc commented 6 years ago

Thanks you very much for filing this issue @hannes-ucsc It would be really appreciated if Blue can prioritize fixing this when you have some bandwidth, since this is a blocker to let the dcp-integration test pass on V5 metdata schema.

hannes-ucsc commented 6 years ago

I wouldn't call this a blocker. It occurs too infrequently.

hannes-ucsc commented 6 years ago

@Bento007 this would be good if we could fix this in the next couple of days. I think we should change the indexer to repeatedly poll for the blob to appear until a timeout of, say, 1min occurs, and then fail with a bang (aka 'exception') without indexing anything.

This is the perfect example for when being liberal in what you expect (proceeding to index with a blob missing) merely ends up kicking the correctness can down the road, not fixing the issue, but merely making it harder to diagnose.

hannes-ucsc commented 6 years ago

Note: besides creating an index document that is inconsistent with storage (once all blobs eventually show up in storage), the current behavior has another negative side-effect: Because of the missing file and the associated schema, the document yields a shape descriptor that lacks the entry for the missing schema and therefore gets indexed into a separate index. It will likely remain the only document in that index until this EC issue happens again for the same file on another bundle.

hannes-ucsc commented 6 years ago

@kislyuk mentioned that this could also have been caused by an ordering problem in dss-sync. If it copies the manifest before the blob, this issue could also occur. cc @ttung

hannes-ucsc commented 6 years ago

Summarizing a in-person discussion of this:

@kislyuk thinks the long term fix is to use step-functions to orchestrate sync'ing of entire bundles thereby enforcing that a bundle's files are sync'ed before the bundle's manifest.

In the mean time, the temporary fix is to retry indexing the bundle withing the indexer invocation, even though that is limited to a maximum running time of 5min.

@amork proposed using his reaper/retry ~WIP~ feature for the temporary retry fix but @kislyuk feels that that is too complicated for a temporary fix and would like to save the effort for actually fixing this permanently as outlined above.

@Bento007 I think you can go ahead and implement the temporary fix of retrying until the file appears, but limited within the indexer invocation lambda.

HumanCellAtlas / data-store

Out-of-order sync of blobs causes bundles to be partially indexed #1115