HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Metadata files in GCP are stored (or served) without Content-Type header #2073

Closed hannes-ucsc closed 5 years ago

hannes-ucsc commented 5 years ago

This is most easily reproduced with the CLI:

host:temp hannes$ python3 -m venv .venv
host:temp hannes$ . .venv/bin/activate
(.venv) host:temp hannes$ pip install hca
Collecting hca
  Downloading https://files.pythonhosted.org/packages/4a/3d/468a35203dd9fb5e42c2952301d2c97317759e47cbc5a89f53932ece409b/hca-5.0.2-py2.py3-none-any.whl (86kB)
    100% |████████████████████████████████| 92kB 4.9MB/s
... 
OMITTED 
...
Successfully installed Jinja2-2.10.1 MarkupSafe-1.1.1 PyJWT-1.7.1 PyYAML-3.13 argcomplete-1.9.5 argparse-1.4.0 asn1crypto-0.24.0 atomicwrites-1.3.0 awscli-1.16.145 boto3-1.9.135 botocore-1.12.135 cachetools-3.1.0 certifi-2019.3.9 cffi-1.12.3 chardet-3.0.4 colorama-0.3.9 commonmark-0.7.5 crc32c-1.7 crcmod-1.7 cryptography-2.3.1 dcplib-1.6.5 docutils-0.14 future-0.17.1 google-auth-1.6.3 google-auth-oauthlib-0.3.0 hca-5.0.2 idna-2.8 jmespath-0.9.4 jsonpointer-1.14 jsonschema-2.6.0 oauthlib-3.0.1 puremagic-1.4 pyasn1-0.4.5 pyasn1-modules-0.2.5 pycparser-2.19 python-dateutil-2.8.0 requests-2.21.0 requests-oauthlib-1.2.0 rsa-3.4.2 s3transfer-0.2.0 six-1.12.0 tenacity-5.0.4 tweak-1.0.2 urllib3-1.24.2
You are using pip version 9.0.3, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(.venv) host:temp hannes$ python
Python 3.6.5 (default, Jun 17 2018, 12:13:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hca.dss import DSSClient
>>> dss = DSSClient()
>>> aws_file = dss.get_file(uuid='16e48fd7-97e8-44a4-a20e-bebd673ccb98', version='2019-04-10T102139.752000Z', replica='aws')
>>> aws_file
{'describedBy': 'https://schema.humancellatlas.org/type/biomaterial/11.0.0/cell_suspension', 'schema_type': 'biomaterial', 'biomaterial_core': {'biomaterial_id': '22467_7#168', 'ncbi_taxon_id': [10090]}, 'genus_species': [{'text': 'Mus musculus', 'ontology': 'NCBITaxon:10090', 'ontology_label': 'Mus musculus'}], 'selected_cell_type': [{'text': 'B Cell', 'ontology': 'CL:0000236', 'ontology_label': 'B cell'}], 'estimated_cell_count': 1, 'plate_based_sequencing': {'plate_label': '604', 'well_label': 'B84'}, 'provenance': {'document_id': '16e48fd7-97e8-44a4-a20e-bebd673ccb98', 'submission_date': '2019-04-10T10:08:47.252Z', 'update_date': '2019-04-10T10:21:39.752Z'}}
>>> gcp_file = dss.get_file(uuid='16e48fd7-97e8-44a4-a20e-bebd673ccb98', version='2019-04-10T102139.752000Z', replica='gcp')
>>> gcp_file
b'{\n    "describedBy": "https://schema.humancellatlas.org/type/biomaterial/11.0.0/cell_suspension",\n    "schema_type": "biomaterial",\n    "biomaterial_core": {\n        "biomaterial_id": "22467_7#168",\n        "ncbi_taxon_id": [\n            10090\n        ]\n    },\n    "genus_species": [\n        {\n            "text": "Mus musculus",\n            "ontology": "NCBITaxon:10090",\n            "ontology_label": "Mus musculus"\n        }\n    ],\n    "selected_cell_type": [\n        {\n            "text": "B Cell",\n            "ontology": "CL:0000236",\n            "ontology_label": "B cell"\n        }\n    ],\n    "estimated_cell_count": 1,\n    "plate_based_sequencing": {\n        "plate_label": "604",\n        "well_label": "B84"\n    },\n    "provenance": {\n        "document_id": "16e48fd7-97e8-44a4-a20e-bebd673ccb98",\n        "submission_date": "2019-04-10T10:08:47.252Z",\n        "update_date": "2019-04-10T10:21:39.752Z"\n    }\n}'
>>>

Note that aws_file is a dict, while gcp_file is bytes. The reason is that the conditional

https://github.com/HumanCellAtlas/dcp-cli/blob/9a63cc86a163dd390fdf9949bdbc19c50e138f01/hca/util/__init__.py#L196

has a different outcome on the gcp replica because the response from there apparently lacks the Content-Type header.
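As a rough illustration of why the two replicas yield different Python types, a client that keys its decoding on the Content-Type header might behave like the sketch below. The name `decode_body` is hypothetical and this is not the dcp-cli code at the link above; it only assumes that the body is parsed as JSON when the header announces `application/json` and is returned as raw bytes otherwise.

```python
import json


def decode_body(content_type, body):
    """Hypothetical helper: parse JSON only when the header announces it."""
    # Ignore parameters such as dcp-type="metadata/biomaterial".
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    # Missing or generic content type: hand back the raw bytes unchanged.
    return body


body = b'{"schema_type": "biomaterial"}'
aws_like = decode_body('application/json; dcp-type="metadata/biomaterial"', body)
gcp_like = decode_body(None, body)  # header absent, as suspected for gcp
print(type(aws_like).__name__, type(gcp_like).__name__)
```

With identical bytes, the first call returns a dict and the second returns bytes, which matches the difference between aws_file and gcp_file.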

I would expect both replicas to behave identically. This is not a CLI bug; I am only using the CLI to reproduce it.

hannes-ucsc commented 5 years ago

I don't know if this applies to all files, but I saw it with many files. The file in the reproduction was randomly selected. It's from a very recent bundle; see the version.

xbrianh commented 5 years ago

Hitting the API directly with that file UUID appears to give consistent results. Could this be in the CLI after all?

~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type:
Content-Type: text/html; charset=utf-8
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type:
Content-Type: text/html; charset=utf-8
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep X-DSS-CONTENT-TYPE
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep X-DSS-CONTENT-TYPE
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"

hannes-ucsc commented 5 years ago

You need to follow redirects.

$ http -h --follow  https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type
Content-Type: application/json; dcp-type="metadata/biomaterial"
$ http -h --follow  https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type
Content-Type: application/octet-stream
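
The same check can be sketched in Python with only the standard library. This is a minimal sketch, not part of the DSS or the CLI: `file_url`, `final_content_type`, and `report` are hypothetical names, and it relies on `urllib.request.urlopen` following redirects by default, so the header it reads comes from the final response rather than from the DSS API endpoint.

```python
import urllib.parse
import urllib.request

# Host and path taken from the commands above.
DSS_FILES = "https://dss.data.humancellatlas.org/v1/files"


def file_url(uuid, replica):
    """Build the file URL used in the commands above."""
    return f"{DSS_FILES}/{uuid}?" + urllib.parse.urlencode({"replica": replica})


def final_content_type(url):
    """Return the Content-Type of the final response, after any redirects."""
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Content-Type")


def report(uuid, replicas=("aws", "gcp")):
    """Print the final Content-Type per replica (performs network requests)."""
    for replica in replicas:
        print(replica, final_content_type(file_url(uuid, replica)))
```

Calling `report('16e48fd7-97e8-44a4-a20e-bebd673ccb98')` issues one GET per replica; nothing is fetched until it is called.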

xbrianh commented 5 years ago

Digging into this, let's get the blob key:

~>http HEAD https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 0
Content-Type: text/html; charset=utf-8
Date: Thu, 25 Apr 2019 02:54:16 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-AWS-REQUEST-ID: 7d590421-b748-42db-a47b-f8f9e00b5449
X-Amzn-Trace-Id: Root=1-5cc12158-e81850d480564b089629153c;Sampled=0
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"
X-DSS-CRC32C: b4d5c8b1
X-DSS-CREATOR-UID: 8008
X-DSS-S3-ETAG: 014e62e89e5a2f0bee7f5e794a16e18b
X-DSS-SHA1: aaf128c0f0adf788c6940e6112cd8a180c7ad1ef
X-DSS-SHA256: 89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f
X-DSS-SIZE: 920
X-DSS-VERSION: 2019-04-10T102139.752000Z
x-amz-apigw-id: YrIlzEbpIAMFqfw=
x-amzn-RequestId: 6a0ac9c6-6705-11e9-81af-2545b0e2ed9b

The logs show that this blob was synced from AWS to GCP:

[INFO]  2019-04-11T20:47:43.790Z    89c96650-28fd-4062-8bd0-661725aa930e    Finished transfer of blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1 from Replica.aws to Replica.gcp
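
Comparing the key in that log line with the response headers suggests that the blob key is the four checksums joined by dots, in the order SHA-256, SHA-1, S3 ETag, CRC32C. The sketch below rebuilds the key from the headers under that assumption; `blob_key_from_headers` is a hypothetical helper, not a DSS function, and the layout is inferred from the log line rather than taken from DSS code.

```python
def blob_key_from_headers(headers):
    """Hypothetical helper: rebuild the blob key from X-DSS-* headers.

    Assumes the layout blobs/<sha256>.<sha1>.<s3-etag>.<crc32c>, which is
    inferred from the log line above.
    """
    parts = (
        headers["X-DSS-SHA256"],
        headers["X-DSS-SHA1"],
        headers["X-DSS-S3-ETAG"],
        headers["X-DSS-CRC32C"],
    )
    return "blobs/" + ".".join(parts)


# Header values copied from the HEAD response above.
headers = {
    "X-DSS-CRC32C": "b4d5c8b1",
    "X-DSS-S3-ETAG": "014e62e89e5a2f0bee7f5e794a16e18b",
    "X-DSS-SHA1": "aaf128c0f0adf788c6940e6112cd8a180c7ad1ef",
    "X-DSS-SHA256": "89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f",
}
key = blob_key_from_headers(headers)
print(key)
```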

OK, let's look at the blob metadata on AWS:

~>aws s3api head-object --bucket org-hca-dss-prod --key blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1
{
    "AcceptRanges": "bytes",
    "LastModified": "Thu, 11 Apr 2019 20:47:43 GMT",
    "ContentLength": 920,
    "ETag": "\"014e62e89e5a2f0bee7f5e794a16e18b\"",
    "ContentType": "application/json; dcp-type=\"metadata/biomaterial\"",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}

and now on GCP:

~>gsutil ls -L gs://org-hca-dss-prod/blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1
gs://org-hca-dss-prod/blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1:
    Creation time:          Thu, 11 Apr 2019 20:47:43 GMT
    Update time:            Thu, 11 Apr 2019 20:47:43 GMT
    Storage class:          MULTI_REGIONAL
    Content-Length:         920
    Content-Type:           application/octet-stream
    Hash (crc32c):          tNXIsQ==
    Hash (md5):             AU5i6J5aLwvuf155Shbhiw==
    ETag:                   CIjCieL0yOECEAE=

The content types do not match. Perhaps this is a bug in the sync daemon.
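
To check that only the metadata differs and not the data, the checksums from the two listings can be compared. gsutil prints its hashes base64-encoded, while the S3 ETag and the X-DSS-CRC32C header are hex, so the sketch below converts before comparing. All values are copied from the output above; the variable and function names are illustrative only.

```python
import base64

# Values copied from the head-object, HEAD response, and gsutil output above.
aws = {
    "md5_hex": "014e62e89e5a2f0bee7f5e794a16e18b",  # S3 ETag, compared with the GCP MD5
    "crc32c_hex": "b4d5c8b1",  # X-DSS-CRC32C header
    "content_type": 'application/json; dcp-type="metadata/biomaterial"',
}
gcp = {
    "md5_b64": "AU5i6J5aLwvuf155Shbhiw==",
    "crc32c_b64": "tNXIsQ==",
    "content_type": "application/octet-stream",
}


def b64_to_hex(value):
    """Convert a base64-encoded digest to lowercase hex."""
    return base64.b64decode(value).hex()


same_bytes = (
    b64_to_hex(gcp["md5_b64"]) == aws["md5_hex"]
    and b64_to_hex(gcp["crc32c_b64"]) == aws["crc32c_hex"]
)
same_type = gcp["content_type"] == aws["content_type"]
print(same_bytes, same_type)  # True False
```

The checksums agree, so the blob contents arrived intact on GCP; only the Content-Type differs between the two replicas.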

xbrianh commented 5 years ago

This breaks checkout caching, which depends on blob content type.

hannes-ucsc commented 5 years ago

You need to actually fix the files. This is production. The example files are still broken.

$ http -h --follow  https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type
Content-Type: application/octet-stream
$ http -h --follow  https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type
Content-Type: application/json; dcp-type="metadata/biomaterial"

kozbo commented 5 years ago

Should be covered by #2096

hannes-ucsc commented 5 years ago

Sorry for being pedantic but, again, this is production we're talking about and we can't afford for this to slip through the cracks by prematurely closing this ticket without an actual resolution in place.