Closed hannes-ucsc closed 5 years ago
I don't know if this applies to all files, but I saw this with many files. The file in the reproduction was randomly selected. It's from a very recent bundle, see version.
Hitting the API directly with that file UUID appears to give consistent results. Could this be in the CLI after all?
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type:
Content-Type: text/html; charset=utf-8
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type:
Content-Type: text/html; charset=utf-8
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep X-DSS-CONTENT-TYPE
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"
~>http -h https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep X-DSS-CONTENT-TYPE
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"
You need to follow redirects.
$ http -h --follow https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type
Content-Type: application/json; dcp-type="metadata/biomaterial"
$ http -h --follow https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type
Content-Type: application/octet-stream
Digging into this, let's get the blob key:
~>http HEAD https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 0
Content-Type: text/html; charset=utf-8
Date: Thu, 25 Apr 2019 02:54:16 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-AWS-REQUEST-ID: 7d590421-b748-42db-a47b-f8f9e00b5449
X-Amzn-Trace-Id: Root=1-5cc12158-e81850d480564b089629153c;Sampled=0
X-DSS-CONTENT-TYPE: application/json; dcp-type="metadata/biomaterial"
X-DSS-CRC32C: b4d5c8b1
X-DSS-CREATOR-UID: 8008
X-DSS-S3-ETAG: 014e62e89e5a2f0bee7f5e794a16e18b
X-DSS-SHA1: aaf128c0f0adf788c6940e6112cd8a180c7ad1ef
X-DSS-SHA256: 89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f
X-DSS-SIZE: 920
X-DSS-VERSION: 2019-04-10T102139.752000Z
x-amz-apigw-id: YrIlzEbpIAMFqfw=
x-amzn-RequestId: 6a0ac9c6-6705-11e9-81af-2545b0e2ed9b
We see in the logs this blob was synced from AWS to GCP:
[INFO] 2019-04-11T20:47:43.790Z 89c96650-28fd-4062-8bd0-661725aa930e Finished transfer of blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1 from Replica.aws to Replica.gcp
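For anyone retracing this step: comparing the HEAD response with the log line suggests the blob key is just the four checksum headers joined with dots. A minimal sketch under that assumption (the layout is inferred from the output above, not taken from the DSS source, and `blob_key` is a hypothetical helper name):

```python
def blob_key(headers):
    """Build the blob key from the X-DSS-* checksum headers of a file response.

    Assumed layout, inferred from the log line: blobs/<sha256>.<sha1>.<s3-etag>.<crc32c>
    """
    # HTTP header names are case-insensitive, so normalise them before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    parts = (h["x-dss-sha256"], h["x-dss-sha1"], h["x-dss-s3-etag"], h["x-dss-crc32c"])
    return "blobs/" + ".".join(parts)


# Header values copied from the HEAD response above.
headers = {
    "X-DSS-CRC32C": "b4d5c8b1",
    "X-DSS-S3-ETAG": "014e62e89e5a2f0bee7f5e794a16e18b",
    "X-DSS-SHA1": "aaf128c0f0adf788c6940e6112cd8a180c7ad1ef",
    "X-DSS-SHA256": "89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f",
}
print(blob_key(headers))
```

The printed key can then be passed to `aws s3api head-object` or `gsutil ls -L` to inspect the blob in each replica.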
OK, let's look at the blob metadata on AWS:
~>aws s3api head-object --bucket org-hca-dss-prod --key blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1
{
"AcceptRanges": "bytes",
"LastModified": "Thu, 11 Apr 2019 20:47:43 GMT",
"ContentLength": 920,
"ETag": "\"014e62e89e5a2f0bee7f5e794a16e18b\"",
"ContentType": "application/json; dcp-type=\"metadata/biomaterial\"",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
and now on GCP:
~>gsutil ls -L gs://org-hca-dss-prod/blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1
gs://org-hca-dss-prod/blobs/89f880cb12f9acd7a78e6a0228b9dad4644be26d813216498691049793e0780f.aaf128c0f0adf788c6940e6112cd8a180c7ad1ef.014e62e89e5a2f0bee7f5e794a16e18b.b4d5c8b1:
Creation time: Thu, 11 Apr 2019 20:47:43 GMT
Update time: Thu, 11 Apr 2019 20:47:43 GMT
Storage class: MULTI_REGIONAL
Content-Length: 920
Content-Type: application/octet-stream
Hash (crc32c): tNXIsQ==
Hash (md5): AU5i6J5aLwvuf155Shbhiw==
ETag: CIjCieL0yOECEAE=
The content type does not match. Perhaps a bug in the sync daemon.
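It may be worth ruling out that the bytes themselves differ. The two listings encode hashes differently (hex in the DSS headers and the S3 ETag, base64 in the `gsutil` output), so here is a rough sketch that converts the GCS values to hex for comparison. `gcs_hash_to_hex` is a hypothetical helper, and treating the S3 ETag as an MD5 assumes the object was not a multipart upload:

```python
import base64


def gcs_hash_to_hex(b64_value):
    """Convert a base64 hash as printed by `gsutil ls -L` into a hex string."""
    return base64.b64decode(b64_value).hex()


# Values copied from the gsutil listing above.
crc32c_hex = gcs_hash_to_hex("tNXIsQ==")
md5_hex = gcs_hash_to_hex("AU5i6J5aLwvuf155Shbhiw==")

# Values copied from the X-DSS-CRC32C and X-DSS-S3-ETAG headers.
same_bytes = crc32c_hex == "b4d5c8b1" and md5_hex == "014e62e89e5a2f0bee7f5e794a16e18b"
print(crc32c_hex, md5_hex, same_bytes)
```

Both checksums line up with the AWS side, so the object data appears to have been copied intact and only the content type was lost in the transfer.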
This breaks checkout caching, which depends on blob content type.
You need to actually fix the files. This is production. The example files are still broken.
$ http -h --follow https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=gcp | grep Content-Type
Content-Type: application/octet-stream
$ http -h --follow https://dss.data.humancellatlas.org/v1/files/16e48fd7-97e8-44a4-a20e-bebd673ccb98?replica=aws | grep Content-Type
Content-Type: application/json; dcp-type="metadata/biomaterial"
Should be covered by #2096
Sorry for being pedantic but, again, this is production we're talking about and we can't afford for this to slip through the cracks by prematurely closing this ticket without an actual resolution in place.
This can be most easily reproduced in the CLI
Note that aws_file is a dict, while gcp_file is bytes. The reason is that the conditional at https://github.com/HumanCellAtlas/dcp-cli/blob/9a63cc86a163dd390fdf9949bdbc19c50e138f01/hca/util/__init__.py#L196 has a different outcome on the gcp replica because the response from there apparently lacks the Content-Type header.
I would expect both replicas to behave identically. This is not a CLI bug, I just use the CLI to reproduce it.
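To illustrate why the same blob comes back as two different Python types, here is a simplified stand-in for that kind of conditional. This is not the CLI's actual code: `decode_body` is a hypothetical function and the body is made-up sample data. It only shows how a client that decodes on Content-Type would behave with the two headers seen in this thread:

```python
import json


def decode_body(content_type, body):
    """Return parsed JSON when the Content-Type says JSON, else the raw bytes."""
    # Drop parameters such as dcp-type="..." and compare only the media type.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    return body


# Made-up body; the Content-Type values are the ones reported for each replica.
body = b'{"example": "hypothetical metadata"}'
aws_file = decode_body('application/json; dcp-type="metadata/biomaterial"', body)
gcp_file = decode_body("application/octet-stream", body)
print(type(aws_file).__name__, type(gcp_file).__name__)
```

With identical bytes, the result is a dict for the aws header and bytes for the gcp one, which matches the behaviour described above.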