DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

BDBag ZIP file contains temporary directory name #991

Closed hannes-ucsc closed 5 years ago

hannes-ucsc commented 5 years ago
azul.stable hannes$ mkdir foo
azul.stable hannes$ cd foo/
foo hannes$ http -v 'https://service.explore.data.humancellatlas.org/fetch/manifest/files?filters=%7B%22file%22%3A%7B%22projectId%22%3A%7B%22is%22%3A%5B%22179bf9e6-5b33-4c5b-ae26-96c7270976b8%22%5D%7D%2C%22fileFormat%22%3A%7B%22is%22%3A%5B%22bam%22%2C%22fastq.gz%22%5D%7D%7D%7D&format=bdbag'
GET /fetch/manifest/files?filters=%7B%22file%22%3A%7B%22projectId%22%3A%7B%22is%22%3A%5B%22179bf9e6-5b33-4c5b-ae26-96c7270976b8%22%5D%7D%2C%22fileFormat%22%3A%7B%22is%22%3A%5B%22bam%22%2C%22fastq.gz%22%5D%7D%7D%7D&format=bdbag HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: service.explore.data.humancellatlas.org
User-Agent: HTTPie/0.9.9

HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 228
Content-Type: application/json
Date: Mon, 13 May 2019 20:52:50 GMT
Via: 1.1 b7eb7fa2bc4ab6266ed04710fbd0842c.cloudfront.net (CloudFront)
X-Amz-Cf-Id: h2fB2B7j63LpkZFqNav-y-4BkFOApKxeM2izoUExZ5gnwMcfsh8DcQ==
X-Amzn-Trace-Id: Root=1-5cd9d922-a93c6df0ea758e008cbbe238;Sampled=0
X-Cache: Miss from cloudfront
x-amz-apigw-id: Zo7dYED4IAMFmJA=
x-amzn-RequestId: 1218495f-75c1-11e9-a215-ddb95fac0547

{
    "Location": "https://service.explore.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiMDk2MWI3M2QtMjA2ZS00NTk0LWE3YzMtOWE3OGFjMjk1YWYzIiwgInJlcXVlc3RfaW5kZXgiOiAxfQ==",
    "Retry-After": 1,
    "Status": 301
}

foo hannes$ http -v 'https://service.explore.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiMDk2MWI3M2QtMjA2ZS00NTk0LWE3YzMtOWE3OGFjMjk1YWYzIiwgInJlcXVlc3RfaW5kZXgiOiAxfQ=='
GET /fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiMDk2MWI3M2QtMjA2ZS00NTk0LWE3YzMtOWE3OGFjMjk1YWYzIiwgInJlcXVlc3RfaW5kZXgiOiAxfQ== HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: service.explore.data.humancellatlas.org
User-Agent: HTTPie/0.9.9

HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 1102
Content-Type: application/json
Date: Mon, 13 May 2019 20:53:22 GMT
Via: 1.1 5eb275fcc12eed81d31710e5eed4b529.cloudfront.net (CloudFront)
X-Amz-Cf-Id: 5sBy01hnzj9Ko3-8zsFrU7Yh1OpwaL61BbREKCcIqnBhNPxMEFxn7Q==
X-Amzn-Trace-Id: Root=1-5cd9d942-83f0c031210efcb5d814bd9b;Sampled=0
X-Cache: Miss from cloudfront
x-amz-apigw-id: Zo7iZHswIAMFgbQ=
x-amzn-RequestId: 254382c4-75c1-11e9-a75d-1913f2feec65

{
    "Location": "https://azul-storage-prod.s3.amazonaws.com/manifests/755e97e0-1575-449e-b89d-682e5d32739f.zip?AWSAccessKeyId=ASIARSZHKI4KIONURBM3&Signature=TT4okTzGWrAN1BsEryvoJdP2YN8%3D&x-amz-security-token=AgoJb3JpZ2luX2VjEMz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIQDWwMFF7Oqw%2FJlKesQQTmH28K6NldgO55pRGcORgso7gQIgakq4nKeFVNW7zQOhrp0Hau7ZuawSKCVj3hcG6IxZN6gqnwIIxf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwxMDkwNjcyNTc2MjAiDDViG0tFV1JDwVYQsCrzAcP%2BNla8KRUrtqM0c%2Bq6M27XphRSEZkFni73S4OsRUoemfZ3Yfp2zrtkS8CwmYv7y%2FrDGCdryPpocfPSwQotr%2FuMibaMuwTXKnW6ItCSQ%2Bblr%2FqgoHkP5VszZaAXqiXj6PIGj0j9oFnI%2B7PKB3OmPp7hTsKKVYk27ARmGuRkPJaU4eV4fSzzbu856Vk6mYPu23mgDSjaPJAHfDeU7mPCGpYIsxBnfVQfKQNQijTDfAuvxqT25GaWSQawDFJtBZF4z8AjfIpCc35YvqR86t79T8%2BxsBvLcf70GIvA4sLb0B34J7t5rT64zE148runwpZ5y7mbHjD5kefmBTq0ATRkHpsR5fA6%2FX3OElVNXIe5ZunxZrXyNPwMng9vyxz6zZkEfrrRq4c7zmMbYaIoMQtZHC2Sf3B%2B7%2BkgRAtFkL01cjtB66BvyO2QgjtFfafydMqqjifaRmSV%2BwmXuO8mBNxMmwbZf2MzadUGnGxGsTV8Z7Dm14zMII5YvtI3dXO%2Fli10%2FGtfnNL4hg%2FoBo1MA0CoHK8Jfv2FQgKpafC1k4DYtTOJtvT6mqCMF6muFKBpLuXQRA%3D%3D&Expires=1557784371",
    "Status": 302
}

foo hannes$ curl -o foo.zip 'https://azul-storage-prod.s3.amazonaws.com/manifests/755e97e0-1575-449e-b89d-682e5d32739f.zip?AWSAccessKeyId=ASIARSZHKI4KIONURBM3&Signature=TT4okTzGWrAN1BsEryvoJdP2YN8%3D&x-amz-security-token=AgoJb3JpZ2luX2VjEMz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIQDWwMFF7Oqw%2FJlKesQQTmH28K6NldgO55pRGcORgso7gQIgakq4nKeFVNW7zQOhrp0Hau7ZuawSKCVj3hcG6IxZN6gqnwIIxf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARABGgwxMDkwNjcyNTc2MjAiDDViG0tFV1JDwVYQsCrzAcP%2BNla8KRUrtqM0c%2Bq6M27XphRSEZkFni73S4OsRUoemfZ3Yfp2zrtkS8CwmYv7y%2FrDGCdryPpocfPSwQotr%2FuMibaMuwTXKnW6ItCSQ%2Bblr%2FqgoHkP5VszZaAXqiXj6PIGj0j9oFnI%2B7PKB3OmPp7hTsKKVYk27ARmGuRkPJaU4eV4fSzzbu856Vk6mYPu23mgDSjaPJAHfDeU7mPCGpYIsxBnfVQfKQNQijTDfAuvxqT25GaWSQawDFJtBZF4z8AjfIpCc35YvqR86t79T8%2BxsBvLcf70GIvA4sLb0B34J7t5rT64zE148runwpZ5y7mbHjD5kefmBTq0ATRkHpsR5fA6%2FX3OElVNXIe5ZunxZrXyNPwMng9vyxz6zZkEfrrRq4c7zmMbYaIoMQtZHC2Sf3B%2B7%2BkgRAtFkL01cjtB66BvyO2QgjtFfafydMqqjifaRmSV%2BwmXuO8mBNxMmwbZf2MzadUGnGxGsTV8Z7Dm14zMII5YvtI3dXO%2Fli10%2FGtfnNL4hg%2FoBo1MA0CoHK8Jfv2FQgKpafC1k4DYtTOJtvT6mqCMF6muFKBpLuXQRA%3D%3D&Expires=1557784371'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  169k  100  169k    0     0   280k      0 --:--:-- --:--:-- --:--:--  280k
foo hannes$ ls
foo.zip
foo hannes$ unzip foo.zip
Archive:  foo.zip
  inflating: tmpbuddkkoa/manifest-sha256.txt
  inflating: tmpbuddkkoa/tagmanifest-md5.txt
  inflating: tmpbuddkkoa/bagit.txt
  inflating: tmpbuddkkoa/bag-info.txt
  inflating: tmpbuddkkoa/tagmanifest-sha256.txt
  inflating: tmpbuddkkoa/manifest-md5.txt
  inflating: tmpbuddkkoa/data/samples.tsv

┆Issue is synchronized with this Jira Story ┆Project Name: azul ┆Issue Number: AZUL-618

hannes-ucsc commented 5 years ago

This may be the underlying cause of #979.

mikebaumann commented 5 years ago

I think it is currently an open question whether the temporary directory name matters or not to Terra. I just now downloaded and listed one of the BDBags from the UCSC Commons data browser, for which automated import into Terra is working, and its internal structure is as follows:

$ unzip -l 9c8a7496-a0c5-49f5-92a5-3fd748afaa80.zip
Archive:  9c8a7496-a0c5-49f5-92a5-3fd748afaa80.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      204  05-14-2019 19:05   manifest/bag-info.txt
      323  05-14-2019 19:05   manifest/tagmanifest-sha256.txt
      579  05-14-2019 19:05   manifest/tagmanifest-sha512.txt
      299  05-14-2019 19:05   manifest/manifest-sha512.txt
       55  05-14-2019 19:05   manifest/bagit.txt
      171  05-14-2019 19:05   manifest/manifest-sha256.txt
       68  05-14-2019 19:05   manifest/data/participants.tsv
   104236  05-14-2019 19:05   manifest/data/samples.tsv
---------                     -------
   105935                     8 files

Yet, looking at the Broad's code, I am not sure the name of the top-level directory matters: https://github.com/broadinstitute/firecloud-orchestration/blob/6eb6c12c34f6bfc556161eb9d712f645f1d55e22/src/main/scala/org/broadinstitute/dsde/firecloud/EntityClient.scala#L65-L91

It looks to me like it just iterates over all the entries in the ZIP file, looking for one that is not a directory and whose pathname contains /samples.tsv. I don't see any specific occurrences of manifests, nor do I see why having a temp directory name instead would matter.

I am checking with @mjkrause to see if he has any additional information regarding this.
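
The lookup described above can be sketched in Python. This is only an illustration of the behaviour as read from the linked Scala code, not a port of it. The helper names `find_samples_tsv` and `make_bag` are hypothetical, and the file contents are placeholders. Under that reading, the name of the top-level directory does not affect whether the entry is found.

```python
import io
import zipfile
from typing import Optional


def find_samples_tsv(zip_bytes: bytes) -> Optional[str]:
    """Return the first non-directory entry whose path contains '/samples.tsv'.

    Hypothetical helper mirroring the lookup described in the comment above.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if not info.is_dir() and '/samples.tsv' in info.filename:
                return info.filename
    return None


def make_bag(top_level: str) -> bytes:
    """Build a tiny in-memory ZIP with the given top-level directory name.

    The file contents are placeholders, not a valid bag.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr(f'{top_level}/bagit.txt', 'placeholder\n')
        zf.writestr(f'{top_level}/data/samples.tsv', 'placeholder\n')
    return buf.getvalue()


# Both a temporary-looking name and 'manifest' lead to a match.
for name in ('tmpbuddkkoa', 'manifest'):
    print(find_samples_tsv(make_bag(name)))
```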

hannes-ucsc commented 5 years ago

https://tools.ietf.org/html/rfc8493#section-2

"The base directory can have any name, as illustrated by the figure below."

That of course doesn't mean that Terra doesn't depend on manifest being the directory name.

hannes-ucsc commented 5 years ago

"I don't see any specific occurrences of manifests, nor do I see why having a temp directory name instead would matter."

Don't you want to search for singular, manifest?

hannes-ucsc commented 5 years ago

I still think we should have a deterministic name, even if the spec allows for any name.

mikebaumann commented 5 years ago

Yes, I did search for the singular manifest in the code; the plural manifests was a typo in the comment above.

mikebaumann commented 5 years ago

Reposting a message from @mjkrause:

I'm not aware of any convention for the BDBag name with the Broad or Terra/FireCloud. It is true that we had explicitly named the folder manifest for the commons, and I think at that point I simply found it to be less confusing than the unintuitive name of a temporary folder. In addition, as far as I understand from glancing at the code, manifest was inside of a temporary folder, so the commons temporary folder structure is one layer deeper than the HCA bag. But the file(s) in the payload should not be affected by that.

mikebaumann commented 5 years ago

I think a deterministic name for the top-level directory makes sense, and I think continuing use of the name manifest is a good choice.
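
A minimal sketch of how a deterministic top-level name could be enforced when zipping a bag that was built in a temporary directory. It assumes the archive is written with Python's standard `zipfile` module. `zip_bag` and `demo` are hypothetical names, and this is not Azul's actual implementation. The idea is to compute each entry's archive name relative to the bag directory and prefix it with a fixed name, so the temporary directory name never reaches the archive.

```python
import os
import tempfile
import zipfile
from typing import List


def zip_bag(bag_dir: str, zip_path: str, top_level: str = 'manifest') -> None:
    """Zip the contents of bag_dir under a fixed top-level directory name."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(bag_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                # Path relative to the bag directory, so the temporary
                # directory name is dropped from the archive entry.
                rel = os.path.relpath(path, bag_dir)
                arcname = '/'.join([top_level, *rel.split(os.sep)])
                zf.write(path, arcname=arcname)


def demo() -> List[str]:
    """Build a placeholder bag in a randomly named directory and zip it."""
    with tempfile.TemporaryDirectory() as tmp:
        bag_dir = tempfile.mkdtemp(dir=tmp)  # random name, e.g. tmpXXXXXXXX
        os.makedirs(os.path.join(bag_dir, 'data'))
        for rel in ('bagit.txt', os.path.join('data', 'samples.tsv')):
            with open(os.path.join(bag_dir, rel), 'w') as f:
                f.write('placeholder\n')
        zip_path = os.path.join(tmp, 'bag.zip')
        zip_bag(bag_dir, zip_path)
        with zipfile.ZipFile(zip_path) as zf:
            return sorted(zf.namelist())


print(demo())
```

Every entry in the resulting archive starts with `manifest/`, whatever name the temporary directory received.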

theathorn commented 5 years ago

Demoed 5/28/19