Open theathorn opened 1 year ago
It seems that files (fastq.gz format, from the specified project) cannot be downloaded via curl when they also lack DRS URI download links in the Files tab of the Data Browser. The following curl command requests these files and fails:
curl --location --fail 'https://service.azul.data.humancellatlas.org/manifest/files?catalog=lm2&format=curl&filters=%7B%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%2C+%22fileFormat%22%3A+%7B%22is%22%3A+%5B%22fastq.gz%22%5D%7D%7D&objectKey=manifests%2F1ac72f75-b538-5e6c-a3d2-ac205a3adaab.6b76de54-7552-5fa5-b96c-db21be0be76b.curlrc' | curl --config -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2977 100 2977 0 0 7340 0 --:--:-- --:--:-- --:--:-- 7442
100 14718 100 14718 0 0 18765 0 --:--:-- --:--:-- --:--:-- 18765
curl: no URL specified!
curl: try 'curl --help' or 'curl --manual' for more information
The following is an excerpt of the curl file:
--create-dirs
--compressed
--location
--globoff
--fail
--fail-early
--continue-at -
--retry 2
--retry-delay 10
--write-out "Downloading to: %{filename_effective}\n\n"
# File '017bab09-21e9-4ca4-aa5b-fde7545b77b2', version '2022-02-22T18:41:19.452809Z' is currently not available in catalog 'lm2'.
# File '059ba534-61dd-46fa-aa3f-c4e2bb87c579', version '2022-03-03T19:03:25.282692Z' is currently not available in catalog 'lm2'.
# File '09720403-be2c-449a-8234-f9e6a871d785', version '2022-02-22T18:38:43.917466Z' is currently not available in catalog 'lm2'.
# File '0f60aa46-2715-40e3-8b69-66bc1b9e58dd', version '2022-03-03T19:02:08.127234Z' is currently not available in catalog 'lm2'.
… more unavailable files …
# File 'f4fecb67-2902-49fd-be31-2283bef1304d', version '2022-03-03T19:01:44.104344Z' is currently not available in catalog 'lm2'.
# File 'f8b4d49b-0eca-4681-aadf-04fc5b70e0f2', version '2022-03-03T19:03:03.008575Z' is currently not available in catalog 'lm2'.
# File 'f8ce0bfd-9e23-443d-8d66-31229abbdfa4', version '2022-02-22T18:41:32.165967Z' is currently not available in catalog 'lm2'.
# File 'fbc34d0f-8a19-453a-8d0f-2ecdd76396be', version '2022-03-03T19:01:52.073110Z' is currently not available in catalog 'lm2'.
# File 'fc7694e0-ca99-433e-b437-e466fec229f4', version '2022-03-03T19:02:26.742226Z' is currently not available in catalog 'lm2'.
# File 'fd125d76-4fa7-471c-9726-99484e9fc6b3', version '2022-03-03T19:02:50.882865Z' is currently not available in catalog 'lm2'.
# File 'fdf15d5f-5521-492b-b781-e2c887e3affd', version '2022-02-22T18:41:17.657769Z' is currently not available in catalog 'lm2'.
So far this is the expected behavior when downloading a curl manifest in which all files are unavailable. Please try again with a mix of files, some available, some not. The expected behavior is that the available files are downloaded.
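To tell in advance whether a curl manifest would download anything, one can count its url entries against its "not available" comments. The following is a minimal sketch that assumes the manifest has the layout shown in the excerpt above; the function name `summarize_curl_manifest` is hypothetical and not part of Azul.

```python
import re

# Matches the comment lines seen in the manifest excerpt for unavailable files.
UNAVAILABLE = re.compile(
    r"^# File '([^']+)', version '([^']+)' is currently not available"
)


def summarize_curl_manifest(text: str) -> dict:
    """Collect the URLs and the UUIDs of unavailable files in a curl config."""
    urls = []
    unavailable = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('url='):
            # Drop the option name and the surrounding double quotes
            urls.append(line[len('url='):].strip('"'))
        else:
            match = UNAVAILABLE.match(line)
            if match:
                unavailable.append(match.group(1))
    return {'urls': urls, 'unavailable': unavailable}


# Usage: a manifest with only comments has nothing for curl to fetch.
sample = (
    "--fail\n"
    "# File '017bab09-21e9-4ca4-aa5b-fde7545b77b2', "
    "version '2022-02-22T18:41:19.452809Z' "
    "is currently not available in catalog 'lm2'.\n"
)
summary = summarize_curl_manifest(sample)
if not summary['urls']:
    print('No URLs in manifest; curl would have nothing to download.')
```

A manifest whose summary has an empty `urls` list corresponds to the "curl: no URL specified!" case reported above.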
The following curl command was requested with the same project selected and the same filter on the (unavailable) fastq.gz file format, with the addition of the csv format. The csv files are available for download from the Files tab in the Data Browser:
curl --location --fail 'https://service.azul.data.humancellatlas.org/manifest/files?catalog=lm2&format=curl&filters=%7B%22projectId%22%3A+%7B%22is%22%3A+%5B%22f899709c-ae2c-4bb9-88f0-131142e6c7ec%22%5D%7D%2C+%22fileFormat%22%3A+%7B%22is%22%3A+%5B%22fastq.gz%22%2C+%22csv%22%5D%7D%2C+%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%7D&objectKey=manifests%2F20735f69-b4c5-5ac2-8b05-45a9dfc3105d.6b76de54-7552-5fa5-b96c-db21be0be76b.curlrc' | curl --config -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2917 100 2917 0 0 1938 0 0:00:01 0:00:01 --:--:-- 1949
100 10206 100 10206 0 0 5151 0 0:00:01 0:00:01 --:--:-- 3322k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0
100 3826k 100 3826k 0 0 691k 0 0:00:05 0:00:05 --:--:-- 5815k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 9548k 100 9548k 0 0 3421k 0 0:00:02 0:00:02 --:--:-- 15.4M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_Annotation.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 3282k 100 3282k 0 0 1311k 0 0:00:02 0:00:02 --:--:-- 1311k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_Annotation.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0
100 2976k 100 2976k 0 0 539k 0 0:00:05 0:00:05 --:--:-- 7189k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1025k 100 1025k 0 0 455k 0 0:00:02 0:00:02 --:--:-- 455k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1820k 100 1820k 0 0 866k 0 0:00:02 0:00:02 --:--:-- 866k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_Annotation.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1027k 100 1027k 0 0 463k 0 0:00:02 0:00:02 --:--:-- 11.9M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1205k 100 1205k 0 0 545k 0 0:00:02 0:00:02 --:--:-- 9881k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1063k 100 1063k 0 0 503k 0 0:00:02 0:00:02 --:--:-- 19.0M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_UMAP.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 2209k 100 2209k 0 0 993k 0 0:00:02 0:00:02 --:--:-- 993k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_Annotation.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 1858k 100 1858k 0 0 844k 0 0:00:02 0:00:02 --:--:-- 844k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_Annotation.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0
100 6893k 100 6893k 0 0 1203k 0 0:00:05 0:00:05 --:--:-- 15.6M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_Annotation.csv
$ tree
└── 31244706-60f8-3043-8db7-a39fd7081139
├── PeripCOPD_AllCells_Annotation.csv
├── PeripCOPD_AllCells_UMAP.csv
├── PeripCOPD_EpiCells_Annotation.csv
├── PeripCOPD_EpiCells_UMAP.csv
├── PeriphNorm_AllCells_Annotation.csv
├── PeriphNorm_AllCells_UMAP.csv
├── PeriphNorm_EpiCells_Annotation.csv
├── PeriphNorm_EpiCells_UMAP.csv
├── ProxhNorm_AllCells_Annotation.csv
├── ProxhNorm_AllCells_UMAP.csv
├── ProxhNorm_EpiCells_Annotation.csv
└── ProxhNorm_EpiCells_UMAP.csv
1 directory, 12 files
The available files (csv format) were downloaded as expected and the fastq.gz files were not. The behavior remains consistent with what is expected.
The following contains an excerpt of the contents of the curl file:
--create-dirs
--compressed
--location
--globoff
--fail
--fail-early
--continue-at -
--retry 2
--retry-delay 10
--write-out "Downloading to: %{filename_effective}\n\n"
# File '059ba534-61dd-46fa-aa3f-c4e2bb87c579', version '2022-03-03T19:03:25.282692Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/0ae73fa9-c5c5-45c6-aa83-ee3d57a4534c?catalog=lm2&version=2022-04-27T17%3A52%3A34.865533Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_UMAP.csv"
# File '0f60aa46-2715-40e3-8b69-66bc1b9e58dd', version '2022-03-03T19:02:08.127234Z' is currently not available in catalog 'lm2'.
… unavailable files …
# File '403c9e88-33a6-40cb-a833-95bec87b26c6', version '2022-03-03T19:01:41.120176Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/435736e6-d849-4199-b442-d40927a544a7?catalog=lm2&version=2022-04-27T17%3A52%3A34.876379Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_Annotation.csv"
# File '50f193f3-9d88-4d85-84c8-66f220eed41c', version '2022-03-03T19:02:49.854838Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/56b87b91-d9a8-497e-84fb-a305efc72a93?catalog=lm2&version=2022-04-27T17%3A52%3A34.822358Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_Annotation.csv"
url="https://service.azul.data.humancellatlas.org/repository/files/5765aa5f-dace-4704-adca-df6d4b032c44?catalog=lm2&version=2022-04-27T17%3A52%3A34.870286Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_UMAP.csv"
url="https://service.azul.data.humancellatlas.org/repository/files/5c04af32-8dbe-413d-977e-825b00da41af?catalog=lm2&version=2022-04-27T17%3A52%3A34.878291Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_UMAP.csv"
url="https://service.azul.data.humancellatlas.org/repository/files/5fbc93ac-015b-4d01-8b54-d248f018cb7d?catalog=lm2&version=2022-04-27T17%3A52%3A34.824764Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_Annotation.csv"
… unavailable files …
# File '6c022767-f781-427f-baa0-f577d86ac0be', version '2022-03-03T19:02:24.078730Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/6fbd0d84-d776-47b6-b736-f38127c9914d?catalog=lm2&version=2022-04-27T17%3A52%3A34.832603Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_UMAP.csv"
url="https://service.azul.data.humancellatlas.org/repository/files/77eb44f8-ff8e-494b-813c-4d154dcd9169?catalog=lm2&version=2022-04-27T17%3A52%3A34.857406Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_UMAP.csv"
… unavailable files …
url="https://service.azul.data.humancellatlas.org/repository/files/a9f39845-f123-4f36-a9ca-6c468bc1b1b0?catalog=lm2&version=2022-04-27T17%3A52%3A34.884856Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_UMAP.csv"
# File 'b221ab4e-915c-4296-96bb-48f6afd0e9f0', version '2022-03-03T19:03:05.014345Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/b244385b-68e4-4595-a519-04b621d57034?catalog=lm2&version=2022-04-27T17%3A52%3A34.843066Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_Annotation.csv"
# File 'b41d9042-4fa9-47cb-8bd0-8e5ff6ae5a6c', version '2022-03-03T19:02:22.144218Z' is currently not available in catalog 'lm2'.
… unavailable files …
# File 'dda021bd-f9e6-4dd4-8753-044238b86666', version '2022-03-03T19:03:28.259869Z' is currently not available in catalog 'lm2'.
url="https://service.azul.data.humancellatlas.org/repository/files/df3e567c-32cf-47a4-a167-84ca7668d7f1?catalog=lm2&version=2022-04-27T17%3A52%3A34.859522Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_Annotation.csv"
url="https://service.azul.data.humancellatlas.org/repository/files/e3d16047-7c37-4430-a1ce-7508696c6a98?catalog=lm2&version=2022-04-27T17%3A52%3A34.848979Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_Annotation.csv"
… unavailable files …
# File 'fd125d76-4fa7-471c-9726-99484e9fc6b3', version '2022-03-03T19:02:50.882865Z' is currently not available in catalog 'lm2'.
@hannes-ucsc to propose design to improve the user experience.
Azul will add a facet called isPhantom, backed by a corresponding field in aggregate documents at contents.files.is_phantom, that is True if the contents.files.drs_path field is None and False otherwise. I don't think we need a corresponding field in contribution documents because we should be able to dynamically compute the aggregation input in the aggregator's _transform_entity method.
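The derivation of the field can be sketched as follows. This is a minimal illustration, assuming each file entry is a dictionary with a `drs_path` key; the helpers `is_phantom` and `with_is_phantom` are hypothetical, and the real logic would live in the aggregator's `_transform_entity` method, whose signature is not shown in this thread.

```python
from typing import Any, Dict, List, Optional


def is_phantom(drs_path: Optional[str]) -> bool:
    """A file is a phantom if and only if it has no DRS path."""
    return drs_path is None


def with_is_phantom(files: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the file entries with `is_phantom` derived from
    `drs_path`. A missing `drs_path` key is treated like None (assumption)."""
    return [
        {**file, 'is_phantom': is_phantom(file.get('drs_path'))}
        for file in files
    ]
```

Because the value is computed from a field that already exists, nothing new would need to be stored in the contribution documents.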
In service responses, the field should be represented under hits but also, and this is the point, under termFacets. This way the Data Browser can easily determine if a particular filter combination yields phantom files, i.e., files lacking a DRS URI, and warn the user if they are composing a curl manifest that includes phantom files. Likewise, the Data Browser should prevent the user from composing a curl manifest that consists solely of phantom files, since that manifest would download nothing.
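The decision the Data Browser would make from the facet can be sketched as below. This only illustrates the logic and is not the Data Browser's code. It assumes that termFacets carries an isPhantom entry with a list of terms, each having a `term` and a `count`, and that the term may arrive either as a boolean or as the string 'true'. The names `CurlManifestAdvice` and `advise` are hypothetical.

```python
from enum import Enum
from typing import Any, Dict


class CurlManifestAdvice(Enum):
    OK = 'ok'        # no phantom files match the current filters
    WARN = 'warn'    # some, but not all, matching files are phantoms
    BLOCK = 'block'  # every matching file is a phantom


def _is_true(term: Any) -> bool:
    # Assumption: the term is either a boolean or a string such as 'true'
    return term is True or (isinstance(term, str) and term.lower() == 'true')


def advise(term_facets: Dict[str, Any]) -> CurlManifestAdvice:
    """Decide whether to allow, warn about, or block a curl manifest."""
    terms = term_facets.get('isPhantom', {}).get('terms', [])
    phantom = sum(t['count'] for t in terms if _is_true(t['term']))
    regular = sum(t['count'] for t in terms if not _is_true(t['term']))
    if phantom == 0:
        return CurlManifestAdvice.OK
    return CurlManifestAdvice.WARN if regular > 0 else CurlManifestAdvice.BLOCK
```

With this split, the first request in this issue (only fastq.gz files) would map to BLOCK, and the second (fastq.gz plus csv) to WARN.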
Data Browser blockee is here: https://github.com/DataBiosphere/data-browser/issues/3032
Discovered by a LungMAP user. Files with null DRS URIs are not downloadable from the Files tab in the Data Browser but do appear in the file manifest for bulk file download via curl.