DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Add `isPhantom` facet #4851

Open theathorn opened 1 year ago

theathorn commented 1 year ago

Discovered by a LungMAP user. Files with null DRS URIs are not downloadable from the Files tab in the Data Browser but do appear in the file manifest for bulk file download via curl.

theathorn commented 1 year ago

Spike to reproduce. This project.

achave11-ucsc commented 1 year ago

It seems that files lacking a DRS URI (here, the fastq.gz files from the specified project), which have no download link in the Files tab of the Data Browser, cannot be downloaded via curl either. The following curl command requests those files and fails:

curl --location --fail 'https://service.azul.data.humancellatlas.org/manifest/files?catalog=lm2&format=curl&filters=%7B%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%2C+%22fileFormat%22%3A+%7B%22is%22%3A+%5B%22fastq.gz%22%5D%7D%7D&objectKey=manifests%2F1ac72f75-b538-5e6c-a3d2-ac205a3adaab.6b76de54-7552-5fa5-b96c-db21be0be76b.curlrc' | curl --config -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2977  100  2977    0     0   7340      0 --:--:-- --:--:-- --:--:--  7442
100 14718  100 14718    0     0  18765      0 --:--:-- --:--:-- --:--:-- 18765
curl: no URL specified!
curl: try 'curl --help' or 'curl --manual' for more information
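To make the request above easier to read, the `filters` query parameter can be decoded back into JSON. This is a minimal sketch using only the Python standard library; the encoded string is copied from the command above, and the helper name `decode_filters` is made up for illustration.

```python
import json
from urllib.parse import unquote_plus

# Value of the `filters` query parameter, copied from the curl command above.
ENCODED_FILTERS = (
    "%7B%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%2C+"
    "%22fileFormat%22%3A+%7B%22is%22%3A+%5B%22fastq.gz%22%5D%7D%7D"
)


def decode_filters(encoded: str) -> dict:
    """Percent-decode a `filters` value ('+' means space) and parse it as JSON."""
    return json.loads(unquote_plus(encoded))


filters = decode_filters(ENCODED_FILTERS)
print(json.dumps(filters, indent=2))
```

The decoded value shows that this request filters on `genusSpecies` and `fileFormat` only.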

The following is an excerpt of the curl configuration file returned by the manifest request:

--create-dirs

--compressed

--location

--globoff

--fail

--fail-early

--continue-at -

--retry 2

--retry-delay 10

--write-out "Downloading to: %{filename_effective}\n\n"

# File '017bab09-21e9-4ca4-aa5b-fde7545b77b2', version '2022-02-22T18:41:19.452809Z' is currently not available in catalog 'lm2'.

# File '059ba534-61dd-46fa-aa3f-c4e2bb87c579', version '2022-03-03T19:03:25.282692Z' is currently not available in catalog 'lm2'.

# File '09720403-be2c-449a-8234-f9e6a871d785', version '2022-02-22T18:38:43.917466Z' is currently not available in catalog 'lm2'.

# File '0f60aa46-2715-40e3-8b69-66bc1b9e58dd', version '2022-03-03T19:02:08.127234Z' is currently not available in catalog 'lm2'.

… more unavailable files …

# File 'f4fecb67-2902-49fd-be31-2283bef1304d', version '2022-03-03T19:01:44.104344Z' is currently not available in catalog 'lm2'.

# File 'f8b4d49b-0eca-4681-aadf-04fc5b70e0f2', version '2022-03-03T19:03:03.008575Z' is currently not available in catalog 'lm2'.

# File 'f8ce0bfd-9e23-443d-8d66-31229abbdfa4', version '2022-02-22T18:41:32.165967Z' is currently not available in catalog 'lm2'.

# File 'fbc34d0f-8a19-453a-8d0f-2ecdd76396be', version '2022-03-03T19:01:52.073110Z' is currently not available in catalog 'lm2'.

# File 'fc7694e0-ca99-433e-b437-e466fec229f4', version '2022-03-03T19:02:26.742226Z' is currently not available in catalog 'lm2'.

# File 'fd125d76-4fa7-471c-9726-99484e9fc6b3', version '2022-03-03T19:02:50.882865Z' is currently not available in catalog 'lm2'.

# File 'fdf15d5f-5521-492b-b781-e2c887e3affd', version '2022-02-22T18:41:17.657769Z' is currently not available in catalog 'lm2'.

hannes-ucsc commented 1 year ago

So far this is the expected behavior when downloading a curl manifest in which all files are unavailable. Please try again with a mix of files, some available, some not. The expected behavior is that the available files are downloaded.
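The "no URL specified" outcome can be checked before running curl by counting the `url=` entries in the downloaded configuration. This is a rough sketch, assuming the configuration has the shape shown in the excerpt above (one `url=` line per available file, one `# File … is currently not available` comment per unavailable file). The function name and the `example.org` URL are placeholders, not part of Azul.

```python
import re

# Matches the comment Azul emits for an unavailable file, as seen in the excerpt.
UNAVAILABLE = re.compile(
    r"^# File '[^']+', version '[^']+' is currently not available"
)
# Matches a `url=...` (or `--url ...`) entry in a curl config file.
URL_ENTRY = re.compile(r"^(--)?url\b")


def summarize_curl_config(text: str) -> dict:
    """Count downloadable entries and unavailable-file comments."""
    urls = 0
    unavailable = 0
    for line in text.splitlines():
        line = line.strip()
        if URL_ENTRY.match(line):
            urls += 1
        elif UNAVAILABLE.match(line):
            unavailable += 1
    return {"urls": urls, "unavailable": unavailable}


ALL_UNAVAILABLE = """\
--create-dirs
--fail
# File '017bab09-21e9-4ca4-aa5b-fde7545b77b2', version '2022-02-22T18:41:19.452809Z' is currently not available in catalog 'lm2'.
# File '059ba534-61dd-46fa-aa3f-c4e2bb87c579', version '2022-03-03T19:03:25.282692Z' is currently not available in catalog 'lm2'.
"""

# The URL and output path below are hypothetical placeholders.
MIXED = ALL_UNAVAILABLE + 'url="https://example.org/some-file"\noutput="dir/some-file.csv"\n'

print(summarize_curl_config(ALL_UNAVAILABLE))
print(summarize_curl_config(MIXED))
```

A result with zero `urls` corresponds to the all-unavailable case, in which curl has nothing to fetch.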

achave11-ucsc commented 1 year ago

The following curl command was generated with the same project selected and the same filter on the fastq.gz (unavailable) file format, plus the csv file format. The csv files are available for download from the Files tab in the Data Browser:

curl --location --fail 'https://service.azul.data.humancellatlas.org/manifest/files?catalog=lm2&format=curl&filters=%7B%22projectId%22%3A+%7B%22is%22%3A+%5B%22f899709c-ae2c-4bb9-88f0-131142e6c7ec%22%5D%7D%2C+%22fileFormat%22%3A+%7B%22is%22%3A+%5B%22fastq.gz%22%2C+%22csv%22%5D%7D%2C+%22genusSpecies%22%3A+%7B%22is%22%3A+%5B%22Homo+sapiens%22%5D%7D%7D&objectKey=manifests%2F20735f69-b4c5-5ac2-8b05-45a9dfc3105d.6b76de54-7552-5fa5-b96c-db21be0be76b.curlrc' | curl --config -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2917  100  2917    0     0   1938      0  0:00:01  0:00:01 --:--:--  1949
100 10206  100 10206    0     0   5151      0  0:00:01  0:00:01 --:--:-- 3322k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
100 3826k  100 3826k    0     0   691k      0  0:00:05  0:00:05 --:--:-- 5815k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 9548k  100 9548k    0     0  3421k      0  0:00:02  0:00:02 --:--:-- 15.4M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_Annotation.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 3282k  100 3282k    0     0  1311k      0  0:00:02  0:00:02 --:--:-- 1311k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_Annotation.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
100 2976k  100 2976k    0     0   539k      0  0:00:05  0:00:05 --:--:-- 7189k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1025k  100 1025k    0     0   455k      0  0:00:02  0:00:02 --:--:--  455k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1820k  100 1820k    0     0   866k      0  0:00:02  0:00:02 --:--:--  866k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_Annotation.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1027k  100 1027k    0     0   463k      0  0:00:02  0:00:02 --:--:-- 11.9M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1205k  100 1205k    0     0   545k      0  0:00:02  0:00:02 --:--:-- 9881k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1063k  100 1063k    0     0   503k      0  0:00:02  0:00:02 --:--:-- 19.0M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_UMAP.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 2209k  100 2209k    0     0   993k      0  0:00:02  0:00:02 --:--:--  993k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_Annotation.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1858k  100 1858k    0     0   844k      0  0:00:02  0:00:02 --:--:--  844k
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_Annotation.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
100 6893k  100 6893k    0     0  1203k      0  0:00:05  0:00:05 --:--:-- 15.6M
Downloading to: 31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_Annotation.csv

$ tree
└── 31244706-60f8-3043-8db7-a39fd7081139
    ├── PeripCOPD_AllCells_Annotation.csv
    ├── PeripCOPD_AllCells_UMAP.csv
    ├── PeripCOPD_EpiCells_Annotation.csv
    ├── PeripCOPD_EpiCells_UMAP.csv
    ├── PeriphNorm_AllCells_Annotation.csv
    ├── PeriphNorm_AllCells_UMAP.csv
    ├── PeriphNorm_EpiCells_Annotation.csv
    ├── PeriphNorm_EpiCells_UMAP.csv
    ├── ProxhNorm_AllCells_Annotation.csv
    ├── ProxhNorm_AllCells_UMAP.csv
    ├── ProxhNorm_EpiCells_Annotation.csv
    └── ProxhNorm_EpiCells_UMAP.csv

1 directory, 12 files

The available files (csv format) were downloaded as expected and the fastq.gz files were not. This matches the expected behavior described above.

The following is an excerpt of the curl configuration file:

--create-dirs

--compressed

--location

--globoff

--fail

--fail-early

--continue-at -

--retry 2

--retry-delay 10

--write-out "Downloading to: %{filename_effective}\n\n"

# File '059ba534-61dd-46fa-aa3f-c4e2bb87c579', version '2022-03-03T19:03:25.282692Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/0ae73fa9-c5c5-45c6-aa83-ee3d57a4534c?catalog=lm2&version=2022-04-27T17%3A52%3A34.865533Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_UMAP.csv"

# File '0f60aa46-2715-40e3-8b69-66bc1b9e58dd', version '2022-03-03T19:02:08.127234Z' is currently not available in catalog 'lm2'.

… unavailable files …

# File '403c9e88-33a6-40cb-a833-95bec87b26c6', version '2022-03-03T19:01:41.120176Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/435736e6-d849-4199-b442-d40927a544a7?catalog=lm2&version=2022-04-27T17%3A52%3A34.876379Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_Annotation.csv"

# File '50f193f3-9d88-4d85-84c8-66f220eed41c', version '2022-03-03T19:02:49.854838Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/56b87b91-d9a8-497e-84fb-a305efc72a93?catalog=lm2&version=2022-04-27T17%3A52%3A34.822358Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_Annotation.csv"

url="https://service.azul.data.humancellatlas.org/repository/files/5765aa5f-dace-4704-adca-df6d4b032c44?catalog=lm2&version=2022-04-27T17%3A52%3A34.870286Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_AllCells_UMAP.csv"

url="https://service.azul.data.humancellatlas.org/repository/files/5c04af32-8dbe-413d-977e-825b00da41af?catalog=lm2&version=2022-04-27T17%3A52%3A34.878291Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_UMAP.csv"

url="https://service.azul.data.humancellatlas.org/repository/files/5fbc93ac-015b-4d01-8b54-d248f018cb7d?catalog=lm2&version=2022-04-27T17%3A52%3A34.824764Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_Annotation.csv"

… unavailable files …

# File '6c022767-f781-427f-baa0-f577d86ac0be', version '2022-03-03T19:02:24.078730Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/6fbd0d84-d776-47b6-b736-f38127c9914d?catalog=lm2&version=2022-04-27T17%3A52%3A34.832603Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_EpiCells_UMAP.csv"

url="https://service.azul.data.humancellatlas.org/repository/files/77eb44f8-ff8e-494b-813c-4d154dcd9169?catalog=lm2&version=2022-04-27T17%3A52%3A34.857406Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_UMAP.csv"

… unavailable files …

url="https://service.azul.data.humancellatlas.org/repository/files/a9f39845-f123-4f36-a9ca-6c468bc1b1b0?catalog=lm2&version=2022-04-27T17%3A52%3A34.884856Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeripCOPD_EpiCells_UMAP.csv"

# File 'b221ab4e-915c-4296-96bb-48f6afd0e9f0', version '2022-03-03T19:03:05.014345Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/b244385b-68e4-4595-a519-04b621d57034?catalog=lm2&version=2022-04-27T17%3A52%3A34.843066Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_AllCells_Annotation.csv"

# File 'b41d9042-4fa9-47cb-8bd0-8e5ff6ae5a6c', version '2022-03-03T19:02:22.144218Z' is currently not available in catalog 'lm2'.

… unavailable files …

# File 'dda021bd-f9e6-4dd4-8753-044238b86666', version '2022-03-03T19:03:28.259869Z' is currently not available in catalog 'lm2'.

url="https://service.azul.data.humancellatlas.org/repository/files/df3e567c-32cf-47a4-a167-84ca7668d7f1?catalog=lm2&version=2022-04-27T17%3A52%3A34.859522Z"
output="31244706-60f8-3043-8db7-a39fd7081139/ProxhNorm_EpiCells_Annotation.csv"

url="https://service.azul.data.humancellatlas.org/repository/files/e3d16047-7c37-4430-a1ce-7508696c6a98?catalog=lm2&version=2022-04-27T17%3A52%3A34.848979Z"
output="31244706-60f8-3043-8db7-a39fd7081139/PeriphNorm_AllCells_Annotation.csv"

… unavailable files …

# File 'fd125d76-4fa7-471c-9726-99484e9fc6b3', version '2022-03-03T19:02:50.882865Z' is currently not available in catalog 'lm2'.

theathorn commented 1 year ago

@hannes-ucsc to propose design to improve the user experience.

hannes-ucsc commented 1 year ago

Azul will add a facet called isPhantom, backed by a corresponding field in aggregate documents at contents.files.is_phantom. The field is True if contents.files.drs_path is None and False otherwise. I don't think we need a corresponding field in contribution documents because we should be able to compute the aggregation input dynamically in the aggregator's _transform_entity method.
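The rule for the new field can be sketched as below. This is a standalone illustration, not Azul's aggregator code: the dictionary shape and the helper names `is_phantom` and `with_is_phantom` are assumptions, and only the `drs_path` and `is_phantom` field names come from the proposal.

```python
from typing import Any, Dict, Mapping


def is_phantom(file: Mapping[str, Any]) -> bool:
    """A file is a phantom when it has no DRS path (missing or None)."""
    return file.get("drs_path") is None


def with_is_phantom(file: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the file entry with `is_phantom` computed on the fly."""
    return {**file, "is_phantom": is_phantom(file)}


# Hypothetical file entries, for illustration only.
files = [
    {"name": "reads.fastq.gz", "drs_path": None},
    {"name": "cells.csv", "drs_path": "v1_some_drs_path"},
]
for f in files:
    print(with_is_phantom(f))
```

Because the value is derived purely from `drs_path`, nothing extra has to be stored at contribution time.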

In service responses, the field should be represented under hits but also, and this is the point, under termFacets. This way the Data Browser can easily determine whether a particular filter combination yields phantom files, i.e., files lacking a DRS URI, and warn the user if they are composing a curl manifest that includes phantom files. Likewise, the Data Browser should prevent the user from composing a curl manifest that consists solely of phantom files, since that manifest would download nothing.
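The warn-or-block decision described here might look roughly like the sketch below. It assumes a `termFacets` entry of the form `{"isPhantom": {"terms": [{"term": ..., "count": ...}]}}` and that the term values read as true/false once stringified; both the response shape and the names `ManifestAdvice`, `phantom_counts` and `advise` are assumptions for illustration, and the Data Browser itself is not written in Python.

```python
from enum import Enum
from typing import Any, Mapping, Tuple


class ManifestAdvice(Enum):
    ALLOW = "allow"  # no phantom files match the current filters
    WARN = "warn"    # some matching files are phantoms
    BLOCK = "block"  # every matching file is a phantom


def phantom_counts(term_facets: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (phantom, downloadable) file counts from the assumed facet shape."""
    phantom = downloadable = 0
    for entry in term_facets.get("isPhantom", {}).get("terms", []):
        term = str(entry.get("term")).lower()
        count = entry.get("count", 0)
        if term == "true":
            phantom += count
        elif term == "false":
            downloadable += count
    return phantom, downloadable


def advise(term_facets: Mapping[str, Any]) -> ManifestAdvice:
    phantom, downloadable = phantom_counts(term_facets)
    if phantom == 0:
        return ManifestAdvice.ALLOW
    if downloadable == 0:
        return ManifestAdvice.BLOCK
    return ManifestAdvice.WARN


# Hypothetical facet payloads, for illustration only.
mixed = {"isPhantom": {"terms": [{"term": "true", "count": 30},
                                 {"term": "false", "count": 12}]}}
only_phantoms = {"isPhantom": {"terms": [{"term": "true", "count": 30}]}}
print(advise(mixed), advise(only_phantoms))
```

With counts available per filter combination, the same check covers both the warning and the all-phantom case without fetching individual hits.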

hannes-ucsc commented 1 year ago

Data Browser blockee is here: https://github.com/DataBiosphere/data-browser/issues/3032