chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

[builder] Error on reading manifest from Discover REST API #606

Closed bkmartinjr closed 1 year ago

bkmartinjr commented 1 year ago

We have recently started seeing a new error logged by the builder:

2023-07-05 22:27:59 369254 ERROR Dataset id 4ed927e9-c099-49af-b8ce-a2652d069333 has more than one H5AD asset - ignoring this dataset

This is triggered by the REST endpoint /curation/v1/datasets returning a duplicated assets record (the H5AD entry appears twice):


    "assets": [
      {
        "filesize": 1384188764,
        "filetype": "H5AD",
        "url": "https://datasets.cellxgene.cziscience.com/2cf76683-5eca-4bab-9d80-d0f9845e43af.h5ad"
      },
      {
        "filesize": 1384188764,
        "filetype": "H5AD",
        "url": "https://datasets.cellxgene.cziscience.com/2cf76683-5eca-4bab-9d80-d0f9845e43af.h5ad"
      },
      {
        "filesize": 1062808938,
        "filetype": "RDS",
        "url": "https://datasets.cellxgene.cziscience.com/2cf76683-5eca-4bab-9d80-d0f9845e43af.rds"
      }
    ],

The API semantics are unclear (a question has been raised with the Portal team about the meaning of the above). Depending on the answer, the builder should either:
* log an error and stop/exit, or
* log a warning and handle the case appropriately (it is very unlikely that dropping the dataset is the correct behavior).

Waiting to hear back from the API team on how to interpret this case.
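
In the meantime, a minimal sketch of how the duplication could be detected, to tell an exact duplicate record apart from two genuinely different H5AD files. The helper name `duplicate_assets` is hypothetical and not part of the builder; the only assumption about the API is the shape of the `assets` list shown above.

```python
from collections import Counter
from typing import Any


def duplicate_assets(assets: list[dict[str, Any]]) -> list[tuple[str, str, int]]:
    """Return the (filetype, url, filesize) triples that appear more than once.

    Hypothetical helper: it only relies on the three fields shown in the
    `assets` record above.
    """
    counts = Counter((a["filetype"], a["url"], a["filesize"]) for a in assets)
    return [key for key, n in counts.items() if n > 1]


# The assets record from this issue.
BASE = "https://datasets.cellxgene.cziscience.com/2cf76683-5eca-4bab-9d80-d0f9845e43af"
assets = [
    {"filesize": 1384188764, "filetype": "H5AD", "url": BASE + ".h5ad"},
    {"filesize": 1384188764, "filetype": "H5AD", "url": BASE + ".h5ad"},
    {"filesize": 1062808938, "filetype": "RDS", "url": BASE + ".rds"},
]

for filetype, url, filesize in duplicate_assets(assets):
    print(f"duplicate {filetype} asset: {url} ({filesize} bytes)")
```

For the record above this reports only the H5AD entry, since both copies agree on file type, URL, and size.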

CC: @danieljhegeman

danieljhegeman commented 1 year ago

The artifact rows are duplicated in the db. The same is true for another public Dataset, 8e10f1c4-8e98-41e5-b65f-8cd89a887122, as well as 4 more private Datasets. No leads on how this came to be. One is from before the redesign; 5 are from after.

bkmartinjr commented 1 year ago

Follow-up from @danieljhegeman indicates that this is a bug. The Census builder should error out if this occurs.
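
A rough sketch of what "error out" could look like, as opposed to the current log-and-ignore behavior. This is not the builder's actual code: `ManifestError`, `h5ad_url`, and the `dataset_id` key are hypothetical names used for illustration, and the record shape is assumed from the JSON earlier in this issue.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


class ManifestError(ValueError):
    """Hypothetical exception: the manifest from the REST API is malformed."""


def h5ad_url(dataset: dict[str, Any]) -> str:
    """Return the URL of the dataset's single H5AD asset.

    Raises ManifestError, instead of silently dropping the dataset, when
    the record does not contain exactly one H5AD asset.
    """
    h5ads = [a for a in dataset.get("assets", []) if a.get("filetype") == "H5AD"]
    if len(h5ads) != 1:
        # "dataset_id" is an assumed key name for the dataset identifier.
        msg = (
            f"Dataset id {dataset.get('dataset_id')} has {len(h5ads)} "
            "H5AD assets, expected exactly one"
        )
        logger.error(msg)
        raise ManifestError(msg)
    return h5ads[0]["url"]
```

Leaving the exception uncaught at the top level would stop the build on the first malformed record, which matches the "error out" behavior described above.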

Related to chanzuckerberg/single-cell-data-portal#5136