chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

2 Questions for the LTS Census Data Releases #1101

Closed zhan8855 closed 2 months ago

zhan8855 commented 3 months ago

Hi, thank you for your awesome work! I would greatly appreciate it if you could answer the following questions:

  1. Why is the number of h5ad files in the LTS release not equal to the number of datasets? (For example, there are 1114 files in the 2023-12-15 release, but only 651 datasets are reported on the official website.)

  2. We found a list of datasets that appear in the 2023-05-15 release but no longer appear in the 2023-12-15 release. Why were they deleted? Is it possible that duplicated cells from these datasets will appear in later LTS releases?

0caedec7-1c7d-4e79-aba2-50f6916e643f.h5ad
1b699e04-1127-42ea-998b-011ace4a5b81.h5ad
30498543-4fdd-4f86-9e1b-05c1a1454a6a.h5ad
44c93f2b-dd66-4d15-81ef-de9394c76290.h5ad
6a270451-b4d9-43e0-aa89-e33aac1ac74b.h5ad
87ce26ed-e5d1-44b4-81cc-cc5b709a169f.h5ad
97d9238c-1a39-4873-b0bb-963ec2d788e6.h5ad
b252b015-b488-4d5c-b16e-968c13e48a2c.h5ad
b5191f01-f67d-44b8-bc8d-511a4ecd07bb.h5ad
d6dfdef1-406d-4efb-808c-3c5eddbfe0cb.h5ad
e3a7e927-2632-4575-993d-d0905cd5da8b.h5ad
e40c6272-af77-4a10-9385-62a398884f27.h5ad
e463dae9-3fc1-476d-870e-d98a04c56cd6.h5ad

Thank you very much in advance.

ebezzi commented 2 months ago

Hello @zhan8855,

  1. Historically, the Census h5ads directory on S3 contained every h5ad present in the CELLxGENE Discover database at build time, including those that did not pass the filter and were therefore not included in the Census build. This behavior was changed in January, and in recent versions the numbers match.
  2. Some of those datasets were pulled from the Census because some duplicate observations were erroneously flagged as is_primary_data. You can see the details in this issue. In addition, contributors can remove or revise datasets at any point, which in turn removes them from future Census LTS releases. We are still defining our data deletion policy for Census, but so far we consider every Census release immutable, so the removed datasets will remain in the older versions of Census.
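
For anyone who wants to check both points on their own copy of the files, here is a minimal Python sketch. It assumes the h5ad files are named `<dataset_id>.h5ad`, as in the list above. The helper names are made up for illustration and are not part of the Census API. The optional `load_census_dataset_ids` helper assumes the `cellxgene-census` package is installed and that the `census_info` / `datasets` table exposes a `dataset_id` column; it needs network access and is not called in the example.

```python
from typing import Iterable, Set


def dataset_id_from_h5ad(filename: str) -> str:
    """Drop any directory prefix and the .h5ad suffix: 'x/<id>.h5ad' -> '<id>'."""
    base = filename.rsplit("/", 1)[-1]
    return base[: -len(".h5ad")] if base.endswith(".h5ad") else base


def files_not_in_census(h5ad_files: Iterable[str], census_ids: Iterable[str]) -> Set[str]:
    """Dataset IDs that have an h5ad file but are absent from the Census build."""
    return {dataset_id_from_h5ad(f) for f in h5ad_files} - set(census_ids)


def removed_between(old_ids: Iterable[str], new_ids: Iterable[str]) -> Set[str]:
    """Dataset IDs present in the older release but missing from the newer one."""
    return set(old_ids) - set(new_ids)


def load_census_dataset_ids(census_version: str) -> Set[str]:
    """Optional helper (assumption: requires cellxgene-census and network access)."""
    import cellxgene_census  # imported lazily so the rest runs without the package

    with cellxgene_census.open_soma(census_version=census_version) as census:
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    return set(datasets["dataset_id"])


if __name__ == "__main__":
    # Toy inputs with made-up IDs, only to show the shape of the comparison.
    files = ["h5ads/aaaa.h5ad", "h5ads/bbbb.h5ad", "h5ads/cccc.h5ad"]
    in_census = {"aaaa", "cccc"}
    print(sorted(files_not_in_census(files, in_census)))
    print(sorted(removed_between({"aaaa", "bbbb"}, {"bbbb", "cccc"})))
```

With real data, the set returned by `load_census_dataset_ids("2023-12-15")` could be passed as `census_ids`, and the sets for two releases could be passed to `removed_between`.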

Let me know if you have any other questions.

pablo-gar commented 2 months ago

Closing. @zhan8855, please reopen the issue if we have not fully answered your inquiry.

zhan8855 commented 2 months ago

Sorry for the late reply. Thank you so much!!