chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Can't find dataset in API with dataset_id from browser #1175

Closed mengerj closed 4 weeks ago

mengerj commented 1 month ago

Hi,

I am trying to access specific datasets through the API. For a collection of interest, I take the collection_id from the URL and send a request, as shown in the documentation, to get the dataset_ids in that collection. I want to load this data and have tried the experiment datapipe as well as the gget approach. This generally works, but not for the dataset_ids I find in the browser, only for those that are present in census["census_data"]["homo_sapiens"].obs.read(column_names=["dataset_id"]).concat().to_pandas(). Are some datasets shown in the browser not accessible through the API, or am I doing something wrong? I have included a link to the script I am using to try to access the data.

https://github.com/mengerj/issues/blob/e4e2d523ca858a6277685b536aa649a596a575fb/cellxgene_datasetid_issue.ipynb

Many Thanks!

MaximilianLombardo commented 1 month ago

Hey @mengerj, only data that meet a certain set of criteria from the complete data corpus are included in the census and accessible via the API. You can find those criteria here

Let us know if the dataset you are trying to programmatically access meets all those criteria but is not accessible via the API.
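A quick way to see which browser-derived IDs are actually present in Census is to compare them against the Census datasets table. The following is a minimal sketch, not an official recipe: the helper names `split_by_census_membership` and `census_dataset_ids` are hypothetical, and the second function needs network access and the cellxgene_census package.

```python
def split_by_census_membership(candidate_ids, census_ids):
    """Partition candidate dataset_ids into (present, missing) relative to Census."""
    known = set(census_ids)
    present = [d for d in candidate_ids if d in known]
    missing = [d for d in candidate_ids if d not in known]
    return present, missing


def census_dataset_ids(census_version="stable"):
    """Return every dataset_id in the Census datasets table (needs network access)."""
    # Imported lazily so the pure helper above has no third-party dependencies.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        table = (
            census["census_info"]["datasets"]
            .read(column_names=["dataset_id"])
            .concat()  # pyarrow.Table held in memory
        )
    return table.column("dataset_id").to_pylist()
```

Calling `split_by_census_membership(my_ids, census_dataset_ids())` would then show which IDs need a closer look against the inclusion criteria.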

pablo-gar commented 1 month ago

@mengerj

Max is right that some datasets are not in Census. However, in your example the dataset is indeed in Census.

The issue that you are facing is that there is no guarantee that the UUID you see in the URLs is the latest dataset_id for the dataset.

The URLs are meant to be immutable and permanent; however, our datasets get revised over time and their associated dataset_id changes.

I'd recommend always fetching the dataset_id directly from the Census dataset table: census["census_info"]["datasets"]
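As a rough illustration of that recommendation, the sketch below looks up the dataset_ids for a collection in the datasets table and uses them to load the matching cells. The function names `build_dataset_filter` and `load_collection` are hypothetical, and matching on an exact `collection_name` is an assumption; the attached notebook may do this differently.

```python
def build_dataset_filter(dataset_ids):
    """Build an obs value filter that selects cells from the given dataset_ids."""
    ids = sorted(set(dataset_ids))
    if not ids:
        raise ValueError("need at least one dataset_id")
    # repr() of a list of strings gives e.g. ['a', 'b'], which the filter syntax accepts.
    return f"dataset_id in {ids!r}"


def load_collection(collection_name, census_version="stable", organism="Homo sapiens"):
    """Load all cells of one collection as AnnData (needs network access)."""
    import cellxgene_census  # lazy import: only needed when actually querying

    with cellxgene_census.open_soma(census_version=census_version) as census:
        # The datasets table is small, so read it fully and filter in pandas.
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()
        ids = datasets.loc[datasets["collection_name"] == collection_name, "dataset_id"]
        return cellxgene_census.get_anndata(
            census,
            organism=organism,
            obs_value_filter=build_dataset_filter(ids),
        )
```

If the collection name does not match any row, `build_dataset_filter` raises a `ValueError` instead of silently issuing an empty query.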

I have attached an updated version of your notebook that shows you how to fetch your data of interest. As a side question, are you specifically training models with the Census data?

https://colab.research.google.com/drive/10hB986JjLnT0xAU0qa2LWq5iKXR-ZBUx?authuser=1

I'd also encourage you to ask more questions in our Slack channel, #cellxgene-census-users: https://czi.co/science-slack

mengerj commented 1 month ago

Thank you very much for taking the time to update the notebook. Using the collection name works great.

And yes, I do want to train models with the Census data.

pablo-gar commented 4 weeks ago

@mengerj I'm glad it's working for you. I wanted to clarify two points. First, dataset_id is also permanent; what is not permanent is the dataset_version_id. That was a mistake on my end. However, the UUID in the URL and the dataset_id are indeed not guaranteed to be the same.

Lastly, I realized there are two collections with almost identical names:

It appears you wanted to access the former, which was recently added to CELLxGENE. Please make sure to open the latest version of Census to have access to it; you can do so when calling open_soma:

census = cellxgene_census.open_soma(census_version="latest")
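When two collections have nearly identical names, listing every match together with its collection_id can help confirm which one a given Census version contains. This is only a sketch: `matching_collections` and `list_collections` are hypothetical helper names, and the case-insensitive substring match is an assumption about what is convenient here.

```python
def matching_collections(records, name_fragment):
    """Return {collection_id: collection_name} for names containing the fragment."""
    fragment = name_fragment.casefold()
    return {
        r["collection_id"]: r["collection_name"]
        for r in records
        if fragment in r["collection_name"].casefold()
    }


def list_collections(name_fragment, census_version="latest"):
    """Look up matching collections in the chosen Census version (needs network access)."""
    import cellxgene_census  # lazy import: only needed when actually querying

    with cellxgene_census.open_soma(census_version=census_version) as census:
        table = (
            census["census_info"]["datasets"]
            .read(column_names=["collection_id", "collection_name"])
            .concat()  # pyarrow.Table held in memory
        )
    # One dict per dataset row; duplicates collapse because the result is keyed by collection_id.
    return matching_collections(table.to_pylist(), name_fragment)
```

Running `list_collections` once with `census_version="stable"` and once with `"latest"` would show whether a recently added collection is only available in the newer build.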