Closed mengerj closed 4 weeks ago
Hey @mengerj, only data that meet a certain set of criteria from the complete data corpus are included in the census and accessible via the API. You can find those criteria here
Let us know if the dataset you are trying to programmatically access meets all those criteria but is not accessible via the API.
@mengerj
Max is right that some datasets are not in Census. However, for your example, the dataset is indeed in Census.
The issue you are facing is that there is no guarantee that the UUID you see in the URLs is the latest dataset_id for the dataset.
The URLs are meant to be immutable and permanent; however, our datasets get revised over time and their associated dataset_id changes.
I'd recommend always fetching the dataset_id directly from the Census dataset table: census["census_info"]["datasets"]
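As a rough illustration of that recommendation, here is a minimal sketch that reads the dataset table and looks up dataset_id values by collection name. The helper names (`load_dataset_records`, `dataset_ids_for_collection`, `obs_filter_for`) are hypothetical, and the sketch assumes the table exposes `collection_name` and `dataset_id` columns; it is not taken from the attached notebook.

```python
from typing import Iterable, List, Mapping


def load_dataset_records(census_version: str = "stable") -> List[dict]:
    """Read the Census dataset table into a list of row dictionaries."""
    # Imported lazily so the pure-Python helpers below work without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        table = census["census_info"]["datasets"].read().concat().to_pandas()
    return table.to_dict("records")


def dataset_ids_for_collection(
    records: Iterable[Mapping], collection_name: str
) -> List[str]:
    """Return the dataset_id of every row whose collection_name matches exactly."""
    return [
        row["dataset_id"]
        for row in records
        if row["collection_name"] == collection_name
    ]


def obs_filter_for(dataset_ids: Iterable[str]) -> str:
    """Build a value filter string that selects cells from the given datasets."""
    return f"dataset_id in {list(dataset_ids)}"
```

The resulting filter string could then be passed as `obs_value_filter` to `cellxgene_census.get_anndata`, or as `value_filter` when reading `obs` directly.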
I have attached an updated version of your notebook that shows you how to fetch your data of interest. As a side question, are you specifically training models with the Census data?
https://colab.research.google.com/drive/10hB986JjLnT0xAU0qa2LWq5iKXR-ZBUx?authuser=1
I'd also encourage you to ask more questions in our Slack channel #cellxgene-census-users: https://czi.co/science-slack
Thank you very much for taking the time to update the notebook, using the collection name works great.
And yes, I do want to train models with the Census data.
@mengerj I'm glad it's working for you. I wanted to clarify two points. dataset_id is also permanent; what is not permanent is the dataset_version_id. That was an erratum on my end. However, the UUID in the URLs and the dataset_id are indeed not guaranteed to be the same.
Lastly, I realized there are two collections with almost identical names:
And it appears you wanted to access the former, which was recently added to CELLxGENE. Please make sure to open the latest version of Census to have access to it; you can do so when calling open_soma:
census = cellxgene_census.open_soma(census_version="latest")
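To tell near-identical collection names apart, one option is to list every collection whose name contains a keyword, together with its collection_id. The sketch below is only an illustration: `find_collections` is a hypothetical helper, and it assumes `records` holds the rows of census["census_info"]["datasets"] from a Census opened with census_version="latest", with `collection_name` and `collection_id` columns.

```python
from typing import Iterable, List, Mapping, Tuple


def find_collections(
    records: Iterable[Mapping], keyword: str
) -> List[Tuple[str, str]]:
    """Return sorted, de-duplicated (collection_name, collection_id) pairs
    whose name contains the keyword, ignoring case."""
    needle = keyword.casefold()
    matches = {
        (row["collection_name"], row["collection_id"])
        for row in records
        if needle in row["collection_name"].casefold()
    }
    return sorted(matches)


# Assumed way to obtain `records` (requires cellxgene_census and network access):
#   table = census["census_info"]["datasets"].read().concat().to_pandas()
#   records = table.to_dict("records")
```

Comparing the returned collection_id values makes it clear which of the similarly named collections a given dataset belongs to.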
Hi,
I am trying to access specific datasets through the API. For a collection of interest, I take the collection_id from the URL and use a request, as shown in the documentation, to get the dataset_ids from the collection. I want to load this data and tried the experiment datapipe as well as the gget approach. It generally works, but not for dataset_ids I find within the browser, only for those that are present in census["census_data"]["homo_sapiens"].obs.read(column_names=["dataset_id"]).concat().to_pandas(). Are not all datasets found in the browser accessible through the API, or am I doing something wrong? I included a link to the script I am using to try to access the data.
https://github.com/mengerj/issues/blob/e4e2d523ca858a6277685b536aa649a596a575fb/cellxgene_datasetid_issue.ipynb
Many Thanks!
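For reference, the membership check described above can be sketched as a small helper that splits candidate IDs into those found in Census and those missing. `split_by_census_membership` is a hypothetical name; `census_ids` is assumed to come from the dataset_id column of census["census_info"]["datasets"], which has one row per dataset and is much smaller than the per-cell obs table.

```python
from typing import Iterable, List, Tuple


def split_by_census_membership(
    candidate_ids: Iterable[str], census_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split candidate dataset IDs into (present, missing) relative to Census.

    The input order of candidate_ids is preserved in both output lists.
    """
    known = set(census_ids)
    present: List[str] = []
    missing: List[str] = []
    for dataset_id in candidate_ids:
        if dataset_id in known:
            present.append(dataset_id)
        else:
            missing.append(dataset_id)
    return present, missing
```

IDs that land in the missing list either belong to datasets that do not meet the Census inclusion criteria or are not the dataset_id that Census records for that dataset.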