Open YubinXie opened 1 month ago
Thanks for reporting, I can reproduce this. Code for that:
import cellxgene_census
prev_census = cellxgene_census.open_soma(census_version="2023-05-15")
curr_census = cellxgene_census.open_soma()
prev_obs = cellxgene_census.get_obs(prev_census, "homo_sapiens")
curr_obs = cellxgene_census.get_obs(curr_census, "homo_sapiens")
cancer_cell_types = [
    "malignant ovarian serous tumor",
    "glioblastoma",
    "lung adenocarcinoma",
    "squamous cell lung carcinoma",
    "small cell lung carcinoma",
    "non-small cell lung carcinoma",
    "B-cell non-Hodgkin lymphoma",
    "follicular lymphoma",
    "gastric cancer",
    "blastoma",
    "pilocytic astrocytoma",
    "acute myeloid leukemia",
    "tubular adenoma",
    "clear cell renal carcinoma",
    "adenocarcinoma",
    "tubulovillous adenoma",
    "colorectal cancer",
    "Wilms tumor",
    "acute promyelocytic leukemia",
    "neuroendocrine carcinoma",
    "chromophobe renal cell carcinoma",
]
# More cells with these labels are in the old release than the current one
assert curr_obs["disease"].isin(cancer_cell_types).sum() < prev_obs["disease"].isin(cancer_cell_types).sum()
curr_cells = curr_obs[curr_obs["disease"].isin(cancer_cell_types)]
prev_cells = prev_obs[prev_obs["disease"].isin(cancer_cell_types)]
display(curr_cells["disease"].value_counts().head(20))
display(prev_cells["disease"].value_counts().head(20))
disease
glioblastoma                        1477830
lung adenocarcinoma                 1422977
squamous cell lung carcinoma         303260
small cell lung carcinoma            246800
non-small cell lung carcinoma        241592
clear cell renal carcinoma           187792
follicular lymphoma                  122702
gastric cancer                       116329
B-cell non-Hodgkin lymphoma           59746
blastoma                              57445
pilocytic astrocytoma                 34291
acute myeloid leukemia                27852
tubular adenoma                       27270
adenocarcinoma                        11483
tubulovillous adenoma                  7216
colorectal cancer                      6215
Wilms tumor                            4636
acute promyelocytic leukemia           3734
neuroendocrine carcinoma               2623
chromophobe renal cell carcinoma       2576
Name: count, dtype: int64
disease
malignant ovarian serous tumor      1553549
glioblastoma                        1477830
lung adenocarcinoma                 1422977
squamous cell lung carcinoma         303260
small cell lung carcinoma            244903
non-small cell lung carcinoma        241592
B-cell non-Hodgkin lymphoma          133405
follicular lymphoma                  122702
gastric cancer                       116329
blastoma                              57445
pilocytic astrocytoma                 34291
acute myeloid leukemia                27852
tubular adenoma                       27248
clear cell renal carcinoma            20509
adenocarcinoma                        11483
tubulovillous adenoma                  7219
colorectal cancer                      6215
Wilms tumor                            4636
acute promyelocytic leukemia           3734
neuroendocrine carcinoma               2623
Name: count, dtype: int64
We can see that "malignant ovarian serous tumor" doesn't show up in the current census at all, and that in the previous release it comes from a small set of datasets:
is_ovarian = prev_cells["disease"] == "malignant ovarian serous tumor"
most_datasets = prev_cells[is_ovarian]["dataset_id"].value_counts()
display(most_datasets)
dataset_id
b252b015-b488-4d5c-b16e-968c13e48a2c    929690
e3a7e927-2632-4575-993d-d0905cd5da8b    221315
44c93f2b-dd66-4d15-81ef-de9394c76290    211624
0caedec7-1c7d-4e79-aba2-50f6916e643f    166895
97d9238c-1a39-4873-b0bb-963ec2d788e6     24025
Name: count, dtype: int64
None of these datasets show up in the latest LTS:
curr_obs["dataset_id"].isin(most_datasets.index).sum()
0
And they all seem to be in our dataset blocklist. I am not sure if this accounts for all the missing cells here, but seems to account for a lot.
@jahilton, do you know why these datasets are on the blocklist currently? Do we expect them to come off the blocklist?
Some samples in that Collection were found to be duplicated within the Datasets so all of the Datasets were removed from Census while the contributors investigate and possibly redo some of their analysis.
Previous issue for this:
@YubinXie, in the meantime, you can get the data for this dataset from the data portal here: https://cellxgene.cziscience.com/collections/4796c91c-9d8f-4692-be43-347b1727f9d8
Hi @ivirshup @jahilton
Thanks for checking. The context is helpful (and good to know that dataset is from our MSK...).
So I checked the malignant ovarian serous tumor datasets in cellxgene and, as you can see in the picture below, they match the datasets @ivirshup found among the missing studies. Based on my understanding, the "all cell" dataset includes all the other ones, and that could be where the duplicates come from? If you just remove the other 3 and use the "all cell" one, it should be fine? Let me know if that makes sense.
My understanding is that there are ~4k cells in the "All cells" dataset which we suspect may have been included in that dataset twice. That match is based on the expression profiles being the same, but I'm also unfamiliar with the specifics of this check.
@ivirshup got it.
The 2023-05-15 release has 1,553,549 malignant ovarian serous tumor cells, which also seems to include duplicates across the datasets, since the "all cell" one only has 0.9M cells?
I can provide more specifics that might help explain the duplication.
All of the Datasets from this Collection were initially included in Census. The "All cells" Dataset contains every cell in the other Datasets, plus some additional cells. This type of cross-Dataset duplication is handled by annotating is_primary_data. We aim to have every cell in the corpus annotated as is_primary_data:True exactly once. So the "All cells" Dataset is all True while the other Datasets are all False.
(So if you were to query all of the "malignant ovarian serous tumor" cells in Census without also filtering for is_primary_data:True, you would get duplicated cells.)
Our QA detected that 2443 cells in the "All cells" Dataset were included twice each: one instance assigned to the author_sample_id value SPECTRUM-OV-007_S1_CD45N_RIGHT_ADNEXA and another to SPECTRUM-OV-007_S1_CD45N_LEFT_ADNEXA. Out of an abundance of caution, we blocked all of the Datasets from Census until the root cause is identified by the contributor, and we hope to revise the Datasets soon.
This is super helpful @jahilton. I want to check one more thing. The 2023-05-15 release has 1.5M ovarian cancer cells while the "all cell" ovarian dataset has 0.9M. Does this mean there are 0.6M duplicates in the 2023-05-15 release?
That is correct.
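The 0.6M figure can be checked against the per-dataset counts posted earlier in this thread. The snippet below is only a back-of-the-envelope sketch: it assumes the largest dataset (b252b015-...) is the "All cells" one, which matches the 0.9M mentioned above but is not stated explicitly anywhere in the thread.

```python
# Per-dataset counts of "malignant ovarian serous tumor" cells in the
# 2023-05-15 release, copied from the value_counts() output in this thread.
cells_per_dataset = {
    "b252b015-b488-4d5c-b16e-968c13e48a2c": 929_690,
    "e3a7e927-2632-4575-993d-d0905cd5da8b": 221_315,
    "44c93f2b-dd66-4d15-81ef-de9394c76290": 211_624,
    "0caedec7-1c7d-4e79-aba2-50f6916e643f": 166_895,
    "97d9238c-1a39-4873-b0bb-963ec2d788e6": 24_025,
}

# Total obs rows carrying this disease label in the old release.
total_rows = sum(cells_per_dataset.values())

# Assumption: the largest dataset is the "All cells" one.
all_cells_rows = max(cells_per_dataset.values())

# Rows contributed by the subset datasets, i.e. cross-dataset duplicates.
duplicate_rows = total_rows - all_cells_rows

print(total_rows)      # 1553549
print(duplicate_rows)  # 623859
```

The total matches the 1,553,549 reported for the 2023-05-15 release, and the difference of 623,859 is the ~0.6M in question. This does not include the 2443 cells duplicated inside the "All cells" Dataset itself, which is a separate problem.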
There are many duplicates in each Census version, due to cases like this where there is an "all cells" Dataset plus subset Datasets, and also when the same samples appear in different submissions (e.g. integration studies).
Filtering for is_primary_data:True will de-dup your results.
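For anyone landing here later, a minimal sketch of what that filter could look like when querying obs. `build_value_filter` and `fetch_obs` are hypothetical helper names made up for this example, and it assumes that `get_obs` accepts `value_filter` and `column_names` keyword arguments, so please check against the version of cellxgene_census you have installed.

```python
def build_value_filter(diseases, primary_only=True):
    """Build a value_filter string for the given disease labels.

    Hypothetical helper; assumes no label contains a single quote.
    """
    quoted = ", ".join(f"'{d}'" for d in diseases)
    clause = f"disease in [{quoted}]"
    if primary_only:
        # Keep only the one copy of each cell flagged as primary.
        clause += " and is_primary_data == True"
    return clause


def fetch_obs(diseases, census_version="2023-05-15", primary_only=True):
    """Fetch obs rows for the given diseases from one Census release."""
    # Imported lazily so the filter builder works without the package;
    # the query itself needs cellxgene_census and network access.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_obs(
            census,
            "homo_sapiens",
            value_filter=build_value_filter(diseases, primary_only),
            column_names=["dataset_id", "disease", "is_primary_data"],
        )


print(build_value_filter(["glioblastoma", "lung adenocarcinoma"]))
```

Calling `fetch_obs(cancer_cell_types)` with the list from earlier in the thread should then return de-duplicated rows, while `primary_only=False` would reproduce the raw counts shown above.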
This is great to know. Thanks @jahilton!
Some additional findings to share:
I find using is_primary_data:True super helpful. After the filtering, I found some datasets with as few as 36 cells. I looked into it and it turns out that one is from my own paper (our SCLC paper, though I did not process the scRNA data). It seems there are some cells in the immune dataset that are not included in the "all cell" dataset.
It is not a big issue but I want to share it here. I did not realize the is_primary_data:True tag is at the cell level instead of the dataset level.
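A toy illustration of the cell-level point, in case it helps others. The dataset IDs and rows below are made up for the example and are not real Census data; the only thing taken from this thread is that is_primary_data is set per cell.

```python
from collections import Counter

# Toy obs rows (hypothetical IDs and values, not real Census data).
obs_rows = [
    {"dataset_id": "all_cells", "is_primary_data": True},
    {"dataset_id": "all_cells", "is_primary_data": True},
    {"dataset_id": "all_cells", "is_primary_data": True},
    {"dataset_id": "immune", "is_primary_data": False},
    {"dataset_id": "immune", "is_primary_data": False},
    # A cell that is missing from "all_cells", so it is primary here.
    {"dataset_id": "immune", "is_primary_data": True},
]

# Count all rows and primary-only rows per dataset.
total = Counter(row["dataset_id"] for row in obs_rows)
primary = Counter(
    row["dataset_id"] for row in obs_rows if row["is_primary_data"]
)

for dataset_id in total:
    print(dataset_id, primary[dataset_id], "of", total[dataset_id], "rows are primary")
```

On a real obs DataFrame, something like `obs.groupby("dataset_id")["is_primary_data"].agg(["sum", "size"])` should give the same per-dataset breakdown, and a subset dataset with a small non-zero "sum" is the situation described above.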
Hi cellxgene-census team, I am using cellxgene to train a foundation model, and I used scGPT and the cellxgene census tool to download scRNA data. When I set the release date to "2023-05-15", I get 5M+ cells with the following cancer types. But when I set it to "latest", it only returns 4M+ cells. Is this a bug? I don't see any release note indicating a cancer name change or the removal of datasets. It would be great to know what happened and which release is better for pan-cancer model training. Thank you.