Current stable data release has fewer cancer cells than "2023-05-15"

YubinXie commented 1 month ago

Hi cellxgene-census team, I am using cellxgene to train a foundation model, and I used scGPT and the cellxgene census tool to download scRNA data. When I set the release date to "2023-05-15", I get 5M+ cells with the following cancer types. But when I set it to "latest", it only returns 4M+ cells. Is this a bug? I don't see any release note indicating a cancer name change or the removal of datasets. It would be great to know what happened and which release is better for pan-cancer model training. Thank you.

malignant ovarian serous tumor
glioblastoma
lung adenocarcinoma
squamous cell lung carcinoma
small cell lung carcinoma
non-small cell lung carcinoma
B-cell non-Hodgkin lymphoma
follicular lymphoma
gastric cancer
blastoma
pilocytic astrocytoma
acute myeloid leukemia
tubular adenoma
clear cell renal carcinoma
adenocarcinoma
tubulovillous adenoma
colorectal cancer
Wilms tumor
acute promyelocytic leukemia
neuroendocrine carcinoma
chromophobe renal cell carcinoma

ivirshup commented 1 month ago

Thanks for reporting, I can reproduce this. Code for that:

import cellxgene_census
from IPython.display import display  # already available inside a notebook

prev_census = cellxgene_census.open_soma(census_version="2023-05-15")
curr_census = cellxgene_census.open_soma()

prev_obs = cellxgene_census.get_obs(prev_census, "homo_sapiens")
curr_obs = cellxgene_census.get_obs(curr_census, "homo_sapiens")

cancer_cell_types = [
    "malignant ovarian serous tumor",
    "glioblastoma",
    "lung adenocarcinoma",
    "squamous cell lung carcinoma",
    "small cell lung carcinoma",
    "non-small cell lung carcinoma",
    "B-cell non-Hodgkin lymphoma",
    "follicular lymphoma",
    "gastric cancer",
    "blastoma",
    "pilocytic astrocytoma",
    "acute myeloid leukemia",
    "tubular adenoma",
    "clear cell renal carcinoma",
    "adenocarcinoma",
    "tubulovillous adenoma",
    "colorectal cancer",
    "Wilms tumor",
    "acute promyelocytic leukemia",
    "neuroendocrine carcinoma",
    "chromophobe renal cell carcinoma",
]

# More cells with these labels are in the old release than the current one
assert curr_obs["disease"].isin(cancer_cell_types).sum() < prev_obs["disease"].isin(cancer_cell_types).sum()

curr_cells = curr_obs[curr_obs["disease"].isin(cancer_cell_types)]
prev_cells = prev_obs[prev_obs["disease"].isin(cancer_cell_types)]

display(curr_cells["disease"].value_counts().head(20))  # current release (first table below)
display(prev_cells["disease"].value_counts().head(20))  # 2023-05-15 release (second table below)

disease
glioblastoma                        1477830
lung adenocarcinoma                 1422977
squamous cell lung carcinoma         303260
small cell lung carcinoma            246800
non-small cell lung carcinoma        241592
clear cell renal carcinoma           187792
follicular lymphoma                  122702
gastric cancer                       116329
B-cell non-Hodgkin lymphoma           59746
blastoma                              57445
pilocytic astrocytoma                 34291
acute myeloid leukemia                27852
tubular adenoma                       27270
adenocarcinoma                        11483
tubulovillous adenoma                  7216
colorectal cancer                      6215
Wilms tumor                            4636
acute promyelocytic leukemia           3734
neuroendocrine carcinoma               2623
chromophobe renal cell carcinoma       2576
Name: count, dtype: int64

disease
malignant ovarian serous tumor    1553549
glioblastoma                      1477830
lung adenocarcinoma               1422977
squamous cell lung carcinoma       303260
small cell lung carcinoma          244903
non-small cell lung carcinoma      241592
B-cell non-Hodgkin lymphoma        133405
follicular lymphoma                122702
gastric cancer                     116329
blastoma                            57445
pilocytic astrocytoma               34291
acute myeloid leukemia              27852
tubular adenoma                     27248
clear cell renal carcinoma          20509
adenocarcinoma                      11483
tubulovillous adenoma                7219
colorectal cancer                    6215
Wilms tumor                          4636
acute promyelocytic leukemia         3734
neuroendocrine carcinoma             2623
Name: count, dtype: int64

We can see that "malignant ovarian serous tumor" doesn't seem to show up in the current census.

It also comes from only a small set of datasets in the previous release:

most_datasets = prev_cells[prev_cells["disease"] == "malignant ovarian serous tumor"][
    "dataset_id"
].value_counts()
display(most_datasets)

dataset_id
b252b015-b488-4d5c-b16e-968c13e48a2c    929690
e3a7e927-2632-4575-993d-d0905cd5da8b    221315
44c93f2b-dd66-4d15-81ef-de9394c76290    211624
0caedec7-1c7d-4e79-aba2-50f6916e643f    166895
97d9238c-1a39-4873-b0bb-963ec2d788e6     24025
Name: count, dtype: int64

None of these datasets show up in the latest LTS:

curr_obs["dataset_id"].isin(most_datasets.index).sum()

And they all seem to be in our dataset blocklist. I am not sure if this accounts for all the missing cells here, but it seems to account for a lot.

@jahilton, do you know why these datasets are on the blocklist currently? Do we expect them to come off the blocklist?
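
To see how much of the gap the blocklisted datasets explain, the two `value_counts` outputs above can be compared label by label. This is a minimal sketch that copies only the labels whose counts differ between the two printed tables; `label_deltas` is a hypothetical helper, not part of `cellxgene_census`. Both tables were cut at 20 rows by `.head(20)`, so a label that is absent from one table is treated as unknown rather than zero.

```python
# Counts copied from the two tables above, restricted to labels whose values differ.
# None means the label does not appear in that printed table (it may have been
# cut off by .head(20)), so its count is unknown here.
curr_counts = {
    "malignant ovarian serous tumor": None,
    "small cell lung carcinoma": 246800,
    "clear cell renal carcinoma": 187792,
    "B-cell non-Hodgkin lymphoma": 59746,
    "tubular adenoma": 27270,
    "tubulovillous adenoma": 7216,
    "chromophobe renal cell carcinoma": 2576,
}
prev_counts = {
    "malignant ovarian serous tumor": 1553549,
    "small cell lung carcinoma": 244903,
    "clear cell renal carcinoma": 20509,
    "B-cell non-Hodgkin lymphoma": 133405,
    "tubular adenoma": 27248,
    "tubulovillous adenoma": 7219,
    "chromophobe renal cell carcinoma": None,
}


def label_deltas(curr, prev):
    """Return {label: curr - prev}, or None where either count is unknown."""
    deltas = {}
    for label in curr:
        c, p = curr[label], prev.get(label)
        deltas[label] = None if c is None or p is None else c - p
    return deltas


deltas = label_deltas(curr_counts, prev_counts)
for label, delta in deltas.items():
    text = "unknown" if delta is None else format(delta, "+,d")
    print(f"{label:35s} {text}")
```

With these numbers, clear cell renal carcinoma gains 167,283 cells and B-cell non-Hodgkin lymphoma loses 73,659, so the ovarian datasets are the largest change between the releases but not the only one.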

jahilton commented 1 month ago

Some samples in that Collection were found to be duplicated within the Datasets, so all of the Datasets were removed from Census while the contributors investigate and possibly redo some of their analysis.

ivirshup commented 1 month ago

Previous issue for this:

https://github.com/chanzuckerberg/single-cell-curation/issues/528

@YubinXie, in the meantime, you can get the data for this dataset from the data portal here: https://cellxgene.cziscience.com/collections/4796c91c-9d8f-4692-be43-347b1727f9d8

YubinXie commented 1 month ago

Hi @ivirshup @jahilton, thanks for checking. The context is helpful (and good to know that the dataset is from our MSK...). I checked the malignant ovarian serous tumor datasets in cellxgene (see the picture below), and they match the datasets that @ivirshup found in the missing studies. Based on my understanding, the "all cells" dataset includes all the others, and that could be where the duplicates come from? If you just remove the other 3 and use the "all cells" one, it should be fine? Let me know if that makes sense.

ivirshup commented 1 month ago

My understanding is that there are ~4k cells in the "All cells" dataset which we suspect may have been included in that dataset twice. That match is based on the expression profiles being identical, but I'm also unfamiliar with the specifics of this check.

YubinXie commented 1 month ago

@ivirshup got it. The 2023-05-15 release has 1,553,549 malignant ovarian serous tumor cells, so it also seems to contain duplicates across the datasets, since the "all cells" one only has 0.9M cells?

jahilton commented 1 month ago

I can provide more specifics that might help explain the duplication. All of the Datasets from this Collection were initially included in Census. The "All cells" Dataset contains every cell in the other Datasets, plus some additional cells. This type of cross-Dataset duplication is handled by annotating is_primary_data. We aim to have every cell in the corpus annotated as is_primary_data:True exactly once. So the "All cells" Dataset is all True while the other Datasets are all False. (So if you were to query all of the "malignant ovarian serous tumor" cells in Census without also filtering for is_primary_data:True, you would get duplicated cells.)
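
A minimal sketch of what such a filtered query could look like. `build_value_filter` is a hypothetical helper that only assembles the filter string. It assumes the result is passed as `value_filter=` to `cellxgene_census.get_obs` (the function used in the reproduction code earlier in the thread) and that the obs column is named `is_primary_data`. It does not escape quotes inside label names.

```python
def build_value_filter(diseases, primary_only=True):
    """Assemble a SOMA-style value_filter string for the obs dataframe.

    Hypothetical helper: label names are inserted as-is, so a label that
    contains a single quote would need escaping first.
    """
    quoted = ", ".join(f"'{d}'" for d in diseases)
    clauses = [f"disease in [{quoted}]"]
    if primary_only:
        # Keep each cell only where it is flagged as the primary copy.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


value_filter = build_value_filter(["malignant ovarian serous tumor", "glioblastoma"])
print(value_filter)

# Assumed usage (needs network access to the Census, so it is left commented out):
# import cellxgene_census
# with cellxgene_census.open_soma(census_version="2023-05-15") as census:
#     obs = cellxgene_census.get_obs(census, "homo_sapiens", value_filter=value_filter)
```

Passing `primary_only=False` drops the second clause, which reproduces the unfiltered query that returns repeated cells.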

Our QA detected that 2443 cells in the "All cells" Dataset were each included twice: one instance assigned to the author_sample_id value SPECTRUM-OV-007_S1_CD45N_RIGHT_ADNEXA and another to SPECTRUM-OV-007_S1_CD45N_LEFT_ADNEXA. Out of an abundance of caution, we blocked all of the Datasets from Census until the contributor identifies the root cause, and we hope to revise the Datasets soon.
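
The QA check itself is not described in this thread beyond matching on identical expression profiles. As a rough illustration only, the sketch below groups toy cells by a hash of their expression vector and reports profiles that occur under more than one sample ID. The data, the `find_cross_sample_duplicates` helper, and the hashing approach are all assumptions, not the Census team's implementation.

```python
import hashlib
from collections import defaultdict


def profile_key(counts):
    """Hash a cell's expression vector so identical profiles share a key."""
    return hashlib.sha256(",".join(map(str, counts)).encode()).hexdigest()


def find_cross_sample_duplicates(cells):
    """Return sorted sample-ID lists for profiles seen under more than one sample ID."""
    samples_by_profile = defaultdict(set)
    for sample_id, counts in cells:
        samples_by_profile[profile_key(counts)].add(sample_id)
    return [sorted(ids) for ids in samples_by_profile.values() if len(ids) > 1]


# Toy data: (sample ID, raw counts per gene). All values are made up.
toy_cells = [
    ("RIGHT_ADNEXA", [0, 3, 1, 7]),
    ("LEFT_ADNEXA", [0, 3, 1, 7]),  # same profile under a second sample ID
    ("LEFT_ADNEXA", [2, 0, 0, 5]),
    ("RIGHT_ADNEXA", [1, 1, 4, 0]),
]

duplicates = find_cross_sample_duplicates(toy_cells)
print(duplicates)
```

Exact matching on raw counts is a deliberately strict choice here: two distinct cells are very unlikely to share an identical count vector across thousands of genes, so a hit is a strong hint that the same cell was loaded twice.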

YubinXie commented 1 month ago

This is super helpful, @jahilton. I want to check one more thing. The 2023-05-15 release has 1.5M ovarian cancer cells while the "all cells" ovarian dataset has 0.9M. Does this mean there are 0.6M duplicates in the 2023-05-15 release?

jahilton commented 1 month ago

That is correct. There are many duplicates in each Census version, due to cases like this where there is an "all cells" Dataset plus subset Datasets, and also when the same samples appear in different submissions (e.g. integration studies). Filtering for is_primary_data:True will de-dup your results.
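
The per-dataset counts posted earlier in the thread make this arithmetic easy to check. The sketch assumes that the largest dataset (about 0.9M cells) is the "All cells" one; the thread implies this but never states which dataset ID it is.

```python
# Cells labeled "malignant ovarian serous tumor" per dataset in 2023-05-15,
# copied from the dataset_id value_counts output earlier in the thread.
ovarian_cells_by_dataset = {
    "b252b015-b488-4d5c-b16e-968c13e48a2c": 929690,  # assumed to be "All cells"
    "e3a7e927-2632-4575-993d-d0905cd5da8b": 221315,
    "44c93f2b-dd66-4d15-81ef-de9394c76290": 211624,
    "0caedec7-1c7d-4e79-aba2-50f6916e643f": 166895,
    "97d9238c-1a39-4873-b0bb-963ec2d788e6": 24025,
}

total = sum(ovarian_cells_by_dataset.values())
all_cells = max(ovarian_cells_by_dataset.values())
non_primary = total - all_cells  # cells contributed by the subset datasets

print(f"total={total:,} all_cells={all_cells:,} non_primary={non_primary:,}")
```

The total matches the 1,553,549 cells reported for the 2023-05-15 release, and the 623,859 cells from the subset datasets are the roughly 0.6M repeats discussed above.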

YubinXie commented 1 month ago

This is great to know. Thanks @jahilton!

YubinXie commented 1 month ago

An additional finding to share: I find filtering on is_primary_data:True super helpful. After the filtering, I found some datasets with as few as 36 cells. I looked into it, and it turns out that it is from my own paper (our SCLC paper, though I did not process the scRNA data). It seems there are some cells in the immune dataset that are not included in the "all cells" dataset. It is not a big issue, but I wanted to share it here. I did not realize the is_primary_data:True tag is at the cell level rather than the dataset level.
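
Because the flag is set per cell, a subset dataset can still contribute a few primary cells of its own. A toy illustration with made-up records (the dataset names, cell IDs, and counts are hypothetical), counting primary cells per dataset:

```python
from collections import Counter

# Toy obs records: (dataset, cell ID, is_primary_data). All values are made up.
toy_obs = [
    ("all_cells", "c1", True),
    ("all_cells", "c2", True),
    ("all_cells", "c3", True),
    ("immune_subset", "c2", False),  # also present in all_cells, so not primary
    ("immune_subset", "c3", False),
    ("immune_subset", "c4", True),  # missing from all_cells, so primary here
]

# Number of primary cells each dataset contributes after filtering.
primary_per_dataset = Counter(ds for ds, _, is_primary in toy_obs if is_primary)

# Each distinct cell should be primary exactly once across the corpus.
unique_primary_cells = {cell for _, cell, is_primary in toy_obs if is_primary}

print(primary_per_dataset)
print(len(unique_primary_cells))
```

In this toy case the subset dataset keeps a single cell after filtering, which mirrors how a dataset can end up with only a handful of cells once non-primary copies are dropped.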

chanzuckerberg / cellxgene-census

Current stable data release has fewer cancer cells than "2023-05-15" #1255