bowang-lab / scGPT

https://scgpt.readthedocs.io/en/latest/
MIT License
981 stars 185 forks

33 million normal cells do not match CELLxGENE census version 15 May 2023 #239

Open litxiaoyao opened 1 month ago

litxiaoyao commented 1 month ago

The CELLxGENE census version 15 May 2023 contains 33 million cells in total, but only ~22M of them are normal. So is the description in the paper wrong?

litxiaoyao commented 1 month ago

Besides, only with these 22M normal cells do I end up with the 51 tissues mentioned in the paper.

yubin-ai commented 1 month ago

Can you check what these numbers look like after filtering for is_primary_data == True in adata.obs?
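A minimal sketch of that check, assuming the cell metadata is available as rows with the columns `disease`, `is_primary_data` and `tissue_general` (for an AnnData object, `adata.obs.to_dict("records")` yields rows of this shape). The function name and the toy rows are hypothetical and only illustrate how the cell and tissue counts change with the filter; this is not the code used by scGPT.

```python
from typing import Iterable, Mapping


def summarize_normal_cells(rows: Iterable[Mapping[str, object]]) -> dict:
    """Count normal cells and distinct tissues, with and without the primary-data filter."""
    summary = {
        "all": {"cells": 0, "tissues": set()},
        "primary": {"cells": 0, "tissues": set()},
    }
    for row in rows:
        if row["disease"] != "normal":
            continue  # keep only cells annotated as normal
        summary["all"]["cells"] += 1
        summary["all"]["tissues"].add(row["tissue_general"])
        if row["is_primary_data"]:
            summary["primary"]["cells"] += 1
            summary["primary"]["tissues"].add(row["tissue_general"])
    return summary


# Toy metadata (made up): the second row stands for a non-primary copy of a cell.
toy_obs = [
    {"disease": "normal", "is_primary_data": True, "tissue_general": "lung"},
    {"disease": "normal", "is_primary_data": False, "tissue_general": "lung"},
    {"disease": "normal", "is_primary_data": True, "tissue_general": "blood"},
    {"disease": "COVID-19", "is_primary_data": True, "tissue_general": "blood"},
]
result = summarize_normal_cells(toy_obs)
print(result["all"]["cells"], result["primary"]["cells"])
```

For a table with tens of millions of rows, a vectorized boolean mask on the DataFrame would be the more practical choice; the loop above is only meant to make the two counts explicit.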

litxiaoyao commented 1 month ago

All previous numbers are filtered with is_primary_data == True.

subercui commented 1 month ago

Hi @litxiaoyao, we didn't filter by is_primary_data when we trained the model. You can find the code for downloading the data in the data folder. The reason was simply that, at the time, we were not sure exactly what is_primary_data meant, and we could not find out whether it indicates that a cell is a duplicate. So we decided to train with all normal cells, pooled straightforwardly from the CELLxGENE census. The training set we eventually had was a bit over 33 million cells.

yubin-ai commented 1 month ago

@subercui @litxiaoyao Yes, I recently found that scGPT did not filter for that, which resulted in duplicated cells. The difference in cell numbers is a result of that. I was about to open an issue to mention this, but I'm glad it is already being discussed here.

Take-home message: always filter by is_primary_data == True

litxiaoyao commented 1 month ago

Hi @subercui, however, there are ~37M healthy cells if is_primary_data == True is not set.

litxiaoyao commented 3 weeks ago

The index files produced by the code in the data folder also add up to 37489811 cells in total.
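One way to reproduce such a total is to count the entries in each index file and sum them. The sketch below assumes the index files are plain text with one cell ID per line and share an `.idx` suffix; both the file format and the function name are assumptions for illustration, not a description of the repository's actual layout.

```python
import tempfile
from pathlib import Path


def count_index_entries(index_dir: str, pattern: str = "*.idx") -> dict:
    """Return the number of non-empty lines in every index file under index_dir."""
    per_file = {}
    for path in sorted(Path(index_dir).glob(pattern)):
        with path.open() as handle:
            per_file[path.name] = sum(1 for line in handle if line.strip())
    return per_file


# Demo on two small made-up index files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lung.idx").write_text("0\n1\n2\n")
    (Path(tmp) / "blood.idx").write_text("3\n4\n")
    demo_counts = count_index_entries(tmp)

demo_total = sum(demo_counts.values())
print(demo_counts, demo_total)
```

If the same cell ID can appear in more than one index file, this sum counts it more than once, so it may be worth comparing the total against the number of unique IDs as well.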

litxiaoyao commented 3 weeks ago

Hi @subercui, could this difference be caused by the CxG version, i.e. 15 May 2023 versus 8 May 2023? So the real version is not 15 May 2023 but 8 May 2023 stated in the paper?