Open litxiaoyao opened 1 month ago
Besides, these 22M normal cells alone are already enough to reach the 51 tissues mentioned in the paper.
Can you check what these numbers look like after filtering for is_primary_data == True in adata.obs?
All previous numbers are filtered with is_primary_data == True.
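For readers following along, the check being discussed can be sketched on a toy table. This is only an illustration: the rows, the `cell` and `dataset` keys, and the `count_normal` helper are hypothetical, not taken from the scGPT code. The one fact it relies on is that is_primary_data marks the canonical copy of a cell when the same cell appears in more than one dataset.

```python
# Hypothetical toy stand-in for adata.obs: one dict per cell record.
obs = [
    {"cell": "c1", "dataset": "A", "disease": "normal", "is_primary_data": True},
    # Same cell re-published in a second dataset, so it is not primary there.
    {"cell": "c1", "dataset": "B", "disease": "normal", "is_primary_data": False},
    {"cell": "c2", "dataset": "A", "disease": "normal", "is_primary_data": True},
    {"cell": "c3", "dataset": "C", "disease": "COVID-19", "is_primary_data": True},
]


def count_normal(rows, primary_only):
    """Count normal cells, optionally keeping only primary records."""
    return sum(
        1
        for r in rows
        if r["disease"] == "normal" and (r["is_primary_data"] or not primary_only)
    )


n_all = count_normal(obs, primary_only=False)     # counts the duplicate of c1 as well
n_primary = count_normal(obs, primary_only=True)  # each normal cell counted once

print(f"normal cells without filter: {n_all}, with filter: {n_primary}")
```

With a real AnnData object, the equivalent subset would be something like `adata[adata.obs["is_primary_data"]]`, since is_primary_data is a boolean column in obs.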
Hi @litxiaoyao, we didn't filter by is_primary_data when we trained the model. You can find the code for downloading the data in the data folder. The reason is that, at the time, we were not sure exactly what is_primary_data meant, and we couldn't determine whether it indicates that a cell is a duplicate. So we decided to train on all normal cells, pooled straightforwardly from the cellxgene census; the training set we ended up with was a bit over 33 million cells.
@subercui @litxiaoyao yes, I recently found that scGPT did not filter for that, which resulted in duplicated cells. The difference in cell number is a result of that. I was about to open an issue to mention this, but I'm glad it is already being discussed here.
take home message:
always filter by is_primary_data == True
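A minimal sketch of how that filter could be applied when counting cells directly from the census, assuming the `cellxgene_census` Python package is installed and the requested release is still hosted. The helper names `build_obs_filter` and `count_cells`, the `homo_sapiens` organism key, and the version strings in the usage comment are assumptions for illustration; this is not the download code from the data folder.

```python
def build_obs_filter(primary_only: bool = True, disease: str = "normal") -> str:
    """Build a SOMA value_filter string for the census obs table."""
    parts = [f"disease == '{disease}'"]
    if primary_only:
        # Keep only the canonical copy of each cell.
        parts.append("is_primary_data == True")
    return " and ".join(parts)


def count_cells(census_version: str, primary_only: bool = True) -> int:
    """Count matching human cells in one census release (needs network access)."""
    import cellxgene_census  # imported lazily so the filter helper works without it

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = census["census_data"]["homo_sapiens"].obs  # assumed organism
        table = obs.read(
            value_filter=build_obs_filter(primary_only),
            column_names=["soma_joinid"],  # one small column is enough for counting
        ).concat()
    return table.num_rows


# Example usage (not run here; version strings are assumptions):
# count_cells("2023-05-15", primary_only=True)
# count_cells("2023-05-15", primary_only=False)
```

Running both calls against the same release would show how much of the gap between the reported numbers comes from the filter alone, separately from any difference between releases.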
Hi @subercui, however, there are ~37M healthy cells if we don't set is_primary_data == True.
The index files produced by the code in the data folder also add up to 37489811 cells in total.
Hi @subercui, could this difference be caused by the CxG version, i.e. 15 May 2023 versus 8 May 2023? So the real version is not 15 May 2023 but the 8 May 2023 stated in the paper?
There are 33 million cells in total in CELLxGENE census version 15 May 2023, but only ~22M of them are normal. So is this described incorrectly in the paper?