chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

is_primary_data==False for cells not duplicated #798

Closed emdann closed 11 months ago

emdann commented 1 year ago

Related to https://github.com/chanzuckerberg/cellxgene-census/issues/468

I'm using the CxG Census for a meta-analysis across all available cells for the same tissue. I found that filtering for is_primary_data==True to remove duplicate cells also excludes some cells that are part of meta-analysis studies (e.g. re-analysed in a data integration effort) but whose primary data is not deposited in any CxG collection.

In particular I've noticed this is the case for the datasets included in the Human Lung Cell Atlas dataset: e.g. of the 268932 cells with disease=='pulmonary fibrosis', only 51343 are recorded as primary data, even though the rest are not found in any other CxG collection (the original study is Adams et al. 2020). According to the docs the is_primary_data column should label duplicated cells, but in this case it's labelling cells that have been reprocessed but not duplicated.
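To make the check above concrete, here is a minimal sketch of the tally that surfaces this kind of discrepancy. It assumes the Census obs metadata has already been loaded into plain rows with dataset_id, disease and is_primary_data fields. The toy rows and the primary_tally helper are hypothetical and only illustrate the counting; the final subtraction uses the two counts reported above.

```python
from collections import Counter

# Toy stand-in for Census `obs` rows; every row here is invented for illustration.
toy_obs = [
    {"dataset_id": "hlca", "disease": "pulmonary fibrosis", "is_primary_data": True},
    {"dataset_id": "hlca", "disease": "pulmonary fibrosis", "is_primary_data": False},
    {"dataset_id": "hlca", "disease": "pulmonary fibrosis", "is_primary_data": False},
    {"dataset_id": "hlca", "disease": "normal", "is_primary_data": True},
    {"dataset_id": "other", "disease": "pulmonary fibrosis", "is_primary_data": True},
]


def primary_tally(rows, disease):
    """Count cells per (dataset_id, is_primary_data) for one disease label (hypothetical helper)."""
    return Counter(
        (row["dataset_id"], row["is_primary_data"])
        for row in rows
        if row["disease"] == disease
    )


tally = primary_tally(toy_obs, "pulmonary fibrosis")
for (dataset_id, is_primary), n_cells in sorted(tally.items()):
    print(dataset_id, is_primary, n_cells)

# Same arithmetic applied to the counts reported in this issue:
# cells that the is_primary_data == True filter would drop.
total_cells, primary_cells = 268932, 51343
dropped_cells = total_cells - primary_cells
print(dropped_cells)
```

A per-dataset tally like this only shows where non-primary cells concentrate. Whether those cells are actually duplicated elsewhere still has to be checked against the other collections.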

In the short term it would be useful to amend the docs to explain this. Ideally, there should be a distinction between duplicated and re-analysed cells. In my case I have no choice but to exclude these datasets to avoid including duplicated cells.

jahilton commented 11 months ago

Hi @emdann, thank you for bringing this to our attention. We will do a thorough review, and it does appear that is_primary_data should be revised in HLCA.

The observations you're referring to in HLCA from Adams et al. (labeled as study:Kaminski_2020) were marked is_primary_data:False in HLCA because observations from that study were already marked is_primary_data:True in the 'extended atlas' Data of the LuCA Collection, labeled as dataset:Adams_Kaminski_2020. It is now clear that the pulmonary fibrosis samples, and potentially others, were not included in the LuCA Collection.

jahilton commented 11 months ago

@emdann, the revisions to HLCA is_primary_data are now published in CELLxGENE Discover and will be reflected in future Census releases. Details are in "review is_primary_data in HLCA".