Closed emdann closed 11 months ago
Hi @emdann, thank you for bringing this to our attention. We will do a thorough review, and it does appear that is_primary_data should be revised in HLCA.
The observations you're referring to in HLCA from Adams et al. (labeled as study:Kaminski_2020) were marked is_primary_data:False in HLCA because observations from that study were already marked is_primary_data:True in the 'extended atlas' Data of the LuCA Collection labeled as dataset:Adams_Kaminski_2020. It is now clear that the pulmonary fibrosis samples, and potentially others, were not included in the LuCA Collection.
@emdann, the revisions to HLCA is_primary_data are now published in CELLxGENE Discover and will be reflected in future Census releases. Details are in "review is_primary_data in HLCA".
Related to https://github.com/chanzuckerberg/cellxgene-census/issues/468
I'm using cxg census for meta analysis across all available cells for the same tissue. I found that by filtering for is_primary_data==True to remove duplicate cells I also end up excluding some cells that are part of meta-analysis studies (e.g. re-analysed in a data integration effort), but for which the primary data is not deposited in CxG collections.

In particular I've noticed this is the case for the datasets included in the Human Lung Cell Atlas dataset: e.g. of the 268932 cells with disease=='pulmonary fibrosis', only 51343 are recorded as primary data, even though the rest are not found in any other CxG collection (the original study is Adams et al. 2020). According to the docs the is_primary_data column should label duplicated cells, but in this case it's labelling cells that have been reprocessed but not duplicated.

In the short term it would be useful to amend the docs to explain this. Ideally, there should be a distinction between duplicated and re-analysed cells. In my case I have no choice but to exclude these datasets to avoid including duplicated cells.
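
For anyone wanting to run the same kind of check, here is a minimal sketch of the tally described above: for one disease label, count how many cells are flagged as primary and how many are not. It uses only a small in-memory list of dicts as a stand-in for the obs metadata. The rows, the dataset names, and the helper name primary_breakdown are all hypothetical and are not taken from the Census or from HLCA.

```python
from collections import Counter

# Hypothetical toy rows standing in for cell-level obs metadata.
# The dataset names and counts are illustrative only, not real Census values.
obs_rows = [
    {"dataset_id": "integrated_atlas", "disease": "pulmonary fibrosis", "is_primary_data": True},
    {"dataset_id": "integrated_atlas", "disease": "pulmonary fibrosis", "is_primary_data": False},
    {"dataset_id": "integrated_atlas", "disease": "pulmonary fibrosis", "is_primary_data": False},
    {"dataset_id": "integrated_atlas", "disease": "normal", "is_primary_data": True},
    {"dataset_id": "other_collection", "disease": "normal", "is_primary_data": False},
]


def primary_breakdown(rows, disease):
    """Count cells for one disease, split by the is_primary_data flag."""
    counts = Counter(
        row["is_primary_data"] for row in rows if row["disease"] == disease
    )
    return {"primary": counts[True], "non_primary": counts[False]}


breakdown = primary_breakdown(obs_rows, "pulmonary fibrosis")
print(breakdown)
```

A large non_primary count only shows that filtering on is_primary_data==True would drop those cells. Whether they are true duplicates still has to be checked against the other collections, which this sketch does not attempt.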