LungCellAtlas / HLCA

MIT License

full HLCA dataset only contains soupX layer, not counts #6

Closed: samuelmontgomery closed this issue 1 year ago

samuelmontgomery commented 1 year ago

Hi,

I downloaded the full HLCA dataset (in .h5ad format) from cellxgene, but when I import the data it doesn't contain the "counts" layer that the HLCA reproducibility GitHub repository suggests it should:

AnnData object with n_obs × n_vars = 2282447 × 56295 backed at '/home/ubuntu/scratch/references/hlca/local.h5ad'
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', "3'_or_5'", 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'core_or_extension', 'dataset', 'fresh_or_frozen', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'original_ann_level_1', 'original_ann_level_2', 'original_ann_level_3', 'original_ann_level_4', 'original_ann_level_5', 'original_ann_nonharmonized', 'reannotation_type', 'sample', 'scanvi_label', 'sequencing_platform', 'smoking_status', 'study', 'subject_type', 'tissue_coarse_unharmonized', 'tissue_detailed_unharmonized', 'tissue_dissociation_protocol', 'tissue_level_2', 'tissue_level_3', 'tissue_sampling_method', 'total_counts', 'transf_ann_level_1_label', 'transf_ann_level_1_uncert', 'transf_ann_level_2_label', 'transf_ann_level_2_uncert', 'transf_ann_level_3_label', 'transf_ann_level_3_uncert', 'transf_ann_level_4_label', 'transf_ann_level_4_uncert', 'transf_ann_level_5_label', 'transf_ann_level_5_uncert', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'original_gene_symbols', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_scanvi_emb', 'X_umap'
    layers: 'soupX'
    obsp: 'connectivities', 'distances'

Is there something I am missing? Or does the full dataset only contain soupX counts?

LisaSikkema commented 1 year ago

Hi @samuelmontgomery ,

If I'm correct, you can find these under adata.raw (according to cellxgene standards). Could you check? I should put that in the explanatory text on the cellxgene page; I will make sure to do that.

samuelmontgomery commented 1 year ago

Thanks @LisaSikkema. Is that a separate download, or is it available when I import the file with sc.read? I am following the steps in the Jupyter notebook for creating a deconvolution matrix. I have downloaded and imported the core dataset, and this one has no layers when imported:

adata = sc.read("/home/ubuntu/scratch/references/hlca/core.h5ad")

AnnData object with n_obs × n_vars = 584944 × 28024
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'dataset', 'entropy_dataset_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1', 'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'n_genes_detected', 'original_ann_highest_res', 'original_ann_level_1', 'original_ann_level_2', 'original_ann_level_3', 'original_ann_level_4', 'original_ann_level_5', 'original_ann_nonharmonized', 'reannotation_type', 'reference_genome', 'sample', 'scanvi_label', 'sequencing_platform', 'size_factors', 'smoking_status', 'study', 'subject_type', 'tissue_dissociation_protocol', 'tissue_level_2', 'tissue_level_3', 'tissue_sampling_method', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_scanvi_emb', 'X_umap'
    obsp: 'connectivities', 'distances'

LisaSikkema commented 1 year ago

yeah the cellxgene object is slightly different from the file I worked with, as they have particular formatting requirements.

You won't find the layer by running adata and seeing what it prints. Could you see what it prints when you run adata.raw? You would then have to move the raw counts to adata.layers['counts'] manually to make it work with the deconvolution matrix script

samuelmontgomery commented 1 year ago

adata.raw prints <anndata._core.raw.Raw at 0x7f83ebad8b90>. I have read through the cellxgene documentation, and it seems that the raw data is simply stored in adata.X and the normalised data is in adata.layers['soupX'], which was not clear to me (but I am very much a novice). Thanks for your help!

LisaSikkema commented 1 year ago

okay, try adata.raw.X
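To tell which matrix actually holds unnormalised counts, one option is to check whether a slice of it contains only non-negative whole numbers. The sketch below uses a hypothetical helper, `looks_like_counts`, on toy matrices; it is a heuristic under the assumption that raw counts are stored as whole numbers, not a cellxgene or anndata feature.

```python
import numpy as np
from scipy import sparse


def looks_like_counts(matrix, n_rows=1000):
    """Return True if the first n_rows hold only non-negative whole numbers."""
    block = matrix[:n_rows]
    # For sparse input only the stored (non-zero) values need checking.
    values = block.data if sparse.issparse(block) else np.asarray(block).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


# Toy matrices standing in for a raw matrix and a log-transformed one.
raw_like = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0], [0.0, 2.0, 0.0]]))
normalised_like = sparse.csr_matrix(np.log1p(raw_like.toarray()))

print(looks_like_counts(raw_like))         # True
print(looks_like_counts(normalised_like))  # False
```

On an object loaded in memory, the same check could be applied to adata.raw.X, adata.X, and each entry of adata.layers.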

LisaSikkema commented 1 year ago

It is also all explained in this table: https://github.com/LungCellAtlas/HLCA/blob/main/docs/HLCA_metadata_explanation.csv

samuelmontgomery commented 1 year ago

Thanks Lisa, can confirm that worked