Closed samuelmontgomery closed 1 year ago
Hi @samuelmontgomery ,
If I'm correct, you can find these under adata.raw (according to cellxgene standards). Could you check? I should put that in the explanatory text on the cellxgene page; I will make sure to do that.
Thanks @LisaSikkema. Is that a separate download, or is it there when I import the file with sc.read? I am following the steps in the Jupyter notebook for creating a deconvolution matrix. I have downloaded and imported the core dataset, and this one has no layers when imported:
adata = sc.read("/home/ubuntu/scratch/references/hlca/core.h5ad")
AnnData object with n_obs × n_vars = 584944 × 28024
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'dataset', 'entropy_dataset_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1', 'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'n_genes_detected', 'original_ann_highest_res', 'original_ann_level_1', 'original_ann_level_2', 'original_ann_level_3', 'original_ann_level_4', 'original_ann_level_5', 'original_ann_nonharmonized', 'reannotation_type', 'reference_genome', 'sample', 'scanvi_label', 'sequencing_platform', 'size_factors', 'smoking_status', 'study', 'subject_type', 'tissue_dissociation_protocol', 'tissue_level_2', 'tissue_level_3', 'tissue_sampling_method', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_scanvi_emb', 'X_umap'
    obsp: 'connectivities', 'distances'
Yeah, the cellxgene object is slightly different from the file I worked with, as cellxgene has particular formatting requirements.
You won't find the layer by running adata and seeing what it prints. Could you see what it prints when you run adata.raw? You would then have to move the raw counts to adata.layers['counts'] manually to make it work with the deconvolution matrix script.
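A minimal sketch of that manual move, in case it helps. The helper name raw_counts_to_layer is hypothetical (it is not part of scanpy, anndata, or the HLCA scripts), and the sketch assumes the object is fully loaded in memory and that adata.raw has the same genes as adata, since a layer must have the same shape as adata.X.

```python
def raw_counts_to_layer(adata, layer="counts"):
    """Copy adata.raw.X into adata.layers[layer].

    Hypothetical helper for illustration; assumes an in-memory AnnData.
    """
    if adata.raw is None:
        raise ValueError("adata.raw is empty, so there are no raw counts to copy")

    # Layers must match the shape of adata.X, so check before assigning.
    if tuple(adata.raw.X.shape) != tuple(adata.shape):
        raise ValueError(
            f"adata.raw.X has shape {tuple(adata.raw.X.shape)}, "
            f"but adata has shape {tuple(adata.shape)}"
        )

    # Copy so that later edits to the layer do not touch adata.raw.
    adata.layers[layer] = adata.raw.X.copy()
    return adata


# Usage with the file from this thread (path as posted above):
# import scanpy as sc
# adata = sc.read("/home/ubuntu/scratch/references/hlca/core.h5ad")
# raw_counts_to_layer(adata)
# print(adata.layers["counts"].shape)
```

If the shapes differ, the helper stops with an error rather than guessing how to align the genes.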
adata.raw prints <anndata._core.raw.Raw at 0x7f83ebad8b90>. I have read through the cellxgene documentation, and it seems that the raw data is stored in adata.X and the normalised data is in adata.layers["soupX"], which was not clear to me (but I am very much a novice). Thanks for your help!
okay, try adata.raw.X
It is also all explained in this table: https://github.com/LungCellAtlas/HLCA/blob/main/docs/HLCA_metadata_explanation.csv
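Since printing adata.raw only shows an object address, a small summary of where each matrix lives can make this easier to see. The sketch below is only an illustration: describe_matrices is a hypothetical helper, and it relies only on the X, raw, and layers attributes of the object.

```python
def describe_matrices(adata):
    """Return {location: shape} for each expression matrix found.

    Hypothetical helper for illustration only.
    """
    found = {}
    if adata.X is not None:
        found["X"] = tuple(adata.X.shape)
    if adata.raw is not None:
        found["raw.X"] = tuple(adata.raw.X.shape)
    for name in adata.layers.keys():
        found[f"layers[{name!r}]"] = tuple(adata.layers[name].shape)
    return found


# Usage (adata loaded as earlier in this thread):
# for location, shape in describe_matrices(adata).items():
#     print(location, shape)
```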
Thanks Lisa, can confirm that worked
Hi,
I downloaded the full HLCA dataset (in .h5ad format) from cellxgene, but when importing the data it doesn't contain the "counts" layer as the HLCA reproducibility GitHub repository suggests it should:
AnnData object with n_obs × n_vars = 2282447 × 56295 backed at '/home/ubuntu/scratch/references/hlca/local.h5ad'
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', "3'_or_5'", 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'core_or_extension', 'dataset', 'fresh_or_frozen', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'original_ann_level_1', 'original_ann_level_2', 'original_ann_level_3', 'original_ann_level_4', 'original_ann_level_5', 'original_ann_nonharmonized', 'reannotation_type', 'sample', 'scanvi_label', 'sequencing_platform', 'smoking_status', 'study', 'subject_type', 'tissue_coarse_unharmonized', 'tissue_detailed_unharmonized', 'tissue_dissociation_protocol', 'tissue_level_2', 'tissue_level_3', 'tissue_sampling_method', 'total_counts', 'transf_ann_level_1_label', 'transf_ann_level_1_uncert', 'transf_ann_level_2_label', 'transf_ann_level_2_uncert', 'transf_ann_level_3_label', 'transf_ann_level_3_uncert', 'transf_ann_level_4_label', 'transf_ann_level_4_uncert', 'transf_ann_level_5_label', 'transf_ann_level_5_uncert', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'original_gene_symbols', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_scanvi_emb', 'X_umap'
    layers: 'soupX'
    obsp: 'connectivities', 'distances'
Is there something I am missing? Or does the full dataset only contain soupX counts?