taxonomy_mapping() does not find taxonomy column names when using latest docker image

meghanaturner commented 1 year ago

Using the latest scrattch-mapping docker image release leads to an error on line 22 of R/taxonomy_mapping() because colnames(AIT.anndata$uns$clusterInfo) returns NULL.

This issue can be fixed by switching back to the 0.16 version of the docker image. Using 0.16 and the exact same taxonomy h5ad file (//allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/taxonomies/AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_dense_for_mapping.h5ad), colnames(AIT.anndata$uns$clusterInfo) returns the expected list of column names and the code runs as it should.

@berl @egelfan2

UCDNJJ commented 1 year ago

Thanks for reporting this error!

Starting with the latest scrattch-mapping docker (bicore/scrattch_mapping:latest). I wasn't able to recreate this issue from our test cases, so you've found some fun edge case. To be complete, I also tried loading AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_dense_for_mapping.h5ad directly with anndata::read_h5ad() and found the column names available for AIT.anndata$uns$clusterInfo.

I'll need some additional info to figure out what's going on. A few questions:

Are you using anndata::read_h5ad() or scrattch_mapping::loadTaxonomy() to load the taxonomy into R?
Can you share the directory containing all the taxonomy files that should have been created with scrattch_mapping::buildTaxonomy()?
Do you mind sharing your script that works only under scrattch_mapping version 0.16?
Can you share the error report?

berl commented 1 year ago

FYI @scseeman

meghanaturner commented 1 year ago

I'm using anndata::read_h5ad() to load the taxonomy
The taxonomy file wasn't built directly with scrattch_mapping::buildTaxonomy()
/allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/scrattch-mapping_batch.R
Error: Error in taxonomy_mapping(AIT.anndata = taxonomy_anndata, query.data = query_data, : Not all label.cols exists in AIT.anndata$uns$clusterInfo

That's interesting that you can find the column names when you load with anndata::read_h5ad() in the latest. For me, there's different behavior in how the column names are accessible between the two versions.

In the latest version:

colnames(anndata$uns$clusterInfo), which is called in line 22 of taxonomy_mapping, returns NULL
whereas, anndata$uns$clusterInfo$columns returns the expected column names:

[1] "sample_id" "cl" "cluster_label"
[4] "Level2_id_label" "Level1_id_label" "supertype_id_label" [7] "class_id_label" "nt_type_label" "cluster_id.AIT16"
[10] "library_prep" "gene.counts.0" "doublet_score"
[13] "roi" "umi.counts" "qc.score"
[16] "method" "region_label" "region_id"
[19] "sex" "external_donor_name" "age"
[22] "platform" "knn.dist" "knn.dist.z"
[25] "medical_conditions" "broad_region" "cluster_id"
[28] "neighborhood" "batch"
class(taxonomy_anndata$uns$clusterInfo) returns:

[1] "pandas.core.frame.DataFrame" "pandas.core.generic.NDFrame"
[3] "pandas.core.base.PandasObject" "pandas.core.accessor.DirNamesMixin" [5] "pandas.core.indexing.IndexingMixin" "pandas.core.arraylike.OpsMixin"
[7] "python.builtin.object"

In 0.16 the opposite is true:

anndata$uns$clusterInfo$columns returns NULL
colnames(anndata$uns$clusterInfo), which is called in line 22 of taxonomy_mapping, returns the expected column names:

[1] "sample_id" "cl" "cluster_label"
[4] "Level2_id_label" "Level1_id_label" "supertype_id_label" [7] "class_id_label" "nt_type_label" "cluster_id.AIT16"
[10] "library_prep" "gene.counts.0" "doublet_score"
[13] "roi" "umi.counts" "qc.score"
[16] "method" "region_label" "region_id"
[19] "sex" "external_donor_name" "age"
[22] "platform" "knn.dist" "knn.dist.z"
[25] "medical_conditions" "broad_region" "cluster_id"
[28] "neighborhood" "batch"
class(taxonomy_anndata$uns$clusterInfo) returns:

[1] "data.frame"

UCDNJJ commented 1 year ago

I was able to consistently retrieve anndata$uns$clusterInfo as a data.frame under both scrattch_mapping versions. This has to be an environment issue thats leading to you seeing pandas.core.frame.DataFrame under the latest docker. One last ask: Can you return the sessionInfo() for each scrattch_mapping docker when you are running it.

Somehow the version of anndata (R library) was downgraded in the latest scrattch mapping docker. I suspect this is the culprit:

bicore/scrattch_mapping:latest -- anndata_0.7.5.3 bicore/scrattch_mapping:0.16 -- anndata_0.7.5.6

scseeman commented 1 year ago

@meghanaturner @UCDNJJ a couple of weeks ago I was having issues with :latest docker loading at all. I talked with Anish about it and realize that I never actually heard if it got fixed. I've been using the singularity file /allen/programs/celltypes/workgroups/rnaseqanalysis/bicore/singularity/scrattch_mapping_0.2.sif directly instead of the docker and that has worked fine

meghanaturner commented 1 year ago

Indeed, 0.16 has anndata="0.7.5.6" and latest has anndata="0.7.5.3"

It looks like R was also downgraded:

0.16 sessionInfo():

R version 4.2.2 (2022-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.2.2

latest sessionInfo():

R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.2.0

UCDNJJ commented 1 year ago

Hi @meghanaturner, when you have time can you check that this issues is resolved when using this docker image: docker://njjai/scrattch_mapping:0.4. This docker image has the most up to date versions of the anndata and R packages that the previous latest image was supposed to contain.

singularity shell --cleanenv docker://njjai/scrattch_mapping:0.4

Forewarning, quite a few changes exist in this new update. So if you hit an error let us know.

meghanaturner commented 1 year ago

Hi @UCDNJJ, this docker image seems to have fixed the original issue I reported where the column names weren't found

meghanaturner commented 1 year ago

@UCDNJJ However, read_h5ad() no longer reads in sparse matrices as the data type that scrattch-mapping is expecting to find. This problem spontaneously showed up in docker://bicore/scrattch_mapping:0.16 and docker://bicore/scrattch_mapping:latest a couple weeks ago, and does not appear to be fixed by docker://njjai/scrattch_mapping:0.4.

"Error caught for Correlation mapping." <simpleError in validObject(.Object): invalid class “dgCMatrix” object: 'Dim' slot does not have length 2> Error in rownames<-(*tmp*, value = colnames(query.data)) : attempt to set 'rownames' on an object with no dimensions Calls: taxonomy_mapping -> rownames<-

The same error is thrown for a dgCMatrix. The workaround is to only use taxonomy and spatial anndata objects where X is a dense matrix.

As an alternative to read_h5ad, I tried using

loadTaxonomy(taxonomyDir = "//allen/programs/celltypes/workgroups/rnaseqanalysis/mFISH/meghanturner/brain3_mapping/AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad", anndata_file = "AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad")

but despite the documentation for the taxonomyDir argument suggesting that it supports direct h5ad files that aren't part of a shiny taxonomy folder, it errors out with: Required files to load Allen Institute taxonomy are missing.

I saw that you split off scrattch-taxonomy, including loadTaxonomy(), from scrattch-mapping into it's own repo. Should I raise this issue over there?

UCDNJJ commented 1 year ago

Interesting, we definitely don't want to be using dense matrices all the time! Let's leave this issue here for now.

We need to do a better job with documentation but you should always use loadTaxonomy() since we do some work in that function to make sure the anndata object is initialized for mapping. The anndata_file argument assumes an .h5ad file that was generated with buildTaxonomy() which is why you are seeing that error about missing required files.

I took a quick look and AIT17.0.logCPM.sampled100_MERSCOPE_BRAIN3_GENES_cscSparseX.h5ad doesn't appear to have been setup with buildTaxonomy() so this .h5ad will not work with scrattch.mapping. I would suggest running buildTaxonomy() using the count matrix and metadata from that object, you can follow the steps in this tutorial: build_taxonomy

Also, can see if you can run the tutorial without error: mapping

meghanaturner commented 1 year ago

In attempting to follow the build_taxonomy tutorial, I am unable to load the counts matrix from the taxonomy I'm using into R.

I am not familiar with R, so I'm not sure what R's anndata package is expecting to find in an ad.X stored as a CSR sparse matrix. And the tutorial does not provide any suggestions of how to read in counts matrices from other h5ad files (it just does library(tasic2016data); taxonomy.counts = tasic_2016_counts)

# import libraries
library(scrattch.mapping)
library(umap)

# taxonomy I want to use for mapping my spatial data
taxonomy_h5ad_path = "//allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/Taxonomies/AIT17.0_mouse/Prepare/AIT17.0.logCPM.sampled100.h5ad"

# Load taxonomy anndata file
taxonomy_anndata = read_h5ad(taxonomy_h5ad_path )

# Load the count data
taxonomy.counts = taxonomy_anndata$X   # ***this line fails***

Error:

Error in py_ref_to_r(x) : negative length vectors are not allowed
Calls: <Anonymous> ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r

UCDNJJ commented 1 year ago

So I also tried running through your code both in a separate R environment and within the scrattch.mapping docker. Both produced the same error.

Could this error be arising due to the dataset size or some change in the .h5ad file that happened a few weeks ago.

I can use the same approach you shared with a dataset or ~340k cells and ~22k genes: /allen/programs/celltypes/workgroups/rnaseqanalysis/shiny/10x_seq/NHP_BG_AIT_115/NHP_BG_AIT115_complete.h5ad. R successfully reads in the anndata$X as a dgR sparse matrix.

AllenInstitute / scrattch.mapping

taxonomy_mapping() does not find taxonomy column names when using latest docker image #30