GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
384 stars 137 forks source link

Mitochondrial genes removed at import10xFeatureMatrix #564

Closed kkobayashikirschvink closed 1 year ago

kkobayashikirschvink commented 3 years ago

Hi Jeffrey and Ryan,

I was testing the multiome pipeline following the tutorial below. https://greenleaflab.github.io/ArchR_2020/Ex-Analyze-Multiome.html

However, I found out that at the import10xFeatureMatrix function, all the mitochondrial genes were removed.

seRNA <- import10xFeatureMatrix(input = input_files, names = input_names)
Importing Feature Matrix 1 of 1
sum(str_detect(rownames(seRNA),'MT-'))
[1] 0

Is this the default behavior? I have confirmed that they exist in the cellranger output like below.

Screenshot 2021-03-02 210844

Thanks!

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] openxlsx_4.2.3                     ggrepel_0.9.1                     
 [3] circlize_0.4.12                    ComplexHeatmap_2.6.2              
 [5] gtable_0.3.0                       harmony_1.0                       
 [7] Rcpp_1.0.6                         uwot_0.1.10                       
 [9] gridExtra_2.3                      nabor_0.5.0                       
[11] BSgenome.Mmusculus.UCSC.mm10_1.4.0 SeuratDisk_0.0.0.9013             
[13] qusage_2.24.0                      limma_3.46.0                      
[15] stringr_1.4.0                      BSgenome.Hsapiens.UCSC.hg38_1.4.3 
[17] BSgenome_1.58.0                    rtracklayer_1.50.0                
[19] Biostrings_2.58.0                  XVector_0.30.0                    
[21] Seurat_3.2.3                       ArchR_1.0.1                       
[23] magrittr_2.0.1                     rhdf5_2.34.0                      
[25] Matrix_1.2-18                      data.table_1.13.6                 
[27] SummarizedExperiment_1.20.0        Biobase_2.50.0                    
[29] GenomicRanges_1.42.0               GenomeInfoDb_1.26.2               
[31] IRanges_2.24.1                     S4Vectors_0.28.1                  
[33] BiocGenerics_0.36.0                MatrixGenerics_1.2.0              
[35] matrixStats_0.57.0                 ggplot2_3.3.3                     

loaded via a namespace (and not attached):
  [1] plyr_1.8.6               igraph_1.2.6             lazyeval_0.2.2          
  [4] splines_4.0.3            BiocParallel_1.24.1      listenv_0.8.0           
  [7] scattermore_0.7          digest_0.6.27            htmltools_0.5.1.1       
 [10] fansi_0.4.2              tensor_1.5               cluster_2.1.0           
 [13] ROCR_1.0-11              globals_0.14.0           colorspace_2.0-0        
 [16] xfun_0.20                dplyr_1.0.3              crayon_1.3.4            
 [19] RCurl_1.98-1.2           jsonlite_1.7.2           spatstat_1.64-1         
 [22] spatstat.data_1.7-0      survival_3.2-7           zoo_1.8-8               
 [25] glue_1.4.2               polyclip_1.10-0          emmeans_1.5.3           
 [28] zlibbioc_1.36.0          leiden_0.3.6             GetoptLong_1.0.5        
 [31] DelayedArray_0.16.1      Rhdf5lib_1.12.0          future.apply_1.7.0      
 [34] shape_1.4.5              abind_1.4-5              scales_1.1.1            
 [37] mvtnorm_1.1-1            DBI_1.1.1                miniUI_0.1.1.1          
 [40] viridisLite_0.3.0        xtable_1.8-4             clue_0.3-58             
 [43] reticulate_1.18          bit_4.0.4                rsvd_1.0.3              
 [46] htmlwidgets_1.5.3        httr_1.4.2               RColorBrewer_1.1-2      
 [49] ellipsis_0.3.1           ica_1.0-2                pkgconfig_2.0.3         
 [52] XML_3.99-0.5             deldir_0.2-9             tidyselect_1.1.0        
 [55] rlang_0.4.10             reshape2_1.4.4           later_1.1.0.1           
 [58] munsell_0.5.0            tools_4.0.3              cli_2.2.0               
 [61] generics_0.1.0           ggridges_0.5.3           evaluate_0.14           
 [64] fastmap_1.0.1            yaml_2.2.1               goftest_1.2-2           
 [67] bit64_4.0.5              knitr_1.30               fitdistrplus_1.1-3      
 [70] zip_2.1.1                purrr_0.3.4              RANN_2.6.1              
 [73] pbapply_1.4-3            future_1.21.0            nlme_3.1-150            
 [76] mime_0.9                 hdf5r_1.3.3              compiler_4.0.3          
 [79] rstudioapi_0.13          plotly_4.9.3             png_0.1-7               
 [82] spatstat.utils_2.0-0     tibble_3.0.5             stringi_1.5.3           
 [85] lattice_0.20-41          fftw_1.0-6               vctrs_0.3.6             
 [88] pillar_1.4.7             lifecycle_0.2.0          rhdf5filters_1.2.0      
 [91] lmtest_0.9-38            GlobalOptions_0.1.2      estimability_1.3        
 [94] RcppAnnoy_0.0.18         cowplot_1.1.1            bitops_1.0-6            
 [97] irlba_2.3.3              httpuv_1.5.5             patchwork_1.1.1         
[100] R6_2.5.0                 promises_1.1.1           KernSmooth_2.23-18      
[103] parallelly_1.23.0        codetools_0.2-18         MASS_7.3-53             
[106] assertthat_0.2.1         rjson_0.2.20             withr_2.4.0             
[109] GenomicAlignments_1.26.0 sctransform_0.3.2        Rsamtools_2.6.0         
[112] GenomeInfoDbData_1.2.4   mgcv_1.8-33              rpart_4.1-15            
[115] tidyr_1.1.2              rmarkdown_2.6            Cairo_1.5-12.2          
[118] Rtsne_0.15               shiny_1.5.0              tinytex_0.29
rcorces commented 3 years ago

This is the expected behavior and I'm not sure how to fix this. This happens at this line of code: https://github.com/GreenleafLab/ArchR/blob/2f022a448d8248a0f9afb33419bcbaeafe7731c0/R/MultiModal.R#L89

The feature matrix in the hdf5 file (accessed via features <- h5read(featureMatrix, "/matrix/features")) has a column called "interval" which gives chromosome coordinates. In the 10x multiome example datasets, the interval column is all "NA" for the mitochondrial genes and thus these genes are excluded from the SummarizedExperiment.

> rowData(se)[grep(pattern = "MT-", x = rownames(rowData(se))),]
DataFrame with 13 rows and 5 columns
           feature_type genome              id interval    name
                  <Rle>  <Rle>         <array>  <array> <array>
MT-ND1  Gene Expression GRCh38 ENSG00000198888       NA  MT-ND1
MT-ND2  Gene Expression GRCh38 ENSG00000198763       NA  MT-ND2
MT-CO1  Gene Expression GRCh38 ENSG00000198804       NA  MT-CO1
MT-CO2  Gene Expression GRCh38 ENSG00000198712       NA  MT-CO2
MT-ATP8 Gene Expression GRCh38 ENSG00000228253       NA MT-ATP8
...                 ...    ...             ...      ...     ...
MT-ND4L Gene Expression GRCh38 ENSG00000212907       NA MT-ND4L
MT-ND4  Gene Expression GRCh38 ENSG00000198886       NA  MT-ND4
MT-ND5  Gene Expression GRCh38 ENSG00000198786       NA  MT-ND5
MT-ND6  Gene Expression GRCh38 ENSG00000198695       NA  MT-ND6
MT-CYB  Gene Expression GRCh38 ENSG00000198727       NA  MT-CYB

I'm not sure why these genes dont have annotated intervals (that seems like a 10x problem?). Presumably you could try to manually edit this yourself.

Anything I'm missing @jgranja24 ?

kkobayashikirschvink commented 3 years ago

Hi Ryan,

Thank you for the reply!

I wanted to filter some cells based on their mitochondrial gene counts. What do you think is the easiest way to do this? For example, can I import other data formats to ArchR such as h5ad or Seurat files?

kkobayashikirschvink commented 3 years ago

For your information, I contacted 10x about this, and it looks like they don't include chrM gene coordinates as they exclude them for peak calling in cellranger.

In the ARC pipeline, we consider chrM as the non_nuclear_contigs (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/release-notes/references#GRCh38-2020-A):
non_nuclear_contigs: [\"chrM\"]

Here are some more detail about the 'non_nuclear_contigs' (https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/advanced/references): 

non_nuclear_contigs (Optional; list of strings) name(s) of contig(s) that do not have any chromatin structure, for example, mitochondria or plastids. For the GRCh38 assembly this would be ["chrM"]. These contigs are excluded from peak calling since the entire contig will be "open" due to a lack of chromatin structure.

As you can see, chrM is excluded from peak calling. Given that you are working with GEX+ATAC joint analysis, we did not include the full information of genes in chrM in GEX results as well.
rcorces commented 3 years ago

Thanks for posting the information from 10x. I think the most straightforward way to do this would be to use the RNA data alone to identify barcodes that you want to remove based on MT- genes and then remove these cells from the ArchR project using subsetting. This will require some manual work on your end and is not part of a standard ArchR workflow at the moment but we will take this into account as this is a common filtering step in scRNA-seq

kkobayashikirschvink commented 3 years ago

Hi Ryan,

Thanks, that would be fantastic. For now I’ll subset cells based on barcodes as you suggested.

rcorces commented 1 year ago

Better late than never. We've done a big overhaul to the multiomic import functions and now make it possible to retain chrM genes. This is currently available on dev and will be incorporated into the next stable release. To do this, you need to tell ArchR what the gene intervals are for these genes via the features argument. This is because CellRanger doesnt provide those gene positions.

https://github.com/GreenleafLab/ArchR/blob/bb0d391cec77ba751410518beac824470629dd74/R/MultiModal.R#L19-L22

kkobayashikirschvink commented 1 year ago

Thank you so much! This is fantastic

gpersico91 commented 1 year ago

Dears, I'm working on multiome data (RNA+ATAC). I want to remove MT genes from RNA assay, so

genes.use <- grep(pattern = "^MT-", rownames(seurat),value=TRUE, invert=T) #get list of non-ribosomal genes 
seurat<- subset(seurat, features = C024genes.use)

However, after subset function, the new seurat object contain only RNA assay, the ATAC assay is not present How i can solve this?

thanks in advance

Giuseppe