Hi,
Thanks for reporting this. Would you mind running the beginning of the 01_SS2_processing script (enough to load all the libraries) in the terminal, using Singularity shell, and posting the output of sessionInfo()?
We can also try to share the integrated AnnData object, if this is easier?
Sure:
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scran_1.20.1 scuttle_1.2.0
## [3] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
## [5] Biobase_2.52.0 GenomicRanges_1.44.0
## [7] GenomeInfoDb_1.28.1 IRanges_2.26.0
## [9] S4Vectors_0.30.0 BiocGenerics_0.38.0
## [11] MatrixGenerics_1.4.1 matrixStats_0.60.0
## [13] anndata_0.7.5.2 viridis_0.6.1
## [15] viridisLite_0.4.0 ggrepel_0.9.1
## [17] ggplot2_3.3.5 reshape2_1.4.4
## [19] biomaRt_2.48.2 devtools_2.4.2
## [21] usethis_2.0.1 BiocManager_1.30.16
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 ellipsis_0.3.2
## [3] rprojroot_2.0.2 bluster_1.2.1
## [5] XVector_0.32.0 BiocNeighbors_1.10.0
## [7] fs_1.5.0 remotes_2.4.0
## [9] bit64_4.0.5 AnnotationDbi_1.54.1
## [11] fansi_0.5.0 xml2_1.3.2
## [13] sparseMatrixStats_1.4.0 cachem_1.0.5
## [15] knitr_1.33 pkgload_1.2.1
## [17] jsonlite_1.7.2 cluster_2.1.2
## [19] dbplyr_2.1.1 png_0.1-7
## [21] compiler_4.1.0 httr_1.4.2
## [23] dqrng_0.3.0 assertthat_0.2.1
## [25] Matrix_1.3-4 fastmap_1.1.0
## [27] limma_3.48.1 cli_3.0.1
## [29] BiocSingular_1.8.1 prettyunits_1.1.1
## [31] tools_4.1.0 rsvd_1.0.5
## [33] igraph_1.2.6 gtable_0.3.0
## [35] glue_1.4.2 GenomeInfoDbData_1.2.6
## [37] dplyr_1.0.7 rappdirs_0.3.3
## [39] Rcpp_1.0.7 vctrs_0.3.8
## [41] Biostrings_2.60.1 DelayedMatrixStats_1.14.2
## [43] xfun_0.24 stringr_1.4.0
## [45] ps_1.6.0 testthat_3.0.4
## [47] beachmat_2.8.0 lifecycle_1.0.0
## [49] irlba_2.3.3 statmod_1.4.36
## [51] XML_3.99-0.6 edgeR_3.34.0
## [53] zlibbioc_1.38.0 scales_1.1.1
## [55] hms_1.1.0 curl_4.3.2
## [57] memoise_2.0.0 reticulate_1.20
## [59] gridExtra_2.3 stringi_1.7.3
## [61] RSQLite_2.2.7 highr_0.9
## [63] desc_1.3.0 ScaledMatrix_1.0.0
## [65] filelock_1.0.2 pkgbuild_1.2.0
## [67] BiocParallel_1.26.1 rlang_0.4.11
## [69] pkgconfig_2.0.3 bitops_1.0-7
## [71] evaluate_0.14 lattice_0.20-44
## [73] purrr_0.3.4 bit_4.0.4
## [75] processx_3.5.2 tidyselect_1.1.1
## [77] plyr_1.8.6 magrittr_2.0.1
## [79] R6_2.5.0 generics_0.1.0
## [81] metapod_1.0.0 DelayedArray_0.18.0
## [83] DBI_1.1.1 pillar_1.6.2
## [85] withr_2.4.2 KEGGREST_1.32.0
## [87] RCurl_1.98-1.3 tibble_3.1.3
## [89] crayon_1.4.1 utf8_1.2.2
## [91] BiocFileCache_2.0.0 progress_1.2.2
## [93] locfit_1.5-9.4 grid_4.1.0
## [95] blob_1.2.2 callr_3.7.0
## [97] digest_0.6.27 munsell_0.5.0
## [99] sessioninfo_1.1.1
The integrated anndata object would also work for me!
It looks like the environment and the libraries are correct. I've just run this code step by step again on my machine and had no problems.
This error message is a bit cryptic to me. Are you sure the data were loaded correctly before the QC step? It should give a data.frame of dimensions 54329 x 1533, and then 54329 x 1288 after QC.
I will get in touch soon regarding the processed data.
I get
> dim(data)
[1] 54329 1533
However since
> as.character(meta$cellid)
character(0)
the following https://github.com/Iwo-K/HSPCdynamics/blob/57bb4b9f822a0157429a456d5b1100d8d1a833a9/01_SS2_processing.R#L51 removes all columns:
> data = data[, as.character(meta$cellid)]
> dim(data)
[1] 54329 0
So, I looked at
> colnames(meta)
[1] "X...cellid" "RBG" "SLX"
[4] "plate_sorted" "plate_rearranged" "well_sorted"
[7] "well_rearranged" "set_index" "CI_index"
[10] "index" "mouse_platelabel" "sort_method"
[13] "sample.name" "population" "expdate"
[16] "timepoint_tx_days" "biosample_id" "mouse_id"
[19] "sex" "start_age" "tom"
[22] "batch" "countfolder" "sample_id"
and noticed that there is no column named cellid, but instead a broken-looking X...cellid. So I inspected ./data/SS2/scB5Tom_SS2_cellmeta.csv and found that it starts with <U+FEFF>, a Byte Order Mark (BOM):
<U+FEFF>cellid,RBG,SLX,plate_sorted,plate_rearranged,well_sorted,well_rearranged,set_index,CI_index,index,mouse_platelabel,sort_method,sample.name,population,expdate,timepoint_tx_days,biosample_id,mouse_id,sex,start_age,tom,batch,countfolder,sample_id
SLX19435.i701_i502,RBG35462,SLX19435,plate1,plate1,A1,A1,setA,i701_i502,N701-S502,mouse1_32c,singlecell,mouse1_32c_7d_Linneg_Kit+_OR_Sca1+_Tom+_singlecellmode,Linneg_Kit+_OR_Sca1+_Tom+,20201020,7,7d_SS2_14681.32c,14681.32c,female,young,pos,7d,./data/SS2/7d,7d_SS2
...
I obtained this data from Mendeley Data as described here and unpacked it with UnZip 6.00 of 20 April 2009, by Debian, which ships with Ubuntu 18.04.6 LTS.
To fix this I replaced https://github.com/Iwo-K/HSPCdynamics/blob/57bb4b9f822a0157429a456d5b1100d8d1a833a9/01_SS2_processing.R#L23 with
meta = read.csv("./data/SS2/scB5Tom_SS2_cellmeta.csv", as.is = TRUE, fileEncoding = "UTF-8-BOM")
and now the preprocessing script seems to work for me.
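The fix and the failure mode above can be combined into a small defensive loader. This is a minimal sketch, not the repository's code: `read_cellmeta` and `subset_by_cellid` are hypothetical helper names, and the toy file stands in for scB5Tom_SS2_cellmeta.csv. It reads the CSV with fileEncoding = "UTF-8-BOM" and stops early if the cellid column is missing or does not match the count matrix, rather than silently producing a table with zero columns.

```r
# Hypothetical helper (not part of HSPCdynamics): read cell metadata in a
# BOM-tolerant way and fail loudly if the expected ID column is absent.
read_cellmeta <- function(path, id_col = "cellid") {
  # "UTF-8-BOM" makes the file connection drop a leading byte order mark
  meta <- read.csv(path, as.is = TRUE, fileEncoding = "UTF-8-BOM")
  if (!id_col %in% colnames(meta)) {
    stop("Column '", id_col, "' not found; columns are: ",
         paste(colnames(meta), collapse = ", "))
  }
  meta
}

# Hypothetical helper: subset the count matrix by cell ID, refusing to
# continue if no IDs are given or any requested ID is missing.
subset_by_cellid <- function(data, ids) {
  ids <- as.character(ids)
  missing <- setdiff(ids, colnames(data))
  if (length(ids) == 0 || length(missing) > 0) {
    stop("Cell IDs do not match the count matrix (", length(ids),
         " requested, ", length(missing), " missing)")
  }
  data[, ids, drop = FALSE]
}

# Toy stand-in for the real metadata file, written with a UTF-8 BOM.
toy_path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("cellid,RBG\ncellA,RBG1\ncellB,RBG2\n")),
         toy_path)

toy_meta <- read_cellmeta(toy_path)
toy_data <- matrix(1:6, nrow = 2,
                   dimnames = list(c("gene1", "gene2"),
                                   c("cellA", "cellB", "cellC")))
toy_qc <- subset_by_cellid(toy_data, toy_meta$cellid)
print(colnames(toy_meta))
print(dim(toy_qc))
```

With the real data, the two calls would take "./data/SS2/scB5Tom_SS2_cellmeta.csv" and the loaded count table in place of the toy objects.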
I did not expect this to be honest, thank you very much for investigating. The original file also has the <U+FEFF> at the beginning, but calling the regular meta = read.csv("./data/SS2/scB5Tom_SS2_cellmeta.csv", as.is = TRUE) returns everything correctly. I will update the README to point to this issue as a fix.
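Since the same file parses cleanly on one machine and not on another, it may help to compare the two environments directly. The sketch below is illustrative only: `has_utf8_bom` is a hypothetical helper, and the idea that the difference comes from the session locale (the sessionInfo() output above reports locale C inside the container) is an assumption that this thread does not confirm. It checks the first three bytes of a file for the UTF-8 BOM (EF BB BF) and prints the character-type locale of the session.

```r
# Hypothetical helper: TRUE if the file begins with the UTF-8 byte order mark.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Toy files standing in for the real metadata CSV.
with_bom <- tempfile(fileext = ".csv")
without_bom <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("cellid,RBG\n")), with_bom)
writeBin(charToRaw("cellid,RBG\n"), without_bom)

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))

# Locale details worth comparing between the container and the host.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])
```

On the real data the check would be has_utf8_bom("./data/SS2/scB5Tom_SS2_cellmeta.csv"), run once inside the container and once on the machine where the plain read.csv call works.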
I've uploaded the combined_filtered_landscape.h5ad to Mendeley Data; this should go through moderation in the next couple of days. I will post the link here (and in the README) when it is ready. This is an AnnData object with all data integrated together (after removal of undesired populations). This should be more convenient and make it easier to follow the analysis from script 05 onwards.
Processed data (combined_filtered_landscape.h5ad) is now available here: https://doi.org/10.17632/vwg6xzmrf9.2
Hi @Iwo-K, thank you so much for the processed data, it looks great and is going to be very helpful!!
I also tried completing the pipeline but I noticed that the last step of the preprocessing failed after all:
> adata = AnnData(X = t(dataQC), obs = metaQC, var = genedata)
## Error in py_module_import(module, convert = convert): ModuleNotFoundError: No module named 'anndata'
##
## Detailed traceback:
## File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 39, in _import_hook
## module = _import(
Hence the Python that the anndata R package uses through reticulate does not seem to have the anndata Python package available. So I checked which Python the containerized R is using by running
$ singularity exec rpy_v4_p3_fix2.sif R -e 'library(reticulate); py_config()'
R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
During startup - Warning messages:
1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'colorout'
2: package 'colorout' in options("defaultPackages") was not found
3: Setting LC_CTYPE failed, using "C"
4: Setting LC_COLLATE failed, using "C"
5: Setting LC_TIME failed, using "C"
6: Setting LC_MESSAGES failed, using "C"
7: Setting LC_MONETARY failed, using "C"
8: Setting LC_PAPER failed, using "C"
9: Setting LC_MEASUREMENT failed, using "C"
> library(reticulate); py_config()
python: /home/dotto/.local/share/r-miniconda/envs/r-reticulate/bin/python
libpython: /home/dotto/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.9.so
pythonhome: /home/dotto/.local/share/r-miniconda/envs/r-reticulate:/home/dotto/.local/share/r-miniconda/envs/r-reticulate
version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0]
numpy: /home/dotto/.local/lib/python3.9/site-packages/numpy
numpy_version: 1.24.4
>
and indeed, even though R is running inside my container, it is using the Python interpreter I have installed outside of the container. So I tried fixing this by placing this line at the top of 01_SS2_processing.R:
reticulate::use_python("/bin/python3")
But I only got this message when saving the anndata:
## Warning: Python '/bin/python3' was requested but '/home/dotto/.local/
## share/r-miniconda/envs/r-reticulate/bin/python' was loaded instead (see
## reticulate::py_config() for more information)
So, I am not sure how to fix this. But this is just FYI, since I was only trying to run this to get the processed data.
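One option that may help here, offered as an untested sketch rather than a confirmed fix: reticulate consults the RETICULATE_PYTHON environment variable when it first initializes Python, so setting it before library(anndata) or any other Python-backed call should take precedence over the r-miniconda environment found in the home directory. `pin_python` is a hypothetical helper, and /bin/python3 is simply the path tried above; whether that interpreter has the anndata module installed is an assumption.

```r
# Hypothetical helper: point reticulate at a specific interpreter.
# Must run before reticulate initializes Python in this R session.
pin_python <- function(path) {
  if (!file.exists(path)) {
    stop("No Python interpreter at: ", path)
  }
  Sys.setenv(RETICULATE_PYTHON = path)
  invisible(path)
}

python_path <- "/bin/python3"  # path from the thread; adjust as needed

if (file.exists(python_path) && requireNamespace("reticulate", quietly = TRUE)) {
  pin_python(python_path)
  # Report which interpreter was bound and whether it can import anndata.
  print(reticulate::py_config())
  print(reticulate::py_module_available("anndata"))
} else {
  message("Skipping check: reticulate or ", python_path, " is not available")
}
```

Because Singularity binds the home directory into the container by default, the host-side r-miniconda and ~/.local packages are visible to the containerized R, which is consistent with the py_config() output above. Calling reticulate::use_python() with required = TRUE before Python is initialized is another route that may be worth trying.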
Thank you!
Following the How-to in the Readme and using Singularity v3.5.3 I run into the following error during the 04_integration step:
and indeed:
Even though 01_SS2_processing runs successfully, inspecting 01_SS2_processing.md reveals:
This then leads to an error when saving the anndata object later in the script. How can I fix this?