Iwo-K / HSPCdynamics

2 stars 0 forks source link

Error making `procdata/01script/SS2data.h5ad` #2

Closed katosh closed 2 weeks ago

katosh commented 8 months ago

Following the How-to in the Readme and using Singularity v3.5.3 I run into the following error during the 04_integration step:

Running:  04_integration
[jupytext] Reading 04_integration.py in format py
[jupytext] Writing 04_integration.ipynb (destination file replaced [use --update to preserve cell outputs and ids])
[jupytext] Updating the timestamp of 04_integration.py
[NbConvertApp] Converting notebook 04_integration.ipynb to notebook
2024-02-26 13:12:17.006717: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set t$
e environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-26 13:12:17.008909: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 13:12:17.045070: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 13:12:17.045520: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-26 13:12:18.242159: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-nbconvert", line 8, in <module>
    sys.exit(main())
  File "/home/dotto/.local/lib/python3.8/site-packages/jupyter_core/application.py", line 254, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/home/dotto/.local/lib/python3.8/site-packages/traitlets/config/application.py", line 845, in launch_instance
    app.start()
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 350, in start
    self.convert_notebooks()
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 524, in convert_notebooks
    self.convert_single_notebook(notebook_filename)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 489, in convert_single_notebook
    output, resources = self.export_single_notebook(notebook_filename, resources, input_buffer=input_buffer)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/nbconvertapp.py", line 418, in export_single_notebook
    output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 181, in from_filename
    return self.from_file(f, resources=resources, **kw)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 199, in from_file
    return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/exporters/notebook.py", line 32, in from_notebook_node
    nb_copy, resources = super().from_notebook_node(nb, resources, **kw)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 143, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 318, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
    return self.preprocess(nb, resources)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 79, in preprocess
    self.execute()
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/util.py", line 74, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/util.py", line 53, in just_run
    return loop.run_until_complete(coro)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/client.py", line 553, in async_execute
    await self.async_execute_cell(
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 123, in async_execute_cell
    cell, resources = self.preprocess_cell(cell, self.resources, cell_index)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 146, in preprocess_cell
    cell = run_sync(NotebookClient.async_execute_cell)(self, cell, index, store_history=self.store_history)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/util.py", line 74, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/util.py", line 53, in just_run
    return loop.run_until_complete(coro)
  File "/home/dotto/.local/lib/python3.8/site-packages/nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "/usr/lib/python3.8/asyncio/futures.py", line 178, in result
    raise self._exception
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/client.py", line 852, in async_execute_cell
    self._check_raise_for_error(cell, exec_reply)
  File "/home/dotto/.local/lib/python3.8/site-packages/nbclient/client.py", line 760, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
# loading 10x data
meta10x = pd.read_csv('./data/10x/scB5Tom_10x_samplemeta.csv')
meta10x.index = meta10x.biosample_id
adata10x = sc.read('./procdata/02script/combined10x_qced_logn.h5ad')

# loading SS2 data
adataSS2 = sc.read('./procdata/01script/SS2data.h5ad')
adataSS2 = adataSS2[:, adata10x.var.gene_ids]
adataSS2.var.index = adata10x.var.index

# We will be using logn data for everything
utils.proc.lognorm(adataSS2)
------------------

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-abc33e9fd834> in <module>
      5
      6 # loading SS2 data
----> 7 adataSS2 = sc.read('./procdata/01script/SS2data.h5ad')
      8 adataSS2 = adataSS2[:, adata10x.var.gene_ids]
      9 adataSS2.var.index = adata10x.var.index

/usr/local/lib/python3.8/dist-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, cache_compression, **kwargs)
    110     filename = Path(filename)  # allow passing strings
    111     if is_valid_filename(filename):
--> 112         return _read(
    113             filename,
    114             backed=backed,

/usr/local/lib/python3.8/dist-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, cache_compression, suppress_cache_warning, **kwargs)
    711     if ext in {'h5', 'h5ad'}:
    712         if sheet is None:
--> 713             return read_h5ad(filename, backed=backed)
    714         else:
    715             logg.debug(f'reading sheet {sheet} from file {filename}')

~/.local/lib/python3.8/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    411     )
    412
--> 413     with h5py.File(filename, "r") as f:
    414         d = {}
    415         for k in f.keys():

~/.local/lib/python3.8/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    422             with phil:
    423                 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
--> 424                 fid = make_fid(name, mode, userblock_size,
    425                                fapl, fcpl=make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    426                                fs_persist=fs_persist, fs_threshold=fs_threshold),

~/.local/lib/python3.8/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    188         if swmr and swmr_support:
    189             flags |= h5f.ACC_SWMR_READ
--> 190         fid = h5f.open(name, flags, fapl=fapl)
    191     elif mode == 'r+':
    192         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

OSError: Unable to open file (unable to open file: name = 'procdata/01script/SS2data.h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
OSError: Unable to open file (unable to open file: name = 'procdata/01script/SS2data.h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

and indeed:

$ tree -h procdata
procdata
├── [  32]  01script
│   └── [6.9K]  QCperplate.pdf
├── [  44]  02script
│   └── [3.6G]  combined10x_qced_logn.h5ad
├── [   0]  04script
├── [   0]  05script
├── [   0]  06script
├── [   0]  07script
├── [   0]  22script
├── [   0]  23script
├── [   0]  24script
├── [   0]  26script
└── [   0]  36script

11 directories, 2 files

Even though 01_SS2_processing runs successfully, but inspecting 01_SS2_processing.md reveals:

dataQC = scQC(data,
              metadata = meta,
              splitby = "batch_plate_sorted",
              mitogenes = genedata[genedata$is_mito,"geneid"],
              id = "cellid",
              tr_mapped = 0.23,
              tr_mincounts=1e+05,
              tr_maxcounts=3e+07,
              tr_erccfrac=0.085,
              tr_higeneno = 2000,
              tr_mitofrac = 0.1,
              filebase = paste0(procdata_dir, "QCperplate"))
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following object is masked from 'package:biomaRt':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## [1] "Number of cells with less than 10% read mapped to mouse genome:  0"
## [1] "Number of cells with less than 100 000 reads:  0"
## Error in h(simpleError(msg, call)): error in evaluating the argument 'table' in selecting a method for function 'match': undefined columns selected

This then leads to an error when saving the anndata object later in the script. How can I fix this?

Iwo-K commented 8 months ago

Hi, Thanks for reporting this. Would you mind running the beginning of the 01_SS2_processing script to load all the libraries in the terminal (using Singularity shell) and post output of sessionInfo()?

We can also try to share the integrated AnnData object, if this is easier?

katosh commented 8 months ago

Sure:

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.20.1                scuttle_1.2.0              
##  [3] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0
##  [5] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [7] GenomeInfoDb_1.28.1         IRanges_2.26.0             
##  [9] S4Vectors_0.30.0            BiocGenerics_0.38.0        
## [11] MatrixGenerics_1.4.1        matrixStats_0.60.0         
## [13] anndata_0.7.5.2             viridis_0.6.1              
## [15] viridisLite_0.4.0           ggrepel_0.9.1              
## [17] ggplot2_3.3.5               reshape2_1.4.4             
## [19] biomaRt_2.48.2              devtools_2.4.2             
## [21] usethis_2.0.1               BiocManager_1.30.16        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-2          ellipsis_0.3.2           
##  [3] rprojroot_2.0.2           bluster_1.2.1            
##  [5] XVector_0.32.0            BiocNeighbors_1.10.0     
##  [7] fs_1.5.0                  remotes_2.4.0            
##  [9] bit64_4.0.5               AnnotationDbi_1.54.1     
## [11] fansi_0.5.0               xml2_1.3.2               
## [13] sparseMatrixStats_1.4.0   cachem_1.0.5             
## [15] knitr_1.33                pkgload_1.2.1            
## [17] jsonlite_1.7.2            cluster_2.1.2            
## [19] dbplyr_2.1.1              png_0.1-7                
## [21] compiler_4.1.0            httr_1.4.2               
## [23] dqrng_0.3.0               assertthat_0.2.1         
## [25] Matrix_1.3-4              fastmap_1.1.0            
## [27] limma_3.48.1              cli_3.0.1                
## [29] BiocSingular_1.8.1        prettyunits_1.1.1        
## [31] tools_4.1.0               rsvd_1.0.5               
## [33] igraph_1.2.6              gtable_0.3.0             
## [35] glue_1.4.2                GenomeInfoDbData_1.2.6   
## [37] dplyr_1.0.7               rappdirs_0.3.3           
## [39] Rcpp_1.0.7                vctrs_0.3.8              
## [41] Biostrings_2.60.1         DelayedMatrixStats_1.14.2
## [43] xfun_0.24                 stringr_1.4.0            
## [45] ps_1.6.0                  testthat_3.0.4           
## [47] beachmat_2.8.0            lifecycle_1.0.0          
## [49] irlba_2.3.3               statmod_1.4.36           
## [51] XML_3.99-0.6              edgeR_3.34.0             
## [53] zlibbioc_1.38.0           scales_1.1.1             
## [55] hms_1.1.0                 curl_4.3.2               
## [57] memoise_2.0.0             reticulate_1.20          
## [59] gridExtra_2.3             stringi_1.7.3            
## [61] RSQLite_2.2.7             highr_0.9                
## [63] desc_1.3.0                ScaledMatrix_1.0.0       
## [65] filelock_1.0.2            pkgbuild_1.2.0           
## [67] BiocParallel_1.26.1       rlang_0.4.11             
## [69] pkgconfig_2.0.3           bitops_1.0-7             
## [71] evaluate_0.14             lattice_0.20-44          
## [73] purrr_0.3.4               bit_4.0.4                
## [75] processx_3.5.2            tidyselect_1.1.1         
## [77] plyr_1.8.6                magrittr_2.0.1           
## [79] R6_2.5.0                  generics_0.1.0           
## [81] metapod_1.0.0             DelayedArray_0.18.0      
## [83] DBI_1.1.1                 pillar_1.6.2             
## [85] withr_2.4.2               KEGGREST_1.32.0          
## [87] RCurl_1.98-1.3            tibble_3.1.3             
## [89] crayon_1.4.1              utf8_1.2.2               
## [91] BiocFileCache_2.0.0       progress_1.2.2           
## [93] locfit_1.5-9.4            grid_4.1.0               
## [95] blob_1.2.2                callr_3.7.0              
## [97] digest_0.6.27             munsell_0.5.0            
## [99] sessioninfo_1.1.1
katosh commented 8 months ago

The integrated anndata object would also work for me!

Iwo-K commented 8 months ago

It looks like the environment and the libraries are correct. I've just run this code step by step again on my machine and had no problems.

This error message is a bit cryptic to me. Are you sure data has been loaded correctly before the QC step? It should give a data.frame of dimensions: 54329, 1533 and then after QC 54329, 1288.

I will get in touch soon regarding the processed data.

katosh commented 8 months ago

Following https://github.com/Iwo-K/HSPCdynamics/blob/57bb4b9f822a0157429a456d5b1100d8d1a833a9/01_SS2_processing.R#L43

I get

> dim(data)
[1] 54329  1533

However since

> as.character(meta$cellid)
character(0)

the following https://github.com/Iwo-K/HSPCdynamics/blob/57bb4b9f822a0157429a456d5b1100d8d1a833a9/01_SS2_processing.R#L51 removes all columns:

> data = data[, as.character(meta$cellid)]
> dim(data)
[1] 54329     0

So, I looked at

> colnames(meta)
 [1] "X...cellid"        "RBG"               "SLX"
 [4] "plate_sorted"      "plate_rearranged"  "well_sorted"
 [7] "well_rearranged"   "set_index"         "CI_index"
[10] "index"             "mouse_platelabel"  "sort_method"
[13] "sample.name"       "population"        "expdate"
[16] "timepoint_tx_days" "biosample_id"      "mouse_id"
[19] "sex"               "start_age"         "tom"
[22] "batch"             "countfolder"       "sample_id"

and noticed that the there is no column named cellid but instead a broken looking X...cellid . So, I inspected ./data/SS2/scB5Tom_SS2_cellmeta.csv and found that it starts with <U+FEFF>, a Byte Order Mark (BOM):

<U+FEFF>cellid,RBG,SLX,plate_sorted,plate_rearranged,well_sorted,well_rearranged,set_index,CI_index,index,mouse_platelabel,sort_method,sample.name,population,expdate,timepoint_tx_days,biosample_id,mouse_id,sex,start_age,tom,batch,countfolder,sample_id
SLX19435.i701_i502,RBG35462,SLX19435,plate1,plate1,A1,A1,setA,i701_i502,N701-S502,mouse1_32c,singlecell,mouse1_32c_7d_Linneg_Kit+_OR_Sca1+_Tom+_singlecellmode,Linneg_Kit+_OR_Sca1+_Tom+,20201020,7,7d_SS2_14681.32c,14681.32c,female,young,pos,7d,./data/SS2/7d,7d_SS2
...

I obtained this data from Mendely data as described here and unpacked it with UnZip 6.00 of 20 April 2009, by Debian. that ships with Ubuntu 18.04.6 LTS.

To fix this I replaced https://github.com/Iwo-K/HSPCdynamics/blob/57bb4b9f822a0157429a456d5b1100d8d1a833a9/01_SS2_processing.R#L23 with

meta = read.csv("./data/SS2/scB5Tom_SS2_cellmeta.csv", as.is = TRUE, fileEncoding = "UTF-8-BOM")

and now the preprocessing script seems to works for me.

Iwo-K commented 8 months ago

I did not expect this to be honest, thank you very much for investigating. The original file also has the <U+FEFF> at the beginning, but calling regular meta = read.csv("./data/SS2/scB5Tom_SS2_cellmeta.csv", as.is = TRUE) returns everything correctly. I will update the README to point to this issue as a fix.

I've uploaded the combined_filtered_landscape.h5ad to Mendeley Data, this should go through moderation in the next couple of days. I will post here (and the README) the link when this is ready. This is an AnnData object with all data integrated together (after removal of undesired populations) . This should more convenient and make it easier to follow the analysis from script 05 onwards.

Iwo-K commented 8 months ago

Processed data (combined_filtered_landscape.h5ad) is now available here: https://doi.org/10.17632/vwg6xzmrf9.2

katosh commented 8 months ago

Hi @Iwo-K, thank you so much for the processed data, it looks great and is going to be very helpful!!

I also tried completing the pipeline but I noticed that the last step of the preprocessing failed after all:

> adata = AnnData(X = t(dataQC), obs = metaQC, var = genedata)
## Error in py_module_import(module, convert = convert): ModuleNotFoundError: No module named 'anndata'
## 
## Detailed traceback:
##   File "/usr/local/lib/R/site-library/reticulate/python/rpytools/loader.py", line 39, in _import_hook
##     module = _import(

Hence the Python that is being used through reticulate by the anndata R package, does not seem to have the anndata Python package available. So, I checked which Python the containered R is using by running

$ singularity exec rpy_v4_p3_fix2.sif R -e 'library(reticulate); py_config()'

R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

During startup - Warning messages:
1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called 'colorout'
2: package 'colorout' in options("defaultPackages") was not found
3: Setting LC_CTYPE failed, using "C"
4: Setting LC_COLLATE failed, using "C"
5: Setting LC_TIME failed, using "C"
6: Setting LC_MESSAGES failed, using "C"
7: Setting LC_MONETARY failed, using "C"
8: Setting LC_PAPER failed, using "C"
9: Setting LC_MEASUREMENT failed, using "C"
> library(reticulate); py_config()
python:         /home/dotto/.local/share/r-miniconda/envs/r-reticulate/bin/python
libpython:      /home/dotto/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.9.so
pythonhome:     /home/dotto/.local/share/r-miniconda/envs/r-reticulate:/home/dotto/.local/share/r-miniconda/envs/r-reticulate
version:        3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10)  [GCC 12.3.0]
numpy:          /home/dotto/.local/lib/python3.9/site-packages/numpy
numpy_version:  1.24.4
>

and ideed, even though it is running inside my container, it is using the Python interpreter I have installed outside of the container. So, I tried fixing this by placing these lines on top of 01_SS2_processing.R:

reticulate::use_python("/bin/python3")

But I only got this message when saving the anndata:

## Warning: Python '/bin/python3' was requested but '/home/dotto/.local/
## share/r-miniconda/envs/r-reticulate/bin/python' was loaded instead (see
## reticulate::py_config() for more information)

So, I am not sure how to fix this. But this is just FYI, and since I was only trying to run this to get the processed data.

Thank you!