Bioconductor / basilisk

Clone of the Bioconductor repository for the basilisk package.
https://bioconductor.org/packages/devel/bioc/html/basilisk.html
GNU General Public License v3.0
27 stars 14 forks source link

Error on basilisk env for spatialDE wrapper #20

Open davidecrs opened 2 years ago

davidecrs commented 2 years ago

Hi @LTLA ,

I'm not sure if this is an error of basilisk, but it happened when the basilisk environment is created to run functions. Specifically, recently the Linux build of SpatialDE http://bioconductor.org/checkResults/release/bioc-LATEST/spatialDE/nebbiolo1-buildsrc.html showed this error (without making any changes to the package).

Using a docker container, I was able to successfully install the spatialDE package from Bioc, but when I tried to run our example

library(spatialDE)
spe <- mockSVG(return_SPE = TRUE)
de_results <- spatialDE(spe)

it creates the basilisk and the conda env with the python libraries, but it fails showing the same error as the Linux build

Error in py_module_import(module, convert = convert) : 
  ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /root/.cache/R/basilisk/1.8.0/spatialDE/1.2.0/env/lib/python3.8/site-packages/scipy/optimize/_highs/_highs_wrapper.cpython-38-x86_64-linux-gnu.so)
This is the sessionInfo() R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 11 (bullseye) Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] spatialDE_1.2.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.8.3 locfit_1.5-9.5 [3] here_1.0.1 dir.expiry_1.4.0 [5] lattice_0.20-45 png_0.1-7 [7] rprojroot_2.0.3 assertthat_0.2.1 [9] SingleCellExperiment_1.18.0 utf8_1.2.2 [11] R6_2.5.1 GenomeInfoDb_1.32.1 [13] backports_1.4.1 stats4_4.2.0 [15] ggplot2_3.3.6 pillar_1.7.0 [17] sparseMatrixStats_1.8.0 basilisk_1.8.0 [19] zlibbioc_1.42.0 rlang_1.0.2 [21] magick_2.7.3 S4Vectors_0.34.0 [23] R.utils_2.11.0 R.oo_1.24.0 [25] Matrix_1.4-1 checkmate_2.1.0 [27] reticulate_1.24 BiocParallel_1.30.0 [29] RCurl_1.98-1.6 munsell_0.5.0 [31] beachmat_2.12.0 DelayedArray_0.22.0 [33] HDF5Array_1.24.0 compiler_4.2.0 [35] DropletUtils_1.16.0 pkgconfig_2.0.3 [37] BiocGenerics_0.42.0 tidyselect_1.1.2 [39] SummarizedExperiment_1.26.1 gridExtra_2.3 [41] tibble_3.1.7 GenomeInfoDbData_1.2.8 [43] edgeR_3.38.0 IRanges_2.30.0 [45] matrixStats_0.62.0 fansi_1.0.3 [47] crayon_1.5.1 dplyr_1.0.9 [49] basilisk.utils_1.8.0 rhdf5filters_1.8.0 [51] bitops_1.0-7 R.methodsS3_1.8.1 [53] grid_4.2.0 jsonlite_1.8.0 [55] gtable_0.3.0 lifecycle_1.0.1 [57] DBI_1.1.2 magrittr_2.0.3 [59] scales_1.2.0 dqrng_0.3.0 [61] cli_3.3.0 scuttle_1.6.0 [63] XVector_0.36.0 SpatialExperiment_1.6.0 [65] limma_3.52.0 filelock_1.0.2 [67] DelayedMatrixStats_1.18.0 ellipsis_0.3.2 [69] generics_0.1.2 vctrs_0.4.1 [71] Rhdf5lib_1.18.0 rjson_0.2.21 [73] tools_4.2.0 Biobase_2.56.0 [75] glue_1.6.2 purrr_0.3.4 [77] MatrixGenerics_1.8.0 parallel_4.2.0 [79] colorspace_2.0-3 rhdf5_2.40.0 [81] BiocManager_1.30.17 GenomicRanges_1.48.0

Do I have to add something to the code of the basiliskEnv?

Best regards

vjcitn commented 2 years ago

I have seen this. The solution is to find a suitable libstdc++.so.6 in your cache and put it in /usr/lib/x86_64-linux-gnu. I will provide some instructions in the next comment.

vjcitn commented 2 years ago

I should have said "a solution". I don't like this at all but it works. On linux I used

stvjc@stvjc-XPS-13-9300:~$ find . -name libstdc++.so.6 -print
./.cache/R/basilisk/1.9.4/0/lib/libstdc++.so.6
./.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-12.1.0-ha89aaad_16/lib/libstdc++.so.6
./.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-9.3.0-hd4cf53a_17/lib/libstdc++.so.6
./.cache/R/basilisk/1.9.4/BiocSklearn/1.19.12/bsklenv/lib/libstdc++.so.6
^C
stvjc@stvjc-XPS-13-9300:~$ strings ./.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-12.1.0-ha89aaad_16/lib/libstdc++.so.6 | grep GLIBCXX_3.4.30 |head
GLIBCXX_3.4.30
GLIBCXX_3.4.30

That's the file to place in /usr/lib/x86_64-linux-gnu. There may be better solutions involving a system update (perhaps to the C++ infrastructure) but I did not find any.

LTLA commented 2 years ago

This is, unfortunately, a fairly difficult problem. Here is my diagnosis:

Normally, the conda installation comes packaged with its own libstdc++. This guarantees that any conda packages (and the conda-packaged Python) will work as they have been compiled against this specific version of libstdc++.

Now, consider an R installation on the same computer. This has been compiled against the system libstdc++, which may not be the same as conda's. Normally, this would not be a problem, as R would use the system libstdc++ while conda would use its own... except...

When we use reticulate, it seems that we force the conda-derived C libraries (e.g., for scipy) to use R's libstdc++. I'm not really sure how the dynamic linker works, but I would speculate that it detects that something from libstdc++ is already present (via R) and thus it won't/can't load in another version (from conda).

Environment activation doesn't seem to help; I don't think there's any environment variables that would influence this outcome. Maybe one could fiddle with LD_LIBRARY_PATH but this seems like a dangerous game.

If my diagnosis is correct, then the solutions suck:

  1. Don't use reticulate. Everything seems fine if Python is called as a separate process, in which case it is able to resolve its own rpaths. Sometimes this is an acceptable solution if you're not doing a lot of data transfer between R and Python; for example, crisprScore uses this approach as reticulate no longer supports Python 2.7.
  2. Require the user to have a consistent libstdc++ between their R installation and conda. Not good.
  3. Don't use conda at all. This would require the user to have an appropriate version of Python already installed. Not good.

I suppose we could mimic 2 by re-installing (!) R inside conda and then serializing a function across to the conda-specific R, which would allow you to use reticulate inside a version of R that runs on the same libstdc++ as the Python packages. Nasty stuff, not least because it would increase the size of the conda environments by ~500 MB at least. But at least it'll run. (And maybe the burden on the user can be reduced by having a central R installation managed by basilisk itself, then at least it'll just be created once for the entire installation and not for each environment.)

hpages commented 2 years ago
  1. Don't use conda at all. This would require the user to have an appropriate version of Python already installed. Not good.

Why? Python 3 is on almost every system those days, and, if not (e.g. on Windows) is super easy to install.

hpages commented 2 years ago

Interestingly we see this problem on nebbiolo1 (BioC 3.15) but not on nebbiolo2 (BioC 3.16) where /usr/lib/x86_64-linux-gnu/libstdc++.so.6 is exactly the same (both machines have Ubuntu 20.04):

Also spatialDE has not changed between release and devel so something must have changed somewhere (in basilisk, reticulate, miniconda, etc...?) that solves the problem. This gives me hope that we won't need to replace a core component like /usr/lib/x86_64-linux-gnu/libstdc++.so.6 with something copied from a random place :wink:

hpages commented 2 years ago

And FWIW I don't get this on my laptop using BioC release (like nebbiolo1):

library(spatialDE)
spe <- mockSVG(return_SPE = TRUE)
# Note: spatialData and spatialDataNames have been deprecated; all columns should be stored in
# colData and spatialCoords
de_results <- spatialDE(spe)
dim(de_results)
# [1] 1000   18

Major difference with nebbiolo1 is that I have Ubuntu 22.04 (libstdc++.so.6.0.30) and nebbiolo1 has Ubuntu 20.04 (libstdc++.so.6.0.28).

sessionInfo():

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.0/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.0/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] spatialDE_1.2.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9                  locfit_1.5-9.6             
 [3] here_1.0.1                  dir.expiry_1.4.0           
 [5] lattice_0.20-45             png_0.1-7                  
 [7] rprojroot_2.0.3             assertthat_0.2.1           
 [9] SingleCellExperiment_1.18.0 utf8_1.2.2                 
[11] R6_2.5.1                    GenomeInfoDb_1.32.3        
[13] backports_1.4.1             stats4_4.2.0               
[15] ggplot2_3.3.6               pillar_1.8.1               
[17] basilisk_1.8.1              sparseMatrixStats_1.8.0    
[19] zlibbioc_1.42.0             rlang_1.0.5                
[21] magick_2.7.3                S4Vectors_0.34.0           
[23] R.utils_2.12.0              R.oo_1.25.0                
[25] Matrix_1.4-1                checkmate_2.1.0            
[27] reticulate_1.26             BiocParallel_1.30.3        
[29] RCurl_1.98-1.8              munsell_0.5.0              
[31] beachmat_2.12.0             DelayedArray_0.22.0        
[33] HDF5Array_1.24.2            compiler_4.2.0             
[35] DropletUtils_1.16.0         pkgconfig_2.0.3            
[37] BiocGenerics_0.42.0         tidyselect_1.1.2           
[39] SummarizedExperiment_1.26.1 gridExtra_2.3              
[41] tibble_3.1.8                GenomeInfoDbData_1.2.8     
[43] edgeR_3.38.4                IRanges_2.30.1             
[45] codetools_0.2-18            matrixStats_0.62.0         
[47] fansi_1.0.3                 dplyr_1.0.10               
[49] basilisk.utils_1.8.0        rhdf5filters_1.8.0         
[51] bitops_1.0-7                R.methodsS3_1.8.2          
[53] grid_4.2.0                  jsonlite_1.8.0             
[55] gtable_0.3.1                lifecycle_1.0.1            
[57] DBI_1.1.3                   magrittr_2.0.3             
[59] scales_1.2.1                dqrng_0.3.0                
[61] cli_3.3.0                   scuttle_1.6.3              
[63] XVector_0.36.0              SpatialExperiment_1.6.1    
[65] limma_3.52.2                filelock_1.0.2             
[67] DelayedMatrixStats_1.18.0   generics_0.1.3             
[69] vctrs_0.4.1                 Rhdf5lib_1.18.2            
[71] rjson_0.2.21                tools_4.2.0                
[73] Biobase_2.56.0              glue_1.6.2                 
[75] purrr_0.3.4                 MatrixGenerics_1.8.1       
[77] parallel_4.2.0              colorspace_2.0-3           
[79] rhdf5_2.40.0                GenomicRanges_1.48.0       
LTLA commented 2 years ago

Why? Python 3 is on almost every system those days, and, if not (e.g. on Windows) is super easy to install.

The minor version matters, e.g., pandas 1.4.3 requires >= 3.8. I don't think one can assume that the latest Python version is available. (My laptop has 3.6.9, for example.) And certainly not if client environments require different Python versions.

Also spatialDE has not changed between release and devel so something must have changed somewhere (in basilisk, reticulate, miniconda, etc...?) that solves the problem.

Hm. The last update to basilisk in release was to change an internal license to please the Debian folks. No code changes.

reticulate updated yesterday. Their NEWS.md doesn't indicate anything too offensive. Nonetheless, @davidecrs it might be worth trying to install reticulate 1.25 into the Docker image and seeing whether the problem persists.

It's always possible that conda itself changed the binaries some time after April, and it only got picked up by the build system now. This is because basilisk caches environments from all client packages so they aren't re-built during tests; however, all client caches are cleared upon a basilisk version bump. So, when I bumped basilisk, it would have prompted the construction of fresh Conda environments in clients like spatialDE, possibly with modified + incompatible binaries.

On that note...

Interestingly we see this problem on nebbiolo1 (BioC 3.15) but not on nebbiolo2 (BioC 3.16) where /usr/lib/x86_64-linux-gnu/libstdc++.so.6 is exactly the same (both machines have Ubuntu 20.04):

I notice that - as of time of writing - the last commits listed on BioC-devel are:

Snapshot Date: 2022-08-31 14:00:03 -0400 (Wed, 31 Aug 2022)
git_url: https://git.bioconductor.org/packages/basilisk
git_branch: master
git_last_commit: 27244b5
git_last_commit_date: 2022-06-25 03:24:24 -0400 (Sat, 25 Jun 2022)

This suggests one possibility: the only reason that we haven't encountered errors in nebbiolo2 is because we're still using the cached environments from June. My version bumps haven't made it through to BioC-devel yet. Once they do, the caches will be cleared and the environments will be reconstructed, possibly with the incompatible libstdc++.

I guess we'll find out soon enough when the next build rolls around. Amusingly enough, the June commit message is:

commit 27244b5ff0d640df0727d20acabba43520889dea
Author: LTLA <infinite.monkeys.with.keyboards@gmail.com>
Date:   Sat Jun 25 00:24:24 2022 -0700

    Pin the scipy version to avoid GLIBCXX errors.
davidecrs commented 2 years ago

Thanks everyone for your answers.

As @LTLA suggested, I tried the example code with reticulate 1.25 but it shows the same error

Error in py_module_import(module, convert = convert) : 
  ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /root/.cache/R/basilisk/1.8.0/spatialDE/1.2.0/env/lib/python3.8/site-packages/scipy/optimize/_highs/_highs_wrapper.cpython-38-x86_64-linux-gnu.so)
sessionInfo() ```r R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 11 (bullseye) Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] spatialDE_1.2.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.8.3 locfit_1.5-9.5 [3] here_1.0.1 dir.expiry_1.4.0 [5] lattice_0.20-45 png_0.1-7 [7] rprojroot_2.0.3 assertthat_0.2.1 [9] SingleCellExperiment_1.18.0 utf8_1.2.2 [11] R6_2.5.1 GenomeInfoDb_1.32.1 [13] backports_1.4.1 stats4_4.2.0 [15] ggplot2_3.3.6 pillar_1.7.0 [17] sparseMatrixStats_1.8.0 basilisk_1.8.0 [19] zlibbioc_1.42.0 rlang_1.0.2 [21] magick_2.7.3 S4Vectors_0.34.0 [23] R.utils_2.11.0 R.oo_1.24.0 [25] Matrix_1.4-1 checkmate_2.1.0 [27] reticulate_1.25 BiocParallel_1.30.0 [29] RCurl_1.98-1.6 munsell_0.5.0 [31] beachmat_2.12.0 DelayedArray_0.22.0 [33] HDF5Array_1.24.0 compiler_4.2.0 [35] DropletUtils_1.16.0 pkgconfig_2.0.3 [37] BiocGenerics_0.42.0 tidyselect_1.1.2 [39] SummarizedExperiment_1.26.1 gridExtra_2.3 [41] tibble_3.1.7 GenomeInfoDbData_1.2.8 [43] edgeR_3.38.0 IRanges_2.30.0 [45] matrixStats_0.62.0 fansi_1.0.3 [47] crayon_1.5.1 dplyr_1.0.9 [49] basilisk.utils_1.8.0 rhdf5filters_1.8.0 [51] bitops_1.0-7 R.methodsS3_1.8.1 [53] grid_4.2.0 jsonlite_1.8.0 [55] gtable_0.3.0 lifecycle_1.0.1 [57] DBI_1.1.2 magrittr_2.0.3 [59] scales_1.2.0 dqrng_0.3.0 [61] cli_3.3.0 scuttle_1.6.0 [63] XVector_0.36.0 remotes_2.4.2 [65] SpatialExperiment_1.6.0 limma_3.52.0 [67] filelock_1.0.2 DelayedMatrixStats_1.18.0 [69] ellipsis_0.3.2 generics_0.1.2 [71] vctrs_0.4.1 Rhdf5lib_1.18.0 [73] rjson_0.2.21 tools_4.2.0 [75] Biobase_2.56.0 glue_1.6.2 [77] purrr_0.3.4 MatrixGenerics_1.8.0 [79] parallel_4.2.0 colorspace_2.0-3 [81] rhdf5_2.40.0 BiocManager_1.30.17 [83] GenomicRanges_1.48.0 ```
vjcitn commented 2 years ago

I am hoping this will not be a chronic problem.

As long as the g++ infrastructure used by Bioc BBS (and users) is as "mature" as that employed in the miniconda3 system/ecosystem used by basilisk (over which we have control) we should be able to state conditions under which this failure won't occur. We jumped ahead in the choice of miniconda3 version because only a very recent one supported both intel and ARM macOS. How to learn about the libstdc++.so versions used by a) our apt-gets for g++ and b) conda components, does not seem particularly simple. I guess on linux we can see .so.6.0.nn as a symlink target for .so.6. Looks like the nn is available for the conda components too:

stvjc@stvjc-XPS-13-9300:~$ find ~/.cache -name libstdc++.so* -print
/home/stvjc/.cache/R/basilisk/1.9.4/0/lib/libstdc++.so
/home/stvjc/.cache/R/basilisk/1.9.4/0/lib/libstdc++.so.6
/home/stvjc/.cache/R/basilisk/1.9.4/0/lib/libstdc++.so.6.0.28
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-12.1.0-ha89aaad_16/lib/libstdc++.so.6.0.30
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-12.1.0-ha89aaad_16/lib/libstdc++.so
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-12.1.0-ha89aaad_16/lib/libstdc++.so.6
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-9.3.0-hd4cf53a_17/lib/libstdc++.so
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-9.3.0-hd4cf53a_17/lib/libstdc++.so.6
/home/stvjc/.cache/R/basilisk/1.9.4/0/pkgs/libstdcxx-ng-9.3.0-hd4cf53a_17/lib/libstdc++.so.6.0.28
/home/stvjc/.cache/R/basilisk/1.9.4/BiocSklearn/1.19.12/bsklenv/lib/libstdc++.so.6.0.30
/home/stvjc/.cache/R/basilisk/1.9.4/BiocSklearn/1.19.12/bsklenv/lib/libstdc++.so
/home/stvjc/.cache/R/basilisk/1.9.4/BiocSklearn/1.19.12/bsklenv/lib/libstdc++.so.6

So I'd think there's a way of catching the mismatch and providing guidance on how to solve. And, down the road, we can avoid mismatches in just the way that we have in the 2.5 year history of basilisk, by verifying compatibilities between BBS and miniconda components in use by basilisk.

LTLA commented 2 years ago

From some testing on the BioC-devel Docker image, it seems that conda is grabbing the very latest acceptable version of scipy. This version requires a later version of libstdc++ than is present in the system.

I realized that SpatialDE doesn't pin the versions of all its dependencies, which means that conda is free to resolve to the latest version of scipy. After some hellish trial and error, 1.8.1 seems like the last working version. So, I get:

> de_results[1:10,]
           FSV M         g   l    max_delta     max_ll max_mu_hat max_s2_t_hat
0 9.999543e-01 4 Gene_0001 0.5 4.539993e-05 11.2392590  -50.87857 1.680494e+03
1 2.049747e-09 4 Gene_0002 0.5 4.851652e+08 28.8429821  -52.46748 5.674087e-06
2 2.595417e-01 4 Gene_0004 0.5 2.837157e+00 14.9986268  -47.64689 5.179961e+02

Moral of the story? Pin your package versions.

On a more general note, this episode implies that, if reticulate must be used, conda is not able to fully insulate developers from the vagaries of the operating system. My attempt to install a conda-managed R didn't seem to work as it comes with its own libstdc++ that is still different from scipy's requirements, so we're back at the same problem.

LTLA commented 2 years ago

It seems that the build system is not without a sense of irony:

http://bioconductor.org/checkResults/devel/bioc-LATEST/basilisk/nebbiolo2-checksrc.html

In particular:

  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  Error in `checkForRemoteErrors(lapply(cl, recvResult))`: one node produced an error: ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/biocbuild/.cache/R/basilisk/1.9.4/son.of.basilisk/0.99.0/env2/lib/python3.8/site-packages/scipy/optimize/_highs/_highs_wrapper.cpython-38-x86_64-linux-gnu.so)

  Backtrace:
      ▆
   1. └─son.of.basilisk::test() at test-package.R:14:4
   2.   └─basilisk::basiliskRun(...)
   3.     └─parallel::clusterCall(proc, fun = fun, ...)
   4.       └─parallel:::checkForRemoteErrors(lapply(cl, recvResult))

  [ FAIL 1 | WARN 0 | SKIP 0 | PASS 45 ]
  Error: Test failures
  In addition: Warning message:
  In for (i in seq_len(n)) { :
    closing unused connection 5 (<-localhost:11320)

I have prepared a working solution based on a Conda-supplied internal R, which does seem to show some promise after a bit more elbow grease. You can try it out by installing LTLA/basilisk.utils#6 followed by #21. It requires you to pass in a character vector of "problematic" packages in your basiliskRun or basiliskStart call (typically testload="scipy.optimize" or something like that); if the load attempt fails due to a GLIBCXX error, basilisk will automatically fall back to its Conda-supplied R.

LTLA commented 2 years ago

Fixes to basilisk (and basilisk.utils) have been pushed to BioC-devel. It doesn't really "solve" the GLIBCXX mismatch but it provides a fallback mechanism to execute reticulate code that relies on the affected Python packages. Hopefully this will build in the next few days - keep an eye out for 1.9.5.

@davidecrs Some refactoring of your package will be necessary for basilisk to trigger the fallback. basiliskStart (and basiliskRun) now supports a testload= argument where you can pass a character vector of potentially problematic packages. basilisk will attempt to load them before running your function, switching to the fallback if a GLIBCXX mismatch is detected. For example, you could modify stabilize to contain:

    out <- basilisk::basiliskRun(
        env = spatialDE_env,
        fun = .naiveDE_stabilize,
        counts = counts,
        testload = "scipy.optimize"
    )

However, one major limitation of the fallback mechanism is that the function (in fun=) can only contain base-R or reticulate function calls. This is because the fallback mechanism uses a lightweight internal copy of R that only has reticulate installed, so literally nothing else is available - not even other functions in your own package. Thus, functions like .run_spatialDE cannot be used inside any basiliskRun calls that might use the fallback.

That said, .run_spatialDE is a rather suboptimal design, as it requires basilisk to potentially spin up a new process for each internal call to .naiveDE_stabilize, .naiveDE_regress_out and .spatialDE_run. A better approach would be something like:

.naiveDE_stabilize2 <- function(proc, counts) {
    basiliskRun(proc, function(counts, store) {
        naiveDE <- import("NaiveDE")

        ## NaiveDE.stabilize requires data.frame input to work
        df_py <- r_to_py(as.data.frame(counts))

       store$stabilized <- naiveDE$stabilize(df_py)
       invisible(NULL)
    }, counts=counts, persist=TRUE)
}

.naiveDE_regress_out2 <- function(proc, sample_info) {
    basiliskRun(proc, function(sample_info, store) {
        naiveDE <- import("NaiveDE")

        sample_info_py <- r_to_py(sample_info)

        store$res <- naiveDE$regress_out(sample_info_py, store$stabilized, "np.log(total_counts)")

        invisible(NULL)
    }, sample_info=sample_info, persist=TRUE)
}

which allows you to do:

proc <- basiliskStart(spatialDE_env, testload="scipy.optimize")
# on.exit(basiliskStop(proc)) # run inside a function
.naiveDE_stabilize2(proc, counts)
.naiveDE_regress_out2(proc, sample_info)

# If you want to get any of the stored results, you could do:
basiliskRun(proc, function(store) {
    as.matrix(store$stabilized)
}, persist=TRUE)

You can see that we re-use the same process for multiple calls to Python within the SpatialDE workflow, which is more efficient than spinning up a new process and transferring data to/from it for every step. This approach also ensures that we only use base R or reticulate functions within each basiliskRun invocation, thus staying within the fallback's constraints.

davidecrs commented 2 years ago

Thanks @LTLA,

I will try to update the package with these changes.

davidecrs commented 2 years ago

Hi @LTLA,

I'm trying to update the package with the suggested fixes, but I get an error. Where am I doing wrong? This minimal example can reproduce it

library(spatialDE)
library(basilisk)
library(SpatialExperiment)

spatialDE_dependencies <- c(
    "python==3.8.5",
    "pandas==0.25.3",
    "patsy==0.5.1",
    "pip==21.0.1"
)

spatialDE_env <- BasiliskEnvironment(
    envname = "env", pkgname = "spatialDE",
    packages = spatialDE_dependencies,
    pip = c("SpatialDE==1.1.3", "NaiveDE==1.2.0")
)

naiveDE_stabilize <- function(proc, counts) {
    basiliskRun(proc, function(counts, store) {
        naiveDE <- import("NaiveDE")

        ## NaiveDE.stabilize requires data.frame input to work
        df_py <- r_to_py(as.data.frame(counts))

        store$stabilized <- naiveDE$stabilize(df_py)
        invisible(NULL)
    }, counts=counts, persist=TRUE)
}

spe <- mockSVG(return_SPE=T)
spatialCoordsNames(spe) <- c("x", "y")
coordinates <- as.data.frame(spatialCoords(spe))

x <- SummarizedExperiment::assay(spe, "counts")

proc <- basiliskStart(spatialDE_env, testload="scipy.optimize")
naiveDE_stabilize(proc, x)

This is the error

Error in checkForRemoteErrors(lapply(cl, recvResult)) : 
  one node produced an error: unused argument (counts = base::quote(c(24, 32, 29, 29, 23, 24, 41, 22, 55, 20, 23, 20, 20, 21, 25, 23, 21, 21, 22, 24, 21, 33, 25, 24, 41, 20, 24, 20, 33, 21, 40, 26, 30, 24, 43, 20, 38, 23, 33, 20, 24, 21, 21, 20, 26, 27, 39, 23, 22, 35, 36, 20, 20, 32, 61, 20, 25, 32, 23, 50, 21, 23, 26, 38, 22, 33, 22, 20, 24, 49, 30, 26, 21, 30, 43, 21, 21, 36, 20, 37, 24, 20, 22, 22, 20, 34, 20, 27, 20, 22, 26, 22, 24, 33, 21, 22, 29, 35, 48, 28, 0, 9, 1, 0, 16, 8, 11, 0, 25, 1, 12, 6, 23, 5, 28, 2, 11, 0, 3, 0, 1, 
8, 8, 20, 1, 16, 0, 21, 3, 1, 16, 26, 7, 27, 20, 7, 4, 4, 3, 13, 12, 0, 4, 3, 17, 3, 17, 12, 2, 2, 4, 13, 2, 37, 1, 1, 2, 3, 9, 5, 1, 3, 8, 10, 6, 11, 0, 3, 7, 27, 4, 8, 30, 0, 9, 1, 0, 3, 0, 0, 7, 1, 2, 0, 1, 6, 0, 2, 1, 2, 22, 2, 6, 4, 3, 0, 1, 6, 31, 1, 4, 1, 5, 32, 1, 19, 38, 1, 37, 1, 0, 7, 9, 1, 21, 2, 0, 5, 10, 8, 6, 2, 4, 1, 0, 2, 22, 6, 2, 0, 2, 32, 11, 3, 3, 0, 1, 10, 3, 3, 0, 2, 1, 4, 11, 5, 19, 0, 2, 9, 0, 1, 23, 10, 1, 4, 0, 19, 6, 0, 
This is the session info ```r > sessionInfo() R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=it_IT.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] reticulate_1.26 SpatialExperiment_1.7.2 SingleCellExperiment_1.19.0 [4] SummarizedExperiment_1.27.2 Biobase_2.57.1 GenomicRanges_1.49.1 [7] GenomeInfoDb_1.33.7 IRanges_2.31.2 S4Vectors_0.35.3 [10] BiocGenerics_0.43.3 MatrixGenerics_1.9.1 matrixStats_0.62.0 [13] basilisk_1.9.6 spatialDE_1.3.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.9 locfit_1.5-9.6 here_1.0.1 [4] dir.expiry_1.5.1 lattice_0.20-45 png_0.1-7 [7] rprojroot_2.0.3 assertthat_0.2.1 utf8_1.2.2 [10] R6_2.5.1 backports_1.4.1 ggplot2_3.3.6 [13] pillar_1.8.1 sparseMatrixStats_1.9.0 zlibbioc_1.43.0 [16] rlang_1.0.5 rstudioapi_0.14 magick_2.7.3 [19] R.utils_2.12.0 R.oo_1.25.0 Matrix_1.5-0 [22] checkmate_2.1.0 BiocParallel_1.31.12 RCurl_1.98-1.8 [25] munsell_0.5.0 beachmat_2.13.4 DelayedArray_0.23.1 [28] HDF5Array_1.25.2 compiler_4.2.1 DropletUtils_1.17.1 [31] pkgconfig_2.0.3 tidyselect_1.1.2 gridExtra_2.3 [34] tibble_3.1.8 GenomeInfoDbData_1.2.8 edgeR_3.39.6 [37] codetools_0.2-18 fansi_1.0.3 dplyr_1.0.10 [40] basilisk.utils_1.9.3 rhdf5filters_1.9.0 bitops_1.0-7 [43] R.methodsS3_1.8.2 grid_4.2.1 jsonlite_1.8.0 [46] gtable_0.3.1 lifecycle_1.0.2 DBI_1.1.3 [49] magrittr_2.0.3 scales_1.2.1 dqrng_0.3.0 [52] cli_3.4.0 scuttle_1.7.4 XVector_0.37.1 [55] limma_3.53.6 filelock_1.0.2 DelayedMatrixStats_1.19.0 [58] generics_0.1.3 vctrs_0.4.1 rjson_0.2.21 [61] Rhdf5lib_1.19.2 tools_4.2.1 glue_1.6.2 [64] purrr_0.3.4 parallel_4.2.1 yaml_2.3.5 [67] colorspace_2.0-3 rhdf5_2.41.1 ```
LTLA commented 2 years ago

Oops. Hopefully 1.9.10 will fix this. Try installing basilisk from GitHub if you want it sooner.