Bioconductor / basilisk

Clone of the Bioconductor repository for the basilisk package.
https://bioconductor.org/packages/devel/bioc/html/basilisk.html
GNU General Public License v3.0
27 stars 14 forks source link

Feasibility check #1

Closed LTLA closed 4 years ago

LTLA commented 4 years ago

Tagging @hpages and @grimbough.

The idea is to package up a self-contained instance of Python for distribution via the R package management system. This aims to control the version of both Python and all of its modules for a given Bioconductor release, thus allowing BioC packages to reliably deploy with Python code.

The current version of this package works well enough on Unix, provided libffi is available. The tricky bit is what to do with the pre-built binaries for macs and windows. The Python executable can either be linked to a shared library or a static library, and while the latter is more obviously portable, the former is necessary for reticulate to work at all. And we really want reticulate to work here, otherwise no one's going to get on board with this solution.

So, this means that we need to distribute the compiled shared library along with the executable, which is not great. However, R already faces (and solves) this challenge when it distributes pre-built binaries of any package with native code, as these also contain shared libraries. Casual inspection of those libraries indicate that they have hard-coded paths like:

scran.so:
    scran.so (compatibility version 0.0.0, current version 0.0.0)
    /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib (compatibility version 3.6.0, current version 3.6.1)
    /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.dylib (compatibility version 0.0.0, current version 0.0.0)
    /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
    /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1226.10.1)
    /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libR.dylib (compatibility version 3.6.0, current version 3.6.1)
    /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1259.22.0)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)

So in principle, we could just piggy-back off that by hard-coding similar paths into the Python shared library (which would also be dumped somewhere in the basilisk installation directory, so it wouldn't be any different from what we see above).

I also had some difficulties linking to OpenSSL on my mac, but I don't know whether that would be a more general problem or just something to do with my system. Looking at the openssl configure file suggests that some fiddling is required there.

hpages commented 4 years ago

It failed to install for me. Showing only the truncated output of R CMD INSTALL basilisk:

First 55 lines:

* installing to library ‘/home/hpages/R/R-3.6.r76454/library’
* installing *source* package ‘basilisk’ ...
** using non-staged installation via StagedInstall field
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
downloading Python 3.7.0...
--2019-10-08 18:00:11--  https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz
Resolving www.python.org (www.python.org)... 151.101.52.223, 2a04:4e42:d::223
Connecting to www.python.org (www.python.org)|151.101.52.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22745726 (22M) [application/octet-stream]
Saving to: 'Python-3.7.0.tgz'

Python-3.7.0.tgz    100%[===================>]  21.69M  3.46MB/s    in 6.4s    

2019-10-08 18:00:17 (3.41 MB/s) - 'Python-3.7.0.tgz' saved [22745726/22745726]

building Python...
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking for python3.7... no
checking for python3... python3
checking for --enable-universalsdk... no
checking for --with-universal-archs... no
checking MACHDEP... checking for --without-gcc... no
checking for --with-icc... no
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for a sed that does not truncate output... /bin/sed
checking for --with-cxx-main=<compiler>... no
checking for g++... no
configure:

  By default, distutils will build C++ extension modules with "g++".
  If this is not intended, then set CXX on the configure command line.

checking for the platform triplet based on compiler characteristics... x86_64-linux-gnu
checking for -Wl,--no-as-needed... yes
checking for egrep... /bin/grep -E
...

Last 65 lines:

...
rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin/idle3
(cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin; ln -s idle3.7 idle3)
rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin/pydoc3
(cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin; ln -s pydoc3.7 pydoc3)
rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin/2to3
(cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin; ln -s 2to3-3.7 2to3)
rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin/pyvenv
(cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin; ln -s pyvenv-3.7 pyvenv)
if test "x" != "x" ; then \
    rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin/python3-32; \
    (cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/bin; ln -s python3.7-32 python3-32) \
fi
rm -f /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/share/man/man1/python3.1
(cd /home/hpages/R/R-3.6.r76454/library/basilisk/inst/python/share/man/man1; ln -s python3.7.1 python3.1)
if test "xupgrade" != "xno"  ; then \
    case upgrade in \
        upgrade) ensurepip="--upgrade" ;; \
        install|*) ensurepip="" ;; \
    esac; \
    LD_LIBRARY_PATH=/home/hpages/github/LTLA/basilisk/src/Python-3.7.0:/home/hpages/R/R-3.6.r76454/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:/home/hpages/R/R-3.6.r76454/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server ./python -E -m ensurepip \
        $ensurepip --root=/ ; \
fi
Traceback (most recent call last):
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/ensurepip/__main__.py", line 5, in <module>
    sys.exit(ensurepip._main())
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/ensurepip/__init__.py", line 204, in _main
    default_pip=args.default_pip,
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/ensurepip/__init__.py", line 117, in _bootstrap
    return _run_pip(args + [p[0] for p in _PROJECTS], additional_paths)
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/ensurepip/__init__.py", line 27, in _run_pip
    import pip._internal
  File "/tmp/tmpfmgyofri/pip-10.0.1-py2.py3-none-any.whl/pip/_internal/__init__.py", line 42, in <module>
  File "/tmp/tmpfmgyofri/pip-10.0.1-py2.py3-none-any.whl/pip/_internal/cmdoptions.py", line 16, in <module>
  File "/tmp/tmpfmgyofri/pip-10.0.1-py2.py3-none-any.whl/pip/_internal/index.py", line 25, in <module>
  File "/tmp/tmpfmgyofri/pip-10.0.1-py2.py3-none-any.whl/pip/_internal/download.py", line 39, in <module>
  File "/tmp/tmpfmgyofri/pip-10.0.1-py2.py3-none-any.whl/pip/_internal/utils/glibc.py", line 3, in <module>
  File "/home/hpages/github/LTLA/basilisk/src/Python-3.7.0/Lib/ctypes/__init__.py", line 7, in <module>
    from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
Makefile:1122: recipe for target 'install' failed
make[1]: *** [install] Error 1
make[1]: Leaving directory '/home/hpages/github/LTLA/basilisk/src/Python-3.7.0'
Makevars:6: recipe for target 'copying' failed
make: *** [copying] Error 2
ERROR: compilation failed for package ‘basilisk’
* removing ‘/home/hpages/R/R-3.6.r76454/library/basilisk’
LTLA commented 4 years ago

That seems to be the libffi dependency that I was referring to. I ran into it when I started testing this out, just apt-get install libffi-dev (I think, I can't quite remember).

hpages commented 4 years ago

Thanks, I got it to compile/install on my laptop (Ubuntu 16.04 LTS). Took only 2m18s. Nice.

On Mac OS X (merida1, El Capitan), it fails early with:

merida1:sandbox biocbuild$ R CMD INSTALL basilisk
* installing to library ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library’
* installing *source* package ‘basilisk’ ...
** using non-staged installation via StagedInstall field
checking for gcc... clang
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether clang accepts -g... yes
checking for clang option to accept ISO C89... none needed
downloading Python 3.7.0...
--2019-10-09 00:12:31--  https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz
Resolving www.python.org... 151.101.204.223, 2a04:4e42:30::223
Connecting to www.python.org|151.101.204.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22745726 (22M) [application/octet-stream]
Saving to: ‘Python-3.7.0.tgz’

Python-3.7.0.tgz    100%[===================>]  21.69M  93.0MB/s    in 0.2s    

2019-10-09 00:12:32 (93.0 MB/s) - ‘Python-3.7.0.tgz’ saved [22745726/22745726]

building Python...
checking build system type... x86_64-apple-darwin15.6.0
checking host system type... x86_64-apple-darwin15.6.0
checking for python3.7... python3.7
checking for --enable-universalsdk... no
checking for --with-universal-archs... no
checking MACHDEP... checking for --without-gcc... no
checking for --with-icc... no
checking for gcc... clang
checking whether the C compiler works... no
configure: error: in `/Users/biocbuild/sandbox/basilisk/src/Python-3.7.0':
configure: error: C compiler cannot create executables
See `config.log' for more details
make: *** No targets specified and no makefile found.  Stop.
configure:    PYTHON_DIR=Python-3.7.0
configure:    USER_INST_DIR=/Library/Frameworks/R.framework/Versions/3.6/Resources/library/basilisk/inst/python
configure: creating ./config.status
config.status: creating src/Makevars
** libs
mkdir -p "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/basilisk/inst/python"
cd Python-3.7.0 && make install 
make[1]: *** No rule to make target `install'.  Stop.
make: *** [copying] Error 2
ERROR: compilation failed for package ‘basilisk’
* removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/basilisk’
grimbough commented 4 years ago

It does not get very far in Windows at the moment:

* installing to library 'C:/Users/Mike Smith/Documents/R/win-library/3.6'
* installing *source* package 'basilisk' ...
** using non-staged installation via StagedInstall field

   **********************************************
   WARNING: this package has a configure script
         It probably needs manual configuration
   **********************************************

** libs
no DLL was created
ERROR: compilation failed for package 'basilisk'
* removing 'C:/Users/Mike Smith/Documents/R/win-library/3.6/basilisk'

Exited with status 1.
LTLA commented 4 years ago

Thanks guys.

@hpages I had previously modified the build set-up for Macs, but forgot to commit it... here it is, on the macbuild branch. I also had to personally tell it where to find openssl by modifying Makevars.in:

                ./configure --enable-shared=yes CC=${CC} CFLAGS="${CFLAGS}" \
                        --prefix="${USER_INST_DIR}" \
                       --exec-prefix="${USER_INST_DIR}" \
                       --with-openssl="/usr/local/opt/openssl"

I don't know whether that's a more general problem. I presume it could be solved with some autoconf-fu. I also want to see if macbuild works on linux as well, that would simplify things.

@grimbough We probably need to write a Makevars.win.

hpages commented 4 years ago

Thanks Aaron. I'll try on merida1 again today and will let you know.

LTLA commented 4 years ago

Might want to hold off for a bit; I modified the build last night so that it would work on my linux machine at home, so now I have to check again that it still works on my mac at work!

Edit: Okay, the good news is that master now works on my mac and my linux machine. As in, you can install basilisk and you can load Python from within R without problems.

The bad news is that the Mac install fails to find an appropriate version of OpenSSL, which means that the _ssl_ module doesn't build, which means that you can't download any packages from PyPi, which is a deal breaker. The nuclear option is to package a compiled libopenssl inside basilisk for Python to link to; is there a more elegant solution?

hpages commented 4 years ago

I'll look at this. I vaguely remember the OpenSSL thing being an issue at some point on our Mac builders. Can't remember what it was exactly and how we addressed it. Will need to dig around.

LTLA commented 4 years ago

Right. So to sum up the current state of play, you can (on Linux):

  1. Install basilisk.
  2. Go to inst and install son.of.basilisk.
  3. Run:
library(son.of.basilisk)
test()

I've switched to using virtual environments, which means that each Bioc package can define its own desired suite of Python package versions to use. Especially cool is the fact that you can define multiple environments within and across R packages, so you can have one R package that can use functionality from mutually incompatible versions of Python packages!

The Mac version is currently broken because of OpenSSL. Hoping @hpages has some ideas here.

I have no ideas on what to do with Windows. @grimbough seems to have stuffed the static libraries for Rhdf5lib in the Git repository itself. Was this the only way?

grimbough commented 4 years ago

If you can build the binaries using the standard R Windows toolchain then I think it's fine to do that via Makevars.win. I couldn't reliably get libhdf5 to build on Windows using only the autoconf/Make approach, I had to use CMake & some custom edits to the source, so I opted for just distributing the binaries after a manual build.

I think there's a middle ground where the BioC build system has CMake installed, so you can compile there, but not expect most users to be able to 'install-from-source'.

I'll see how far it gets if I just copy the commands into Makevars.win.

grimbough commented 4 years ago

I've spent quite a while trying to get package to download and install the embeded version of Python, since that's just a zip file with pre-compiled Python.exe inside.

However I've hit trouble installing support for virtualenv and now realised that the embeded version inclulde the text "Using pip to manage dependencies as for a regular Python installation is not supported with this distribution"

Seems like this a bit of a deal breaker to this approach, but I don't use python often enough to say for certain. Any thoughts?


Also, I think https://github.com/LTLA/basilisk/blob/8819464e47044e8c424691fb7207e21d0e76769e/R/useVirtualEnv.R#L81 should have .libPaths()[1] or similar, otherwise it throws an error when you have multiple library paths.

LTLA commented 4 years ago

I wonder whether the miniconda stuff in the linked issue above might help here. We just rely on miniconda (via rminiconda or otherwise) to give us a usable Python installation on all systems, which would avoid the need for having to package it ourselves. It seems to work for me by using:

 useBiocPython <- function() {
     use_python(rminiconda::find_miniconda_python(), required=TRUE)
 }

If this also works on Windows, it might be the most expedient solution.

LTLA commented 4 years ago

The miniconda branch has my best shot at a configure.win, which piggy-backs off miniconda to manage the installation.

hpages commented 4 years ago

Sorry for the slow answer. I can reproduce the SSL error on merida1 (El Capitan). When installing basilisk with R CMD INSTALL basilisk, the configure script displays:

...
checking whether compiling and linking against OpenSSL works... no
...

even though we have a working installation of OpenSSL on the machine (that we installed with brew install openssl). Interestingly we also used brew to install the system wide Python3 on this machine and this also installed the ssl module so it seems that at least the brew formula managed to find the brew-installed openssl. FWIW the formula is here.

The libssl shipped with El Capitan is v 0.9 and is in /usr/lib/. The brew-installed one is v 1.0 and is in /usr/local/Cellar/openssl/1.0.2r/lib/.

Note that Python's configure script lets the user specify the root of the OpenSSL directory via the --with-openssl option so I was hoping I'd be able to point basilisk's configure script to the brew-installed openssl with something like R CMD INSTALL basilisk --configure-args='--with-openssl=/usr/local/Cellar/openssl/1.0.2r' but that doesn't seem to work (I get configure: WARNING: unrecognized options: --with-openssl). Maybe basilisk's configure script could forward its arguments to Python's configure script? Still won't be entirely satisfying though if the ultimate goal is to provide a just-work/hassle-free experience.

Otherwise, yes, one option would be to include your own libssl in basilisk or to wrap libssl in an Rssllib package (CRAN package openssl doesn't include libssl so cannot be used for that).

H.

LTLA commented 4 years ago

Yeah, the configure script failed to find my brew-installed OpenSSL as well. Don't know why.

Anyway, the way ahead might be to just use Miniconda to deliver a Python installation + all of its libraries into an R-controlled directory that is isolated from the system Python. The miniconda branch works on my Linux and Mac machines, waiting to see if it plays nice on Windows as well.

hpages commented 4 years ago

I went on merida1 again to test the miniconda branch of basilisk and everything seemed to work as expected (I was able to install son.of.basilisk and to run son.of.basilisk::test().)

However, on my laptop (Ubuntu 16.04 LTS), even though I can install the miniconda branch of basilisk, installing son.of.basilisk then fails:

hpages@spectre:~/github/LTLA/basilisk/inst$ R CMD INSTALL example
* installing to library ‘/home/hpages/R/R-3.6.r76454/library’
* installing *source* package ‘son.of.basilisk’ ...
** using non-staged installation via StagedInstall field
Creating virtual environment 'env1' ...
Using python: /usr/bin/python2.7
Running virtualenv with interpreter /usr/bin/python2
New python executable in /home/hpages/R/R-3.6.r76454/library/son.of.basilisk/inst/basilisk/env1/bin/python2
Also creating executable in /home/hpages/R/R-3.6.r76454/library/son.of.basilisk/inst/basilisk/env1/bin/python
...
...
ERROR: Could not find a version that satisfies the requirement scikit-learn==0.21.0 (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2)
ERROR: No matching distribution found for scikit-learn==0.21.0
Error: Error installing package(s): 'scikit-learn==0.21.0'
Execution halted
ERROR: configuration failed for package ‘son.of.basilisk’
* removing ‘/home/hpages/R/R-3.6.r76454/library/son.of.basilisk’

Looks like it's picking up the wrong Python.

Also I should have mentioned that rtracklayer does find the brew-installed openssl (see here) so no reason a priori the master branch of basilisk could not.

LTLA commented 4 years ago

However, on my laptop (Ubuntu 16.04 LTS), even though I can install the miniconda branch of basilisk, installing son.of.basilisk then fails:

Huh. That's weird. What happens if you do something like:

library(basilisk)
useBiocPython()
reticulate::py_config()

? I get:

python:         /Users/luna/Software/R/R-3-6-branch-dev/library/basilisk/inst/miniconda/bin/python3
libpython:      /Users/luna/Software/R/R-3-6-branch-dev/library/basilisk/inst/miniconda/lib/libpython3.7m.dylib
pythonhome:     /Users/luna/Software/R/R-3-6-branch-dev/library/basilisk/inst/miniconda:/Users/luna/Software/R/R-3-6-branch-dev/library/basilisk/inst/miniconda
version:        3.7.3 (default, Mar 27 2019, 16:54:48)  [Clang 4.0.1 (tags/RELEASE_401/final)]
numpy:           [NOT FOUND]

NOTE: Python version was forced by use_python function

Maybe I should explicitly set the path in virtualenv_create...

LTLA commented 4 years ago

Right, 862c098860584660ff59f583c43635728cef348b contains an explicitly set python= during creation of the virtual environments, so now son.of.basilisk should have no excuse for not working...

hpages commented 4 years ago

So before or after commit 862c0988 reticulate::py_config() is still picking the wrong Python for me:

library(basilisk)
useBiocPython()
reticulate::py_config()
# python:         /usr/bin/python
# libpython:      /usr/lib/python2.7/config-x86_64-linux-gnu/libpython2.7.so
# pythonhome:     /usr:/usr
# version:        2.7.12 (default, Oct  8 2019, 14:14:10)  [GCC 5.4.0 20160609]
# numpy:          /usr/lib/python2.7/dist-packages/numpy
# numpy_version:  1.11.0
#
# NOTE: Python version was forced by RETICULATE_PYTHON
# Warning message:
# Python '/home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda/bin/python3' was requested
# but '/usr/bin/python' was loaded instead (see reticulate::py_config() for more information) 

Looks like my problem is that I had the RETICULATE_PYTHON environment variable set to /usr/bin/python (must be from some old experimentation with reticulate, didn't even remember that). Not clear to me why use_python() would give precedence to the variable over the path explicitly passed to it though. Doesn't make a lot of sense to me.

Anyway, the following (somewhat hacky) change seems to help in my case:

hpages@spectre:~/github/LTLA/basilisk$ git diff
diff --git a/R/useBiocPython.R b/R/useBiocPython.R
index dc32d69..f071f3f 100644
--- a/R/useBiocPython.R
+++ b/R/useBiocPython.R
@@ -19,5 +19,6 @@
 #' @export
 #' @importFrom reticulate use_python
 useBiocPython <- function() {
+    Sys.unsetenv("RETICULATE_PYTHON")
     use_python(system.file("inst", "miniconda", "bin", "python3", package="basilisk", mustWork=TRUE), required=TRUE)
 }

Then:

library(basilisk)
useBiocPython()
reticulate::py_config()
# python:         /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda/bin/python3
# libpython:      /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda/lib/libpython3.7m.so
# pythonhome:     /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda:/home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda
# version:        3.7.3 (default, Mar 27 2019, 22:11:17)  [GCC 7.3.0]
# numpy:           [NOT FOUND]
#
# NOTE: Python version was forced by use_python function

So far so good.

Then I can install son.of.basilisk but then:

library(son.of.basilisk)
test()
# Error: callr subprocess failed: ImportError: No module named pandas

Not sure what to do next.

LTLA commented 4 years ago

Was there anything interesting in the son.of.basilisk installation logs? The virtual environments are currently created upon installation, so any error messages would have occurred there.

hpages commented 4 years ago

Full log:

hpages@spectre:~/github/LTLA/basilisk/inst$ R CMD INSTALL example
* installing to library ‘/home/hpages/R/R-3.6.r76454/library’
* installing *source* package ‘son.of.basilisk’ ...
** using non-staged installation via StagedInstall field
Creating virtual environment 'env1' ...
Using python: /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda/bin/python3
Collecting pip
  Using cached https://files.pythonhosted.org/packages/4a/08/6ca123073af4ebc4c5488a5bc8a010ac57aa39ce4d3c8a931ad504de4185/pip-19.3-py2.py3-none-any.whl
Collecting wheel
  Using cached https://files.pythonhosted.org/packages/00/83/b4a77d044e78ad1a45610eb88f745be2fd2c6d658f9798a15e384b7d57c9/wheel-0.33.6-py2.py3-none-any.whl
Collecting setuptools
  Using cached https://files.pythonhosted.org/packages/6a/9a/50fadfd53ec909e4399b67c74cc7f4e883488035cfcdb90b685758fa8b34/setuptools-41.4.0-py2.py3-none-any.whl
Installing collected packages: pip, wheel, setuptools
Successfully installed pip-19.3 setuptools-41.4.0 wheel-0.33.6
Using virtual environment 'env1' ...
Collecting pandas==0.25.1
  Using cached https://files.pythonhosted.org/packages/7e/ab/ea76361f9d3e732e114adcd801d2820d5319c23d0ac5482fa3b412db217e/pandas-0.25.1-cp37-cp37m-manylinux1_x86_64.whl
Collecting numpy>=1.13.3
  Using cached https://files.pythonhosted.org/packages/ba/e0/46e2f0540370f2661b044647fa447fef2ecbcc8f7cdb4329ca2feb03fb23/numpy-1.17.2-cp37-cp37m-manylinux1_x86_64.whl
Collecting pytz>=2017.2
  Using cached https://files.pythonhosted.org/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl
Collecting python-dateutil>=2.6.1
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting six>=1.5
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, pytz, six, python-dateutil, pandas
Successfully installed numpy-1.17.2 pandas-0.25.1 python-dateutil-2.8.0 pytz-2019.3 six-1.12.0
Creating virtual environment 'env2' ...
Using python: /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda/bin/python3
Collecting pip
  Using cached https://files.pythonhosted.org/packages/4a/08/6ca123073af4ebc4c5488a5bc8a010ac57aa39ce4d3c8a931ad504de4185/pip-19.3-py2.py3-none-any.whl
Collecting wheel
  Using cached https://files.pythonhosted.org/packages/00/83/b4a77d044e78ad1a45610eb88f745be2fd2c6d658f9798a15e384b7d57c9/wheel-0.33.6-py2.py3-none-any.whl
Collecting setuptools
  Using cached https://files.pythonhosted.org/packages/6a/9a/50fadfd53ec909e4399b67c74cc7f4e883488035cfcdb90b685758fa8b34/setuptools-41.4.0-py2.py3-none-any.whl
Installing collected packages: pip, wheel, setuptools
Successfully installed pip-19.3 setuptools-41.4.0 wheel-0.33.6
Using virtual environment 'env2' ...
Collecting scikit-learn==0.21.0
  Using cached https://files.pythonhosted.org/packages/ed/93/d4b622a8d15c9a1bce4d50c4875c872dde534d335e574ef48a50ee999d82/scikit_learn-0.21.0-cp37-cp37m-manylinux1_x86_64.whl
Collecting numpy>=1.11.0
  Using cached https://files.pythonhosted.org/packages/ba/e0/46e2f0540370f2661b044647fa447fef2ecbcc8f7cdb4329ca2feb03fb23/numpy-1.17.2-cp37-cp37m-manylinux1_x86_64.whl
Collecting joblib>=0.11
  Using cached https://files.pythonhosted.org/packages/8f/42/155696f85f344c066e17af287359c9786b436b1bf86029bb3411283274f3/joblib-0.14.0-py2.py3-none-any.whl
Collecting scipy>=0.17.0
  Using cached https://files.pythonhosted.org/packages/94/7f/b535ec711cbcc3246abea4385d17e1b325d4c3404dd86f15fc4f3dba1dbb/scipy-1.3.1-cp37-cp37m-manylinux1_x86_64.whl
Installing collected packages: numpy, joblib, scipy, scikit-learn
Successfully installed joblib-0.14.0 numpy-1.17.2 scikit-learn-0.21.0 scipy-1.3.1
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (son.of.basilisk)
LTLA commented 4 years ago

I bet that, in the new R session created by callr, the RETICULATE_PYTHON variable is still set, resulting in a reversion to the same error that we have already observed. Could you try adding a Sys.unsetenv call in useVirtualEnv() and see if that helps?

(You might ask why we create a new R session to call Python. Recall that reticulate only supports one version of Python in a single R session, and specifically, only one version of any given module can be loaded. If two packages rely on incompatible module versions... well, that's too bad. By doing Python operations in a separate R process, we guarantee that any given package can operate successfully regardless of how other packages are behaving.)

hpages commented 4 years ago

Yup. After adding Sys.unsetenv("RETICULATE_PYTHON") to useVirtualEnv() everything seems to work as expected.

Something I didn't notice before is that during R CMD INSTALL basilisk I get:

hpages@spectre:~/github/LTLA$ R CMD INSTALL basilisk
* installing to library ‘/home/hpages/R/R-3.6.r76454/library’
* installing *source* package ‘basilisk’ ...
** using non-staged installation via StagedInstall field
trying URL 'https://repo.anaconda.com/miniconda/Miniconda3-4.7.10-Linux-x86_64.sh'
...
...
Executing transaction: - WARNING conda.core.envs_manager:register_env(46): Unable to register environment. Path not writable or missing.
  environment location: /home/hpages/R/R-3.6.r76454/library/basilisk/inst/miniconda
  registry file: /home/hpages/.conda/environments.txt
done
installation finished.
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (basilisk)

I don't have a /home/hpages/.conda/ folder (and even less a /home/hpages/.conda/environments.txt file) so indeed my registry file is "not writable or missing". Can the warning safely be ignored?

I'm leaving the computer now. Done for today.

LTLA commented 4 years ago

Can the warning safely be ignored?

I think so... I've been ignoring it. Don't know what that's about either.

LTLA commented 4 years ago

@hpages could you do me a favor and make a PR to master with your changes?

It would also be desirable to wrap it in:

old <- Sys.getenv("RETICULATE_PYTHON")
Sys.unsetenv("RETICULATE_PYTHON")
on.exit(Sys.setenv(RETICULATE_PYTHON=old))

... which I would do myself, but I don't know whether that would re-introduce the problem.

hpages commented 4 years ago

Yes it does re-introduce the problem. Looks like there are other places in your code that benefit from the fact that we've unset RETICULATE_PYTHON earlier. So there are probably a few other functions in basilisk (and maybe also in son.of.basilisk?) that would need to start with that 3-liner. I don't really have enough knowledge of the basilisk/son.of.basilisk guts to try to figure out where those places are, sorry. Do you still want a PR with just the Sys.unsetenv("RETICULATE_PYTHON") line added to useBiocPython() and useVirtualEnv()?

LTLA commented 4 years ago

I added unsetenvs all over the place, hopefully it will work better now.

hpages commented 4 years ago

works for me now

LTLA commented 4 years ago

Alright, it's showtime again. I think I addressed all the comments from #2 and now it's lean and mean on Linux and Mac. @grimbough can you try checking on Windows? It'll probably fail...

grimbough commented 4 years ago

The file name for the Windows miniconda binary needs to be changed to:

inst_file <- sprintf("Miniconda3-%s-Windows-%s.exe", version, arch)

After that building seems to work and it downloads the binary (an I presume executes it), but it doesn't seem to install in the correct place. When R tries to load the package after it's been built I get an error, which I think the pertinent bits are:

...
** testing if installed package can be loaded
adding rname 'https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Windows-x86_64.exe'
Error: package or namespace load failed for 'basilisk':
 .onLoad failed in loadNamespace() for 'basilisk', details:
  call: system.file(.mc_dir, package = "basilisk", mustWork = TRUE)
  error: no file found
Error: loading failed
Execution halted
...
LTLA commented 4 years ago

Thanks Mike. I changed the installer name but I don't know why it's not installing in the right place. Do you know where it's going instead? What's the value of inst_args on your machine?

LTLA commented 4 years ago

@hpages I forsee some problems with the production of the binaries for basilisk and its client packages. If R literally tar -czf's the installation directory, it will include the constructed miniconda installation and/or its virtual environment. This causes a number of problems:

Is there a way to tell R to exclude certain directories from the binary? I never build binaries, so I don't know if .Rbuildignore would work for this purpose.

grimbough commented 4 years ago

Thanks Mike. I changed the installer name but I don't know why it's not installing in the right place. Do you know where it's going instead? What's the value of inst_args on your machine?

I think the installation was failing because system2() required paths to have the correct path separators e.g. c:\\foo rather than c:/foo. It seems to run the installation after that.

However I then run into issues with pip.exe not being found in the expected place when calling virtualenv_create(). I think this is more a reticulate issue, but I need to spend more time digging to understand where the incorrect paths are being constructed.

LTLA commented 4 years ago

Does it just need \\ for paths supplied as arguments, or does it need it for the executable as well?

reticulate does a lot of system2 internally. In fact, I notice that they deliberately normalize their paths to use / on Windows (see reticulate:::virtualenv_default_python, for example). This may not be conducive to correct function if system2 is expecting backslashes.

hpages commented 4 years ago

Hi Aaron,

Don't know if there is a way to tell R to exclude certain directories from the binary. Never had this need before

It's easy to build a package binary on Linux:

R CMD INSTALL --build <pkg>

<pkg> can be a source tarball or a package source tree. It's basically the same command on Windows except that you need to add the --merge-multiarch flag (the flag is only effective for packages with native code but it doesn't hurt to always use it).

On Mac things are more complicated, especially if the package contains native code. We use the following script on our Mac builders: https://github.com/Bioconductor/BBS/blob/master/utils/build-universal.sh The script is a slightly modified version of a script that Simon Urbanek uses to produce the Mac package binaries on CRAN.

You can see the exact command we use on the Windows and Mac builders here:

https://bioconductor.org/checkResults/3.11/bioc-LATEST/beachmat/tokay2-buildbin.html https://bioconductor.org/checkResults/3.11/bioc-LATEST/beachmat/celaya2-buildbin.html

I wonder about having basilisk's clients installing stuff under basilisk's installation folder. Would this happen during the installation of the client or later when code in the client actually needs to access some Python modules? It's probably fine if it's the former (granted concurrent attempts to install things are handled gracefully e.g. in the context of parallel installation) but the latter could be problematic. I've always seen package installation directories treated as read-only places (except during the installation process itself of course). FWIW it's a big "NO" during package review if we see that a package is writing to its own installation directory. Would be an even bigger "NOOOOO" if it were writing to the installation directory of another package. The CRAN people sometimes put the package library folder on a read-only partition to run the CRAN checks just to catch those packages that try to write there. We've been considering doing the same thing for a while.

LTLA commented 4 years ago

Would be an even bigger "NOOOOO" if it were writing to the installation directory of another package.

Oh, you definitely won't like how basilisk works now, then.

I had initially stuffed things in the installation directory because I thought a frozen Python instance would be added during package installation, and that it would never change, much like a native library. Then things got out of hand with lazy installation, etc.

I could probably switch this easily to use rappdirs, which would (i) avoid packaging the Python instance or venvs into the tarball, and (ii) avoid problems with the deletion of lazy history when basilisk is reinstalled but its clients are not.

However, that introduces its own sequence of headaches related to critical content being in a separate directory, e.g., whether the selected directory for the installing process is the same as the selected directory for the process that ultimately uses the package. It's not clear that it would be, so I'd need to add a level of redirection to ensure I get the correct path.

vjcitn commented 4 years ago

Does it make any sense to manage this process with BiocFileCache? It seems to boil down to the placement of software components which are files, and management of paths related to those files.

hpages commented 4 years ago

Alternatively the simple/straightforward way to go is to have basilisk install the "core" Python modules and to NOT delay that. 624M of stuff is not a big deal (many BSgenome data packages are bigger than that) even if it might grow to 1G in the future. This would stick to the "installation folders are treated as read-only" paradigm and thus would avoid many potential complications. Sounds like a KISS situation to me.

hpages commented 4 years ago

OTOH if all this stuff is not going to be bundled in the binary then downloading/installing this stuff cannot be part of the basilisk installation procedure (installation of a binary package is basically just extracting a tarball or zip file at the right location, no post-install hook is triggered AFAIK). Which means that it would need to happen later (but not at load-time please). Which means it should go to a place that is not under the package installation folder (to stick to the "installation folders are treated as read-only" paradigm). All this is if all the stuffing (miniconda + core modules) is not going to be bundled in the binary.

LTLA commented 4 years ago

Does it make any sense to manage this process with BiocFileCache?

Not really, I need a directory rather than paths to individual files. I don't think BFC handles that.

Alternatively the simple/straightforward way to go is to have basilisk install the "core" Python modules and to NOT delay that.

This could work, but tar gives me endless complaints that "storing paths of more than 100 bytes is not portable". I don't know how much of a real issue this might be. I'm also uncertain of how reliably tar will package up symbolic links in client packages (very important that they are symbolic and not the file copies, otherwise the core packages will not be correctly found).

hpages commented 4 years ago

Yep I have these "storing paths of more than 100 bytes is not portable" warnings too: https://bioconductor.org/checkResults/3.11/bioc-LATEST/BSgenome/malbec2-buildsrc.html AFAIK they've never caused problems (sounds like an overzealous warning from utils::tar, the R builtin tar, because I don't get the warning when I use the Unix tar command in a terminal).

LTLA commented 4 years ago

Switching to an Anaconda installer is very satisfactory but R CMD INSTALL --build has been printing out these warnings... for an hour. Literally, it's just been spending all this time printing out warnings. Manually wrapping the basilisk installation directory into a tarball is done in 2-3 minutes but gives me a 1.2 GB monster, which I suppose is not too bad...

If we can't turn off the warnings, it's not going to build in a reasonable time, which would require us to use an external directory for the files. This could follow a BiocFileCache-like model where client packages set up a basilisk instance at run time, and the set-up is a no-op if an instance is already there. But on servers, every individual user will be pulling down their own Anaconda instance rather than using a centrally installed instance... and that would suck.

I guess I could add an environment variable directing basilisk to look for content in a user/sysadmin-specified location first before creating an new instance.

hpages commented 4 years ago

You can try to bring the issues with utils::tar (warnings + inefficiency) to the R folks. We still have a few months to try to get things pushed to R 4.0 before the BioC 3.11 release. A 1.2 GB download to get a fully integrated Python + core modules into the Bioconductor ecosystem without the end-user having to know anything about Python still sounds like a deal to me. I take it (especially since this also means zero Python-related installation/maintenance/headache on the build machines ;-) )

LTLA commented 4 years ago

Who's in charge of tar? Or I guess I just post on R-devel. Aw gee, I don't want to go into that shark tank.

LTLA commented 4 years ago

I notice that we could set the tar environment variable, which should be picked up by utils::tar. So perhaps if we point this to a less whiny system tar, it would solve our problems right away.

LTLA commented 4 years ago

Looks like this works. Would this be a palatable change for the BioC build system?

export R_INSTALL_TAR=tar
LTLA commented 4 years ago

Successful builds in Bioconductor/Contributions#1318 indicate that this is indeed feasible.