Bioconductor / basilisk

Clone of the Bioconductor repository for the basilisk package.
https://bioconductor.org/packages/devel/bioc/html/basilisk.html
GNU General Public License v3.0

Checking in #2

Open LTLA opened 4 years ago

LTLA commented 4 years ago

@kevinushey:

@mtmorgan suggested that you might already have some ideas/implementations for the delivery of a consistent version of Python to reticulate users. If this is possible, it would certainly be helpful for our efforts to reliably wrap and deploy Python functionality inside Bioconductor packages.

Our current approach in basilisk basically uses a sledgehammer to kill a mouse. We drag in a miniconda installation of Python into ${R_HOME}/library/basilisk/inst upon installation of basilisk, thus providing a self-contained Python installation owned by R. Each downstream BioC package that wants to use this Python instance sets up its own virtual environment upon its own installation, pulling down and installing whatever Python packages are necessary. Execution of Python code takes place in a separate R process via callr, thus allowing multiple R packages relying on incompatible versions of Python packages to be used in the same analysis session.
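For concreteness, a minimal sketch of that process-isolation pattern, assuming a hypothetical per-package virtual environment path (this is the underlying idea, not basilisk's actual API):

# Evaluate one Python call in a throwaway R process, so the Python loaded
# there cannot conflict with whatever the current session has loaded.
result <- callr::r(
    function(envpath, x) {
        reticulate::use_virtualenv(envpath, required = TRUE)
        scipy <- reticulate::import("scipy.stats")
        scipy$norm$cdf(x)
    },
    args = list(envpath = "/path/to/client/virtualenv", x = 1.96)
)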

Perhaps you might have something that's more elegant? Our main focus is on the operation of Python code within R packages, with developers being the intended audience.

kevinushey commented 4 years ago

We've actually been working on this in reticulate right now as well!

Right now, our intention is to download and install Miniconda onto the user's filesystem (using a path from rappdirs), which can then be discovered and used by reticulate as appropriate. The intention is that this Miniconda installation will serve:

  1. Users who don't want to worry about configuring Python to use a particular installation; and
  2. Package authors using reticulate, who want a well-behaved Python installation that they can develop against.
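For illustration, deriving such a per-user path might look roughly like this (a sketch; reticulate's actual default location may differ):

# A per-user, per-platform data directory from rappdirs:
miniconda_root <- rappdirs::user_data_dir("r-miniconda")
python <- file.path(miniconda_root,
    if (.Platform$OS.type == "windows") "python.exe" else "bin/python")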

The ultimate goal is to insulate the user from having to worry about managing a Python installation -- that is, R packages using reticulate should be able to behave like any other R package, without extra cognitive overhead for setup or maintenance.

We're also hoping that users will be able to work with a single, shared Python environment, with the option later of creating and using project-specific Python environments as necessary. This implies that client packages of reticulate should minimize the number of conflicts in their dependencies. (In particular, we won't support Python 2.7 in this mode.)

One concern I have with your approach is that it becomes difficult for users to switch to different versions of Python -- one is effectively stuck with the version bundled with basilisk. It also implies that any projects using 'basilisk' would become unusable if 'basilisk' were uninstalled. However, this approach may indeed be appropriate for Bioconductor, where releases are generally frozen (other than urgent bugfixes).

One other worry: if 'basilisk' were to upgrade the bundled version of Python, I think this would also imply upgrading any pre-existing virtual environments (since those are often created using symlinks back to the "real" Python installation). This may or may not be the desired behavior for users. Perhaps this is okay if Python version bumps would only occur with R version bumps as well?
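(As a quick illustration of that symlinking -- paths here are hypothetical -- you can inspect it from R:)

# On Unix, a virtualenv's interpreter is typically a symlink to the base Python:
Sys.readlink("/path/to/env/bin/python")
#> [1] "/path/to/miniconda/bin/python3.7"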

Execution of Python code takes place in a separate R process via callr, thus allowing multiple R packages relying on incompatible versions of Python packages to be used in the same analysis session.

Doesn't this imply losing the main benefits of reticulate; that is, having a persistent Python session that can be worked against? I also worry this could be quite expensive for serialization (large objects that need to be transmitted from session to session would be slow).

LTLA commented 4 years ago

Thanks, Kevin.

Right now, our intention is to download and install Miniconda onto the user's filesystem (using a path from rappdirs), which can then be discovered and used by reticulate as appropriate.

That sounds great; that would save me from having to manage that particular bit of functionality. If there are also advanced options to control the path, that would be perfect for us.

However, this approach may indeed be appropriate for Bioconductor, where releases are generally frozen (other than urgent bugfixes).

Yes, it would require a bit of discipline on our end - but we already manage coordinated versions of Bioconductor core packages across the project, so it is not much of a stretch to say that "BioC packages are strongly recommended to use this Python version".

I guess each package could also install its own Python version if it wanted, but I figured we could save a bit of space and time by having a single installation for everyone rather than dealing with twenty different miniconda installations scattered around the place.

At the time, I was also thinking that every R package would use the same Python session (and thus should use the same Python version), but this doesn't matter as much now.

Perhaps this is okay if Python version bumps would only occur with R version bumps as well?

Yes, that's right. And additionally, the Bioconductor core team has the ability to bump all dependent packages to trigger recompilation in case of updates to core packages. This is not infrequent when dealing with changes to the S4 method table, and I've had to request this a few times when updating some C++ APIs that are linked to by other packages.

Doesn't this imply losing the main benefits of reticulate; that is, having a persistent Python session that can be worked against?

Yes. I think it depends on how you view this capability. From a package developer viewpoint, I have been thinking of using Python in a manner similar to the .Call interface for C/C++ code. I don't expect any intermediate objects to persist in memory, I just want them back in R after the calculation is finished. The purpose is really just to call out to specific functions in Python modules and then do the rest of the work back in R.

I also worry this could be quite expensive for serialization (large objects that need to be transmitted from session to session would be slow).

That's a good point. I guess I should add some code that (i) checks whether a Python instance is already loaded, (ii) uses the virtual environment directly in the current process if not, but (iii) spins up a new process if the loaded instance is not compatible with the one that I want. Something like the sketch below.
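A rough sketch of that check (the helper name is illustrative, not an actual basilisk function):

# Can we load the desired environment in the current process?
can_use_in_process <- function(envpath) {
    if (!reticulate::py_available(initialize = FALSE)) {
        return(TRUE)  # no Python loaded yet, so we are free to load ours
    }
    # Otherwise, only proceed in-process if the loaded interpreter is ours:
    startsWith(reticulate::py_config()$python, envpath)
}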

kevinushey commented 4 years ago

Yes. I think it depends on how you view this capability. From a package developer viewpoint, I have been thinking of using Python in a manner similar to the .Call interface for C/C++ code. I don't expect any intermediate objects to persist in memory, I just want them back in R after the calculation is finished. The purpose is really just to call out to specific functions in Python modules and then do the rest of the work back in R.

Package authors may want to retain references to Python objects within the R session, though -- even if they only expose R objects to their users. As an example, you could imagine an R package that maintains a reference to e.g. an HTTP server implemented with Python, with the R package providing an interface to that server. In that case, the Python object would need to remain persistent.

IMHO if you build any tooling around reticulate, it should be built around the assumption that client packages of reticulate will expect Python objects to persist for as long as the R session itself does.

That said, having an API to execute some code in a reticulate session separate from the current R process could still be useful; I just don't see it being the primary way users or package authors would want to use reticulate.

LTLA commented 4 years ago

That's an interesting perspective. I hadn't realized that the Python objects were persistent like that.

If I may throw another opinion into the ring: for the packages I develop, I never assume any level of persistence between function calls. The general philosophy is to avoid side effects so as to make my life easier; the basilisk way of using Python code reflects this mentality. (I don't know whether global variables are particularly common in Python packages, but persistence could make R package debugging harder if those globals can be modified by reticulate operations in other packages.)

The practical reason for using a separate R process was to accommodate different (and possibly mutually incompatible) Python package dependencies from different R packages within a single R analysis session. This would ensure that R packages calling some Python code would work regardless of what else was going on, e.g., what other packages were run before that, or what version of Python was loaded manually by the user. If you know of a better way of doing this, then that would be great - but it seems like it would be very difficult to swap out one set of Python modules for another "on the fly", especially given that the objects need to persist in the same Python session.

And yes, it's true that basilisk won't be able to handle cases where persistence is necessary. I'll admit that's not a thing that I was considering for my packages. But I don't know of a solution that allows persistence and still guarantees that the R packages will work reliably.

Tell you what - if you think this "separate process" idea is too niche, that's fine and understandable. I can just piggy-back off your miniconda deployment code (provided I can control the installation path). basilisk will probably have to exist to standardize the Python installation across BioC packages anyway, and at least within the project, I can imagine a few governance things we can do to encourage people to use compatible Python packages - with the separate processes being a fallback.

kevinushey commented 4 years ago

I do think that client packages of reticulate will expect that a persistent Python session is available and can be accessed from the same R session.

Handling mutually incompatible Python dependencies is definitely the main challenge in this approach, but I believe it can be at least mitigated if we document for R package authors that they should aim for a common target (e.g. Python 3.6).

LTLA commented 4 years ago

Okay, thanks. I guess I was hoping for a nice solution that would guarantee that Python-dependent functions in client packages can work regardless of what has already happened in the R session... oh well. I'll ask around the BioC developers to see what they feel about the persistence vs. isolation trade-off that basilisk makes compared to raw reticulate - see if there's any appetite for this...

LTLA commented 4 years ago

Thanks to @lawremi, the persistence problem has been solved by simply setting up a separate R process. This retains the advantage of isolation, as discussed above, while allowing the process to last as long as the client package wants it to. I suspect most people will just on.exit(stopCluster(cl)), but it's possible to persist across functions if that is deemed to be important.
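A sketch of that pattern with the parallel package, as in the on.exit() idiom above (the environment path is hypothetical):

run_isolated <- function() {
    cl <- parallel::makePSOCKcluster(1)
    on.exit(parallel::stopCluster(cl))
    # First call: load Python and create some state in the worker.
    parallel::clusterCall(cl, function() {
        reticulate::use_virtualenv("/path/to/env", required = TRUE)
        reticulate::py_run_string("counter = 0")
        invisible(NULL)
    })
    # Second call: the Python state persists for as long as the worker does.
    parallel::clusterCall(cl, function() reticulate::py_eval("counter"))
}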

There is still the cost of serialization, but it should be possible to mitigate this by forking where possible. For Windows users... ¯\_(ツ)_/¯. But it'll still work.

mtmorgan commented 4 years ago

From my understanding of this discussion, and from admittedly superficial day-to-day experience with Python, my impression is that R users treat pythonic code as scripts rather than library calls, and as such persistence and interoperability (between ad hoc collections of Python modules) are important goals. I'll tag @simon-anders (if he's on github these days...)

lawremi commented 4 years ago

My understanding is that basilisk has two goals: (1) define a common Python environment for all of Bioconductor, (2) enable multiple, potentially conflicting Python environments to exist within a single R session. I think these two goals could be achieved with separate packages.

For (1), it seems that basilisk is just specifying (and providing) a single version of Python itself. Ideally it would specify (and potentially provide) a broader environment, including the versions of popular Python extension modules. This would be the official Python environment of Bioconductor and serve to minimize conflicts. This is probably just a simple wrapper around Miniconda.

For (2), there should be an abstraction layer that enables packages to transparently interact with Python (via reticulate), whether Python is running in the current session (using the official Bioconductor environment), or whether Python is running in a remote session (using an alternative, conflicting environment). This seems like a reasonable compromise that enables full compatibility with any Python code while being efficient for the common, standard case. Ideally reticulate itself would provide this flexibility.

kevinushey commented 4 years ago

From my understanding of this discussion, and from admittedly superficial day-to-day experience with Python, my impression is that R users treat pythonic code as scripts rather than library calls, and as such persistence and interoperability (between ad hoc collections of Python modules) are important goals.

That's right. At least, from the perspective of reticulate, its main features are:

  1. Python is embedded directly in your R session (it's just loaded as a library), so moving objects between R and Python is cheap and simple, compared to solutions using shared memory or interprocess communication.

  2. The Python objects are persistent, and so can be used and manipulated in exactly the same way R users would expect of R objects.

  3. reticulate provides an API that allows users to (as much as possible) treat objects as though they were regular R objects, so that you can manipulate Python objects using the same kind of code you would use for R.

I strongly agree with both of the points raised by @lawremi. I think solving (1) should be basilisk's primary goal; (2) is more of an escape hatch for cases where (1) is not sufficient.

We'll have to think about how (2) could be handled on the reticulate side, if we choose to go down that road.

LTLA commented 4 years ago

Yes, if (2) could be handled on the reticulate side, that would make life easier. In fact, if (1) and (2) are provided by reticulate, there wouldn't be any need for a dedicated package like basilisk at all, and we could just have a Python integration style guide on the BioC website somewhere.

Of course, if there is no appetite for (2), then I guess basilisk would need to exist, but that's okay; it wouldn't be much work to slap some tests and docs on it for BioC submission. I'll go and collect some feedback from developers of potential client packages before I make a go/no-go decision.

kevinushey commented 4 years ago

I think it still makes sense for (1) to be solved by a package like basilisk; especially since it takes care of making sure Python is available in the binaries that would be distributed on Windows + macOS, and also freezes the version of Python for a particular version of the package. reticulate takes a separate approach and instead installs Miniconda on demand (with user consent), and in general tries to be more flexible in terms of what particular version of Python can be used.

It is likely out-of-scope for reticulate to bundle a Python installation -- in particular, it feels unlikely that CRAN would accept the package if it did so, due to the large binary size. But I could be wrong.

mtmorgan commented 4 years ago

Why would this be a Bioconductor package? Seems like the main consequence of that would be to enable Bioc developers to go off in their own incompatible direction...

One thing I'd hope to get from this iteration is a unified solution that allows the Bioc (and CRAN?) builders to actually run Python code in the same way that we do R code during build / check. We can't do that if there is a plurality of solutions, or if solutions are intrinsically fragile or resource-intensive. I believe also that even the major Python-based CRAN packages do not evaluate Python code, which undermines a major strength of the system...

Kind of a weird discussion for an issue on a package that's still under development... ;)

lawremi commented 4 years ago

I could see a case for solution (1) to be in Bioconductor, because it would give Bioconductor control over the Python environment to which all packages must adhere. Getting the entire R/CRAN ecosystem to agree on a single environment would be a lot more difficult.

LTLA commented 4 years ago

Based on the commentary above, I have added basiliskStart(), basiliskRun() and basiliskStop() to provide an abstraction layer under which basilisk attempts to choose the best way to do things. In particular, basiliskStart() chooses between loading Python directly into the current R session, forking, or spinning up a new R process, depending on what is already loaded and what the platform supports.

Users and developers can influence these choices by directly or indirectly setting fork and global, though this should not be necessary in most cases.
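A usage sketch from the perspective of a client package (the exact signatures may differ from what is written here):

my_computation <- function(x) {
    proc <- basiliskStart("my_env")
    on.exit(basiliskStop(proc))
    basiliskRun(proc, function(x) {
        scipy <- reticulate::import("scipy.stats")
        scipy$norm$cdf(x)
    }, x = x)
}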

vjcitn commented 4 years ago

Branch vc_basilisk of https://github.com/vjcitn/BiocSklearn uses basilisk to manage Python packages related to scikit-learn. BiocSklearn's current incarnation in Bioc 3.10, which simply uses reticulate::import as needed without specifying Python package versions, may illustrate some advantages of the basilisk approach, as today's build/check outcomes for Linux and macOS are inconsistent. I reported that a drawback appears to be the installation of a complete Python runtime in the BiocSklearn installation's inst folder (~330MB)... but I may be failing to take advantage of certain symbolic link opportunities.

LTLA commented 4 years ago

Some investigation indicates the size is due to the installed packages, which get installed fresh for every virtual environment, so it's not a symlink issue. The cleanest solution is probably to have basilisk pull down some "core" packages (e.g., scipy, numpy, pandas) and allow the virtual environments to re-use them to avoid chewing up space with unnecessary copies across packages. This should be possible with ignore_installed=FALSE but I don't know if some careful PYTHONPATH setting would be required in order to find the Python packages in the basilisk installation directory.
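If that reuse doesn't happen automatically, something along these lines might be needed (paths are hypothetical; basilisk_dir is an assumed variable pointing at basilisk's Python installation):

# Make the core packages in basilisk's own Python instance visible to the
# virtual environments; the site-packages path depends on the Python version.
core_site <- file.path(basilisk_dir, "lib", "python3.7", "site-packages")
Sys.setenv(PYTHONPATH = core_site)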

LTLA commented 4 years ago

@vjcitn The proposed solution is active as of 8b319f3ffcf13b2fde89d2b5473bdc50fc7232d2 and works pretty well for me:

$ du -sh basilisk/ son.of.basilisk/
624M    basilisk/
54M     son.of.basilisk/

This shifts the content into basilisk, which should be fine if it reduces the redundancy that would otherwise occur if client packages all had their own copies of, e.g., numpy. (Per-environment copies are still permissible if necessary to resolve version conflicts; we've just switched the default for ignore_installed.)

Now we just have to decide on a set of "core" packages... currently it's scipy, numpy, pandas and matplotlib. Adding more is possible but will inflate basilisk so it had better be good.

lawremi commented 4 years ago

If we're aiming to define a common base environment for Bioconductor, the more modules+versions we can specify, the greater the compatibility between packages. That doesn't mean we have to include the modules in the download of basilisk itself; it could lazily grow its installation.

mtmorgan commented 4 years ago

The idea of a 'common base environment' converges with efforts @kevinushey is pursuing, and is of course highly desirable from a user and build system perspective.

LTLA commented 4 years ago

the greater the compatibility between packages

Well, basilisk doesn't make any statements or guarantees about Python-Python compatibility across R packages, nor do I feel that would be a common use case. However, there would be benefits from avoiding the process-based fallbacks and reducing installation size of client R packages.

That doesn't mean we have to include the modules in the download of basilisk itself; it could lazily grow its installation.

We'd have to design appropriate fallbacks for situations when the downstream processes requesting lazy installation don't have access to the basilisk installation directory. This is... possible, I guess, but it could get messy pretty quickly. We could move the miniconda installation out of basilisk's directory but that raises other questions about where this thing would live, e.g., when installed by a user vs a sysadmin on a server environment.

There is also the problem of what happens when basilisk updates, as the lazily installed Python packages will all be lost when the installation directory is wiped. You would have to correspondingly bump all downstream packages to trigger lazy reinstallation. This is less of a problem with the current setup provided the miniconda installation itself doesn't change, as the symlinks in the client packages will just keep pointing to the same place.

Seems a lot cleaner to just have basilisk pull down a fixed set to start with. I will note, though, that R CMD build seems to chase symlinks, which may or may not be healthy.

LTLA commented 4 years ago

basilisk now has support for lazy installation of "core" packages. The deal is this:

  1. When basilisk is installed, it only pulls down miniconda. It also contains an in-built list of version-pinned core packages (see inst/core_list), which are somewhat arbitrarily chosen to be reasonably up-to-date and compatible with each other. Packages on this list are not installed (yet).
  2. When a client package of basilisk is installed, it calls setupVirtualEnv to create a client-specific virtual environment. We check how many of its Python dependencies overlap with the core list, and those dependencies are installed into the basilisk Python instance. We then recreate the virtual environment to give it a chance to use the newly installed core dependencies.
  3. If the process of the client installation lacks permissions to modify the basilisk installation directory, all packages get installed into the virtual environment, which is the next-best solution.
  4. If the client's requested packages require versions of the core packages that are incompatible with those in core_list, these also get installed into the virtual environment.

The idea is to avoid redundant installations and thus decrease the disk footprint of the basilisk framework while also avoiding a large upfront cost in Python package installation. We can thus add any number of "core" packages to core_list without worrying about time/memory costs. However, this does mean that any reinstallation of basilisk requires reinstallation of all client packages, as required Python packages would otherwise be silently lost when the lazy installation history is wiped.
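The overlap check in step 2 might look roughly like this (the core_list format shown here is an assumption, not the actual file layout):

# Hypothetical: core_list as a two-column file of names and pinned versions.
core <- read.delim(system.file("core_list", package = "basilisk"),
    header = FALSE, col.names = c("package", "version"))
requested <- c(pandas = "0.25.1", `scikit-learn` = "0.21.3")
hits <- match(names(requested), core$package)
in_core <- !is.na(hits) & requested == core$version[hits]
# Packages where in_core is TRUE go into the shared basilisk instance;
# the rest are installed into the client's own virtual environment.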

For any developer of a client package (well, just @vjcitn right now), there are several major changes; again, check out inst/example for a minimal example, and read ?setupVirtualEnv for the details.

vjcitn commented 4 years ago

I wonder if there is a little hitch... if you use devtools::document in a client package, inst/basilisk will be populated with a Python environment (111MB in my case) in the source folders. You have to get rid of that before proceeding to install. Maybe my development pattern (devtools::document() inside the source tree to generate man files) is obsolete?

vjcitn commented 4 years ago

Otherwise this new approach works for the code at https://github.com/vjcitn/BiocSklearn/tree/vc_basilisk

LTLA commented 4 years ago

Maybe my development pattern (devtools::document() inside the source tree to generate man files) is obsolete?

I see this too. Don't know why it happens... guess it's due to some document() magic. The best solution is to add inst/basilisk/ to your .gitignore and .Rbuildignore for the time being.
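From the package root, one way to add both entries (using usethis, if you have it):

usethis::use_git_ignore("inst/basilisk")
usethis::use_build_ignore("inst/basilisk")  # writes ^inst/basilisk$ to .Rbuildignore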

Edit: Well, it turns out that document() will happily ignore .Rbuildignore as well, which leads to some very difficult bugs (due to the presence of hard-coded paths to the source code directory, which are obviously not sensible in the final installation location). Currently, the only remedy is to manually delete the inst/ artifacts after document()... not good.

Edit II: Fixed by the simple expedient of checking whether the installation directory for an R package actually ends with the name of that package.
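i.e., roughly this guard (variable names are illustrative):

# Only populate inst/basilisk when the target looks like a real installed
# location for the package, not a source checkout being document()ed.
looks_installed <- identical(basename(install_dir), pkg_name)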

LTLA commented 4 years ago

The final piece is now in play. basilisk will create a virtual environment that is transparently used by any client package that does not require Python packages outside of core_list. This means that multiple client packages can happily coexist in a single R session without requiring any of the less-efficient fallbacks, as they all just point to this common base environment (subject to some conditions).

Another change is that we cache the miniconda installer script via BiocFileCache, to avoid having to redownload it every time we reinstall basilisk. The installer script is surprisingly small for what it does, but caching it still shaves ~10 seconds off the install time.

There is a remaining question of whether we want to match core_list to Conda's package list. I don't know what the differences are between the OSes - I guess we could take an intersection of all of them? - and I also don't know how to get at these lists programmatically. But many things would be better than the current method of validating that the core set has no conflicts; see inst/README.md.

Edit: And now the circle is complete. Core lists for every OS are defined by scraping the Anaconda website and pulling out those tables for use as a constraint file in pip install -c. This means that basilisk is effectively mediating a lazy install of the entire Anaconda environment, starting from a miniconda base and populating the site-packages on demand. There are some weird bits where different OSes have mismatches in package availability, but whatever, that's Anaconda's problem.
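The constraint mechanism here is just pip's -c flag; invoked from R it might look like this (paths are hypothetical):

# Install a requested package subject to the scraped Anaconda versions:
pip <- file.path(miniconda_root, "bin", "pip")
system2(pip, c("install", "-c", "/path/to/anaconda_constraints.txt", "pandas"))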

mtmorgan commented 4 years ago

Looks like rstudio / reticulate have moved forward with their approach: https://stat.ethz.ch/pipermail/r-package-devel/2020q1/004821.html. Any source for details, @kevinushey?

kevinushey commented 4 years ago

Please see:

vignette("python_dependencies", package = "reticulate")

for a vignette discussing the scheme.

LTLA commented 4 years ago

Looks cool. Could we re-use whatever code you wrote to handle the miniconda installation? Getting the (Ana)conda installer working is giving me endless grief on the BioC Windows build machines.

kevinushey commented 4 years ago

The function reticulate::install_miniconda() can be used to install Miniconda.
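For example (the default installs into reticulate's standard per-user location; see ?install_miniconda for the path and update arguments):

reticulate::install_miniconda()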

kevinushey commented 4 years ago

In addition, if you find reticulate::install_miniconda() doesn't do exactly what you need it to, please let me know and I can either try to accommodate that in reticulate, or you can feel free to copy and tweak the bits you need into the basilisk package itself.

LTLA commented 4 years ago

Thanks @kevinushey. It looks pretty good, and I just learnt about utils::shortPathName from reading its source code, so I'm already coming out ahead re. a solution to my Windows problems. I'll have a think about whether we should keep going with our current Anaconda strategy or use reticulate's miniconda installer and then populate it with some "core" set of Python packages.
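(For anyone following along: utils::shortPathName converts a Windows path to its 8.3 short form, which sidesteps problems with spaces in paths; the output shown is illustrative.)

# Windows only:
utils::shortPathName("C:\\Program Files\\R")
#> [1] "C:\\PROGRA~1\\R"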

And on that note, I just noticed the comment in ?virtualenv_create about the lack of Windows support. Whoops. Well, that might very well be the cause of my problems. I would normally say that's kind of surprising, but I guess I should have seen that coming, Windows being Windows and all.