bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr

slendr has trouble exporting conda paths #154

Closed janxkoci closed 7 months ago

janxkoci commented 7 months ago

When trying to parallelize slendr simulations, I run into the same issue using several different methods, which probably boil down to how slendr exports its conda paths.

In particular, I'm getting the infamous tskit not found errors, whenever I try to:

At the moment, I'm using GNU parallel to run simulations simultaneously, but it still doesn't work at MetaCentrum at all (the assigned compute nodes simply don't see the paths, even if I export them explicitly in the script, or if I use an interactive job).

As I mentioned elsewhere, slendr creates 3 distinct environments, which may be part of the reason why PBS Pro has problems tracking all the paths. But I get the same problem from inside R with an active slendr environment when I try to use the future packages, so the problem may be on slendr's side, when it's exporting paths to its tools.

Please, let me know if there is anything I can test or log, to help with resolving the issue. I use both R and conda (or mamba) pretty much daily, but I'm not very useful when it comes to Python.

bodkan commented 7 months ago

OK, I think I know what might be going on. This is indeed similar to other issues so I'll try to describe what's going on in a bit more detail because it might be helpful for other users of reticulate. I could even imagine some poor non-slendr soul searching for the same errors on Google landing here.

But first it might be useful to clarify some terminology:

I will get to the point (I'm also a heavy user of the future package and am developing a package which uses slendr to run thousands of parallelized simulations, so I recognize your problem) but first I want to establish some background.

TL;DR: The problem isn't on slendr's side in how it exports paths to its tools, because slendr doesn't export any paths at all. I think you're not running future correctly in this setting. But more on that after the example.

Example

To explain the above in concrete terms, it's worth following along, particularly for a non-expert Python user. Let's forget about slendr for a minute.

I don't have conda installed on my systems, but the issue isn't conda-related, it's Python-related, so it's more straightforward to explain without bringing conda into the picture. As I explained yesterday, slendr only uses conda to fetch the python.exe / python binary across Windows/Linux/macOS.

First let's verify that a base Python doesn't have matplotlib installed. I chose matplotlib because it's easy to install and to show the issue is general and not specific to tskit (or even slendr).

$ python3
>>> import matplotlib         # matplotlib doesn't work!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'matplotlib'

Now let's create a tiny Python virtual environment which does have matplotlib installed:

$ cd /tmp
$ python3 -m venv testenv        # create a virtual environment
$ source testenv/bin/activate    # activate the environment
(testenv) $ python3 -m pip install matplotlib

(testenv) $ python3
>>> import matplotlib
>>> matplotlib.__version__   # matplotlib works!
'3.8.3'

Now deactivate the test Python virtual environment to get a plain shell:

(testenv) $ which python3             
/private/tmp/testenv/bin/python3  # python3 interpreter with matplotlib is currently used
(testenv) $ deactivate
$ # now we're no longer in the Python virtual environment
$ which python3
/usr/local/bin/python3  # indeed, after deactivating, the default python3 interpreter is used
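
The same check can be made from inside Python itself. The following is a minimal sketch using only the standard library; the helper name `describe_interpreter` is made up for illustration. Note the assumption behind `in_venv`: it detects venv-style environments only, because conda environments typically report the same value for `sys.prefix` and `sys.base_prefix`.

```python
import sys


def describe_interpreter() -> dict:
    """Report which Python binary is running and whether it lives in a venv."""
    return {
        "executable": sys.executable,    # path of the running interpreter
        "prefix": sys.prefix,            # root of the active environment
        "base_prefix": sys.base_prefix,  # root of the base installation
        # Inside a venv the two prefixes differ; outside they are equal.
        "in_venv": sys.prefix != sys.base_prefix,
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this once inside the activated environment and once after `deactivate` should show a different `executable` path in each case, mirroring the `which python3` output above.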

Let's now start R and try to import matplotlib (as an example of something "Python related" analogous to the "infamous tskit error"). No conda, no slendr, nothing! This is just to demonstrate that the issue you're encountering lies deeper than slendr, for the purposes of better understanding this problem.

$ R --quiet
> library(reticulate)
> import("matplotlib")
Error in py_module_import(module, convert = convert) : 
  ModuleNotFoundError: No module named 'matplotlib'     # <== the infamous error
Run `reticulate::py_last_error()` for details.

You see the "infamous tskit error", as you called it. This happens because the Python interpreter which was embedded into R the moment we executed some Python code (import("matplotlib") in R, i.e. import matplotlib in Python) is one which doesn't have matplotlib installed (typically some "default Python" on the system).
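
To find out whether a given interpreter can import a module without triggering the error, something like the sketch below could be run in that interpreter. It relies only on the standard library's `importlib.util.find_spec`; the function name `missing_modules` and the example module list are assumptions made for illustration.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; "matplotlib" is only present if it was installed
# into the environment that this interpreter belongs to.
required = ["json", "matplotlib"]
print(sys.executable, "is missing:", missing_modules(required))
```

An empty list means every listed module is importable by the interpreter that ran the check, which is not necessarily the interpreter you activated in your shell.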

Now, let's close R altogether, and start it again. But this time we instruct R to pick up a Python environment we care about:

$ R --quiet
> library(reticulate)
> use_virtualenv("/tmp/testenv")   # this is what `init_env()` does internally -- this and nothing else
> import("matplotlib")
Module(matplotlib)

Voilà, beautiful. Stuff works! And the reason it works is that we first activated a Python virtual environment before doing anything Python-related (either venv or conda environment, doesn't matter at all which one). If the Python environment of slendr isn't properly activated first by the R session, there's nothing that can be done by slendr.

Lessons for slendr

The lesson from above is that if R code or an R package requires a Python virtual environment, that environment must be available and activated before the script makes any call into Python! If that doesn't happen, R automatically activates whatever Python is available by default, and then calling something like slendr's msprime() function simply fails with that "infamous error".

slendr specific example

If you load slendr and run a script to simulate data without initializing its internal Python environment (i.e. without calling init_env()), it will fail. Similarly, it will also fail if that environment somehow gets corrupted (for example, when setup_env() is interrupted midway):

library(slendr) # we load slendr BUT DON'T INITIALIZE ITS PYTHON ENVIRONMENT (init_env())

pop <- population("pop", time = 1000, N = 1000)
model <- compile_model(pop, generation_time = 1, direction = "backward")

ts <- msprime(model, sequence_length = 10000, recombination_rate = 1e-8) # THIS FAILS!

Error:

Traceback (most recent call last):
  File "<path>/script.py", line 16, in <module>
    import tskit
ModuleNotFoundError: No module named 'tskit'

Indeed, when we run check_env() we see that none of slendr's internal Python modules are available in the R session:

> check_env()
Summary of the currently active Python environment:

Python binary: /Users/mp/Library/r-miniconda-arm64/envs/r-reticulate/bin/python 
Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:35:41)  [Clang 16.0.6 ] 

slendr requirements:
 - tskit: MISSING ❌ 
 - msprime: MISSING ❌ 
 - pyslim: MISSING ❌ 
 - tspop: MISSING ❌ 

Note that due to the technical limitations of embedded Python, if you
want to switch to another Python environment you will need to restart
your R session first.

This is because by default, if no explicit Python environment activation has happened, the moment some Python code is run by slendr (such as msprime() function above), R picks up whatever Python interpreter is available by default (here it's Python sitting at /Users/mp/Library/r-miniconda-arm64/envs/r-reticulate/bin/python, but it will differ on your system) and just runs with it. Obviously, unless that interpreter has tskit, msprime, etc. available, the simulation is going to fail.

The solution is to call init_env() before doing anything else with slendr, like this:

(note that the R session must be restarted, as instructed by the message just above)

library(slendr) # we load slendr
init_env()      # WE INITIALIZE SLENDR'S PYTHON ENVIRONMENT

pop <- population("pop", time = 1000, N = 1000)
model <- compile_model(pop, generation_time = 1, direction = "backward")

ts <- msprime(model, sequence_length = 10000, recombination_rate = 1e-8) # THIS SUCCEEDS!

Indeed, when we run check_env() again in this R session, we see that the init_env() call properly activated slendr's internal Python environment with all of its dependencies (see that the path to the "Python binary" is now different to the default path used in the previous, failing example):

> check_env()
Summary of the currently active Python environment:

Python binary: /Users/mp/Library/r-miniconda-arm64/envs/Python-3.12_msprime-1.3.1_tskit-0.5.6_pyslim-1.0.4_tspop-0.0.2/bin/python 
Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] 

slendr requirements:
 - tskit: version 0.5.6 ✓ 
 - msprime: version 1.3.1 ✓ 
 - pyslim: version 1.0.4 ✓ 
 - tspop: present ✓ 

If you suspect the Python environment is corrupted, or that it's broken because setup_env() has failed, you will have to clear whatever Python slendr tried to pick up by calling clear_env() and then run setup_env() again, all in a fresh R session.

slendr specific example (parallelization)

Let's look at an example of how the above Python-vs-reticulate-vs-R embedding issue applies to running stuff in parallel, using the future R package as the basis for parallelization. I will demonstrate running parallelized slendr simulations on a single machine in sequential, multicore, and multisession settings (I will assume you are familiar with those as you said you work with those packages yourself).

I am not showing how to run this parallelized across different machines because the same applies -- you just need to make sure that slendr is installed on each remote machine and that setup_env() has been run on each of them. Also, I use future_map_dfr(), but future_lapply() and the like work the same way.

Let's say I have this setup in which I want to parallelize the computation of nucleotide diversity using slendr (so, taking the example above and running it in parallel). I'll demonstrate this on four scenarios, one of which fails in the way you describe.

library(slendr)
init_env(quiet = TRUE) # this internally calls reticulate::use_*env(), as above

library(furrr) # future iterative wrappers

run_sim <- function(rep) {
  pop <- population("pop", time = 1000, N = 1000)
  model <- compile_model(pop, generation_time = 1, direction = "backward")

  ts <- msprime(model, sequence_length = 10000, recombination_rate = 1e-8)
  pi_df <- ts_diversity(ts, sample_sets = ts_names(ts, split = "pop"), mode = "branch")
  pi_df$rep <- rep
  pi_df
}

Scenario I. -- sequential, works fine

Nothing to see here. Because init_env() has been called in this R session (see the code chunk just above), run_sim() can pick it up because it runs in the same process.

plan(sequential)
pi_results_I <- future_map_dfr(1:10, run_sim)

Scenario II. -- multicore, works fine

Mode "multicore" runs futures in processes forked from the current R process.

This works because the forked processes start out as copies of the "parent process", so they inherit the Python interpreter that was embedded into it by the call to init_env() above.

plan(multicore, workers = 10)
pi_results_II <- future_map_dfr(1:10, run_sim)
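
The inheritance behind forking can be illustrated in plain Python. This is only an analogy, not what future or reticulate do internally: the `STATE` dictionary stands in for "an environment has been set up in this process", and all names are hypothetical. It uses `os.fork()`, so it is a sketch that runs on Unix-like systems only.

```python
import os

# Stands in for "a Python environment has been set up in this process".
STATE = {"env_ready": False}


def forked_child_sees_state() -> bool:
    """Fork the current process and report whether the child sees env_ready."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # Unix only
    if pid == 0:
        # Child: starts as a copy of the parent's memory at the time of the fork.
        try:
            os.close(read_fd)
            os.write(write_fd, b"1" if STATE["env_ready"] else b"0")
        finally:
            os._exit(0)  # exit right away, skipping the parent's cleanup handlers
    # Parent: read the child's one-byte answer and reap the child.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


STATE["env_ready"] = True  # "initialize" in the parent before forking
inherited = forked_child_sees_state()
print("forked child sees parent setup:", inherited)
```

Because the child is a copy of the parent, setup done in the parent before the fork is visible in the child without any extra work.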

Scenario III. -- multisession, FAILS!

This executes run_sim() in multisession mode, which gives the error!

This fails because futures are evaluated in independent processes! Those obviously can't share the embedded Python environment of the "parent R process" because, well, they are independent. As I showed above, if no Python virtual environment is initialized explicitly, whatever Python is lying around by default is used and embedded into the R session.

plan(multisession, workers = 10)
pi_results_III <- future_map_dfr(1:10, run_sim) # FAILS!
import tskit
ModuleNotFoundError: No module named 'tskit'
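
The same effect can be reproduced without R. In the sketch below, in-process setup done by a parent Python process is invisible to a freshly started interpreter unless that interpreter repeats the setup itself. The marker path and the function name are made up for illustration; appending to `sys.path` merely stands in for "activating an environment" and is not what init_env() does.

```python
import subprocess
import sys

# Made-up path, used only as a recognizable marker on sys.path.
MARKER = "/hypothetical/slendr-like/site-packages"


def fresh_process_sees_marker(setup_in_child: bool) -> bool:
    """Start a new interpreter and report whether MARKER is on its sys.path."""
    setup = f"sys.path.append({MARKER!r}); " if setup_in_child else ""
    code = f"import sys; {setup}print({MARKER!r} in sys.path)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "True"


sys.path.append(MARKER)  # setup performed in the parent process only

without_setup = fresh_process_sees_marker(setup_in_child=False)
with_setup = fresh_process_sees_marker(setup_in_child=True)
print("child without its own setup sees marker:", without_setup)
print("child with its own setup sees marker:", with_setup)
```

The first child behaves like a multisession worker that never activated an environment, and the second like a worker that performs the activation itself.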

Scenario IV. -- multisession, FIXED!

This is the same as scenario III. above, but no longer gives the error. Although each future is again evaluated in an independent process like those in scenario III., this works because we initialize the correct embedded Python in each parallelized future independently.

plan(multisession, workers = 10)

run_sim_modified <- function(rep) {
  init_env(quiet = TRUE)                                     # <== note this change to run_sim() above!

  pop <- population("pop", time = 1000, N = 1000)
  model <- compile_model(pop, generation_time = 1, direction = "backward")
  ts <- msprime(model, sequence_length = 10000, recombination_rate = 1e-8)

  pi_df <- ts_diversity(ts, sample_sets = ts_names(ts, split = "pop"), mode = "branch")
  pi_df$rep <- rep
  pi_df
}
pi_results_IV <- future_map_dfr(1:10, run_sim_modified)
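
The "initialize at the start of every task" pattern can be sketched in a language-neutral way as follows. This is an illustration under stated assumptions, not slendr's implementation: `ensure_env`, `run_task` and the `setup` callback are hypothetical names, and the module-level flag simply guarantees that setup runs at most once per process, however many tasks that process executes.

```python
_env_ready = False  # module-level flag: every fresh process starts with False


def ensure_env(setup) -> bool:
    """Run setup() unless it already ran in this process; True if it ran now."""
    global _env_ready
    if _env_ready:
        return False
    setup()
    _env_ready = True
    return True


def run_task(rep: int, setup) -> int:
    ensure_env(setup)  # always the first statement of the task
    return rep * 2     # placeholder for the actual simulation work


setup_calls = []
results = [run_task(rep, lambda: setup_calls.append("setup")) for rep in range(3)]
print(results, setup_calls)
```

Putting the call inside the task, rather than before the parallel loop, means it does not matter which process ends up executing the task.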

Conclusion

I hope the long (sorry!) write up clears things up a little. I suggest you play around with the examples I provided and try to relate the conclusions from them to your own situation.

In particular, remember that any independent R process which uses slendr (and, by extension, its dedicated Python environment) must activate that Python environment before any Python function is called. If you don't do this, then by the time you try to simulate something it's already too late, because R has embedded some default Python which is unlikely to have msprime/tskit/pyslim installed. So, any piece of slendr code that is to be run in parallel must call init_env() before anything else, particularly code which is run in a parallelized way across different machines in independent processes.

As an example, I'm working on an ABC package for slendr / SLiM / msprime which involves running potentially millions of parallelized simulations across cores on different machines using futures. Here's an example of an internal simulation function (potentially run across different computers in independent processes) in that project which, in order to avoid the infamous error, makes sure a dedicated slendr Python environment is always activated before anything else.

bhaller commented 7 months ago

Wow, what a writeup, @bodkan! Make sure to put some of this info into your doc!

bodkan commented 7 months ago

That's a great point, @bhaller. Thanks. I should start a F.A.Q. vignette with this being item number 1.

This issue luckily doesn't come up as often as it used to -- it was way worse back when slendr was trying to activate the built-in Python virtual environment during the library(slendr) call. Any time a user's Python stuff was misconfigured, it pulled the rug from under slendr with nothing I could do about it.

Splitting the loading of slendr (library(slendr)) and initializing its Python environment (init_env()) into two individual explicit steps solved most of the obscure low-level issues. I wasn't happy about making this change at first (most of my users don't actually know what Python really is and how it works, which is why I liked the "magic" aspect to this) but it turned out to be a nice example of "explicit is (sometimes) better than implicit".

bhaller commented 7 months ago

The next time I have Python issues, I know who I'm gonna ask ;-)

janxkoci commented 7 months ago

Wow, thanks for this detailed reply, @bodkan! I don't recall all the details of my attempts (I'll check the scripts at the HPCs when I have time), but most likely I used multisession without any special treatment, which leads to the error. I've found multisession to be more reliable than multicore on older systems like CentOS7, which are common at HPCs.

bodkan commented 7 months ago

Yes, especially when you say that you probably used the "multisession" mode, I'm pretty much certain you were running the simulations without activating the Python environment first. This is the only situation in which the error can happen, because it comes from slendr evaluating import tskit (first in the line of imported Python modules, hence the error mentioning tskit and not another Python module) in an R session which didn't have the correct Python environment activated.

Good luck!


I will leave this issue open as a reminder to link to the detailed explanation somewhere in documentation. Probably after cleaning it up a bit and adding a new section between "Example" and "slendr specific example" which will prove the source of the error without bringing parallelized futures into the picture.

bodkan commented 7 months ago

Added a brief description of the error with the link to the writeup above to the slendr website. I expanded it with a basic non-parallelized example of how the error happens. Together with the non-slendr pure R/Python example, this will hopefully become a useful resource to direct users to in case they run into the same problem.

Closing.