Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

Issue running rstan with BiocParallel MulticoreParam #134

Closed phauchamps closed 3 years ago

phauchamps commented 3 years ago

Summary: When running the same Stan model in parallel on a large number of datasets across multiple cores, using the BiocParallel package with the MulticoreParam back-end, I get an error message related to shared-object loading.

Description: In the context of a proteomics research project, I’d like to run the same model on many different datasets (1000+) in parallel across several cores (6 in my case). For this I am using BiocParallel with its MulticoreParam back-end (on Linux). Note that I don’t use the parallelization feature built into Stan (cores = 1).

This works fine for a moderate number of models (100), but when I increase the number of models further, while keeping the same number of cores, I systematically get an error message:

Error: BiocParallel errors element index: 271 (or other element index depending on run) unable to load shared object ‘tmp/Rtmp7TxrWl/file249d104be97c44.so’ tmp/Rtmp7TxrWl/file249d104be97c44.so : file too short

As soon as this happens, the remaining jobs all fail with the same type of error message.

I tried running the batch in serial mode (SerialParam in BiocParallel) and it works fine, so the issue is unlikely to be due to the data specifics of one model in the series.

Since I suspected a resource shortage (e.g. memory), I also tried decreasing the number of cores to limit the number of jobs run simultaneously, but the issue appears even with 2 cores. I also tried decreasing the number of chain iterations to a very low number, but again the issue is still there.

Has anyone experienced this kind of issue in the past (maybe with other packages used in conjunction with MulticoreParam) and found the root cause and a solution?

Unfortunately it will be difficult to provide anything reproducible, since I understand that reproducing the error likely requires the same environment (OS, etc.).

RStan Version: 2.21.2

R Version: 4.0.3

Operating System: Manjaro Linux 20.2.1

mtmorgan commented 3 years ago

Have you tried SnowParam(), which is probably a better bet for complicated tasks? It also enforces a model where packages, etc., are loaded on the worker, and each worker is independent of the others.

Thus instead of something like

library(rstan)
FUN = function(x, <other args>) {
    ## do stuff with RStan
}
bplapply(X, FUN, <other args>, BPPARAM = MulticoreParam())

you would need to

FUN = function(x, <other args>) {
    library(rstan)
    ## do stuff with RStan
}
bplapply(X, FUN, <other args>, BPPARAM = SnowParam())

If your code is structured as several bplapply() calls with the same FUN(), or with FUN1(), FUN2(), ... requiring similarly configured workers, it may make sense to amortize the cost of configuring each worker by starting the workers once:

param = SnowParam()
bpstart(param)
bplapply(X, FUN1, <other args>, BPPARAM = param)
...
bplapply(X, FUN2, <other args>, BPPARAM = param)
bpstop(param)

The workers persist between bplapply() invocations, so the cost of loading libraries, etc., is paid only once.
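As a concrete illustration, here is a minimal runnable sketch of this persistent-worker pattern (the toy FUN is illustrative; a real FUN would load rstan and fit a model):

```r
library(BiocParallel)

## Toy task: report which worker (process) handled it.
FUN <- function(x) {
    Sys.getpid()
}

param <- SnowParam(workers = 2)
bpstart(param)                                # start the workers once
pids1 <- bplapply(1:4, FUN, BPPARAM = param)
pids2 <- bplapply(1:4, FUN, BPPARAM = param)  # same workers reused
bpstop(param)                                 # shut the workers down

## Both calls draw from the same pool of (at most 2) worker PIDs.
length(unique(unlist(c(pids1, pids2))))
```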

phauchamps commented 3 years ago

Hi Martin,

Thanks a lot for your swift answer.

I can indeed try using SnowParam in my case. What, in your view, would make MulticoreParam unsuited to handling complex tasks?

The reason I found MulticoreParam convenient is that, in my (very limited) experience, it is very efficient at creating tasks and shares memory very easily. SnowParam, on the other hand, creates a full R instance per worker, which turned out to be rather slow.

In my case, I don't think I can reuse my code as is with SnowParam: I am using bpmapply() with X being just an array of model indexes, so the bulk of the useful data sits in big data structures passed through the MoreArgs argument of bpmapply(). This is admittedly poor design, but for legacy reasons tied to how rstan is used, the data structure cannot easily be split into smaller pieces belonging solely to each job. Using the code in its current design would therefore most probably lead to an explosion of RAM usage, as I would expect each worker to duplicate the full data set.
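For reference, the pattern described above can be sketched as follows (bigData and fitOne are hypothetical stand-ins; the real code fits a Stan model per index):

```r
library(BiocParallel)

## Hypothetical shared structure holding all datasets; each task i
## only needs bigData[[i]], but the whole list is passed via MoreArgs.
bigData <- replicate(10, rnorm(100), simplify = FALSE)

fitOne <- function(i, data) {
    ## stand-in for fitting model i with rstan
    mean(data[[i]])
}

res <- bpmapply(fitOne, seq_along(bigData),
                MoreArgs = list(data = bigData),
                BPPARAM = MulticoreParam(workers = 2))
```

Under MulticoreParam the forked workers see bigData through copy-on-write memory; under SnowParam each worker would receive its own serialized copy, which is the RAM explosion feared above.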

To come back to my issue with MulticoreParam, what is really striking is that the error always happens around the same number of completed tasks, i.e. between the 265th and 280th task, even if I select the tasks in a different order or run different subsets! I also posted my issue on the Stan user forum, and the first answer suggests the problem might be due to concurrent access to a common resource, i.e. the file containing the Stan model in its compiled state (needed because recompiling the model each time is prohibitive in terms of CPU time and memory). However, if that were the case, I wonder why the issue would only pop up after 260+ tasks in such a reproducible pattern?!

Best Regards,

Philippe

mtmorgan commented 3 years ago

MulticoreParam() might be OK if RStan (I'm guessing that's the problem; it's hard to know without a minimal reproducible example) is loaded solely inside FUN, so

FUN = function(x, <other args>) {
    library(rstan)
    ## do stuff with RStan
}
bplapply(X, FUN, <other args>, BPPARAM = MulticoreParam())

The 'fails after the 265th to 280th task' observation could be quite important: perhaps your code is exhausting the number of open file connections, or the number of connections RStan supports. Again, concrete advice would require a (simple) reproducible example.
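For context, base R caps the number of simultaneously open connections (128 by default), so even a small leak per task would exhaust the table after a few hundred tasks. A base-R-only sketch for monitoring the count, e.g. from inside FUN:

```r
## showConnections() lists currently open user connections
## (stdin/stdout/stderr are excluded by default).
nOpenConnections <- function() {
    nrow(showConnections())
}

## A leak looks like this: opening a file without closing it
## grows the count by one.
before <- nOpenConnections()
con <- file(tempfile(), open = "w")
after <- nOpenConnections()
close(con)
c(before = before, after = after)  # after == before + 1
```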

A 'minimal reproducible example' probably wouldn't be your full analysis, but rather the simplest possible RStan model that triggers the problem; it is probably not related to data size, etc.

phauchamps commented 3 years ago

Very good points! I will indeed try to set up such a simple, reproducible example. Open connections to files is also a very good hint, I think; I will dig into that. Thanks Martin!

phauchamps commented 3 years ago

Hi again @mtmorgan, I created a simple R script that reproduces the problem on a much smaller and more straightforward scale. While playing with it I noticed the following: Stan stores the result of compiling a model on disk (for subsequent reuse), and using this pre-compiled object leads to the error described above. However, if I remove the pre-compiled model object from disk, Stan has to recompile the model script first, and in that case sharing the compiled model object does not lead to any error. I tested this contrasting behaviour several times.

Fortunately, the model compilation can be done once before starting the batch job with MulticoreParam. This gives me a workaround, although a time-consuming and inelegant one.
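A sketch of that compile-once workaround, assuming rstan (the model code and function names here are illustrative, not the actual analysis):

```r
library(rstan)
library(BiocParallel)

## Compile once in the main process, before any parallel work.
modelCode <- "parameters { real mu; } model { mu ~ normal(0, 1); }"
sm <- stan_model(model_code = modelCode)

fitOne <- function(i, sm) {
    ## Reuse the already-compiled model; no per-task recompilation.
    sampling(sm, iter = 200, chains = 1, refresh = 0)
}

fits <- bplapply(1:4, fitOne, sm = sm,
                 BPPARAM = MulticoreParam(workers = 2))
```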

In conclusion, I am closing the issue in this repo: it is most probably due either to a bug in Stan or to a wrong use of Stan (which is somehow the same :-) ). I'll post my reproducible example on the Stan repo. Thanks a lot for your help! Philippe

phauchamps commented 3 years ago

Comment already in last post :-)