dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

WISH: Less aggressive parallelization by default (please don't use *all* CPU cores) #333

Open HenrikBengtsson opened 2 years ago

HenrikBengtsson commented 2 years ago

Hi, I noticed text2vec runs on all CPU cores by default on Unix. This is from:

https://github.com/dselivanov/text2vec/blob/9ddf836b995511d8747cc98f753e9cc706cf3c84/R/zzz.R#L6-L9

https://github.com/dselivanov/text2vec/blob/9ddf836b995511d8747cc98f753e9cc706cf3c84/R/mc_queue.R#L1-L4

Defaulting to all cores causes major problems on machines used by multiple users, but also when there are software tools running at the same time. I spotted this on a 128 CPU core machine. Imagine running another 10-20 processes like that at the same time on this machine - it'll quickly come to a halt, which is a real problem.

Although the behavior can be changed by setting an R option, many users are not aware of the problem ... until the sysadmins yell at them. Also, text2vec might be running deep down as a dependency that other package maintainers are not aware of, so this behavior might be inherited by other packages without them knowing.

Could you please consider switching the default to something more conservative? Personally, I'm in the camp that everything should run sequentially (single core) unless the user configures it otherwise. CRAN itself limits package checks to two CPU cores.
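For reference, until the default changes, a user (or a sysadmin, via a site-wide `Rprofile.site`) can cap this themselves before loading the package. A minimal sketch, assuming the option is named `text2vec.mc.cores` as in the zzz.R linked above; adjust the name if the package uses a different one:

```r
# Cap text2vec's fork-based parallelism to a single worker.
# Assumption: the option name matches what zzz.R sets at load time.
options(text2vec.mc.cores = 1L)

library(text2vec)  # now runs sequentially by default
```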

(Disclaimer: I'm the author of the package suggested below.) If you don't want to do this, could you please consider changing from:

parallel::detectCores(logical = FALSE)

to

parallelly::availableCores(logical = FALSE)

because the latter gives sysadms a chance to limit it on their end, and it also respects CGroups settings, job scheduler allocations, etc. Please see https://parallelly.futureverse.org/#availablecores-vs-paralleldetectcores for more details.
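To illustrate the difference, a sketch; the environment variable names below are examples of settings that `availableCores()` honors, per the parallelly documentation linked above:

```r
# Reports the number of physical cores on the hardware, regardless
# of what this process is actually allowed to use:
parallel::detectCores(logical = FALSE)

# Respects Linux cgroups, job-scheduler allocations (e.g.
# SLURM_CPUS_PER_TASK on Slurm), and the sysadmin fallback
# R_PARALLELLY_AVAILABLECORES_FALLBACK, so it returns what this
# process may use rather than what the machine has:
parallelly::availableCores(logical = FALSE)
```

On an unconstrained laptop the two return the same number; the difference shows up on shared machines and inside job schedulers or containers.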

Thank you

HenrikBengtsson commented 2 years ago

... Also, text2vec might be running deep down as a dependency that other package maintainers might not be aware of, so this behavior might be inherited also be other packages without them knowing.

I don't have time to narrow it down to 100%, but I suspect this happens to oolong (https://cran.r-project.org/web/packages/oolong/index.html) when running R CMD check on it. It's a package that does not import any parallel frameworks itself, yet it spins off 100+ parallel workers when being checked, including when its vignette is checked.

dselivanov commented 2 years ago

I do agree, but in practice it's even more complicated - changing this alone won't help much, because users also need to be aware of threaded BLAS and code from threaded solvers (from the rsparse package, for example). I feel this can only be solved if a series of packages is designed carefully by a single responsible author. Nevertheless, I will consider changing the default behaviour to single-threaded.
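As an aside, the threaded-BLAS part of the problem can at least be capped from R. A sketch using the RhpcBLASctl package (my suggestion, not something mentioned in this thread):

```r
# RhpcBLASctl can query and cap the number of threads used by the
# linked BLAS (OpenBLAS, MKL, ...); it is a no-op with reference BLAS.
library(RhpcBLASctl)

blas_get_num_procs()     # how many threads BLAS would use now
blas_set_num_threads(1)  # force single-threaded BLAS

omp_set_num_threads(1)   # likewise for OpenMP-based solvers
```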


HenrikBengtsson commented 2 years ago

Thank you. Yes, it is a long journey, and more so since it's hard to convince R Core to provide a built-in mechanism to control and protect against this. I'm trying to build up such a mechanism with parallelly, and for those who choose to parallelize via the future ecosystem, there's also a built-in, automatic protection against recursive parallelism. There I'm hoping to attack multi-threaded processing too, which, as you mention, also comes into play. I'm not sure of the best way to do that, but it's clearly a growing potential problem as well. Multi-threading also has the problem that it's not stable under forked parallelization, and R doesn't allow us to protect against that either.

I try to raise awareness wherever I can, especially since this will be a growing problem as more and more tools support parallel processing. Luckily, from empirical admin observation on large academic HPC clusters, it looks like most software runs sequentially by default.

I appreciate your considerations