drizopoulos / JMbayes2

Extended Joint Models for Longitudinal and Survival Data
https://drizopoulos.github.io/JMbayes2/

Wide variation in jm's performance #23

Closed JaderGiacon closed 2 years ago

JaderGiacon commented 2 years ago

Dear Drizopoulos,

I am trying to investigate a wide variation in jm's runtime that a user of our HPC system reported. The differences in runtime across environments are very large and, at least to me and my colleagues, should not happen.

Below I list the runtimes for the different tests I have done, followed by some details of the jm call.

User laptop: Core i9, 32 GB RAM - reported that jm runs in 4 h.
HPC: Intel Xeon Platinum, 48 cores (no Hyper-Threading), Slurm as job scheduler, resources allocated via cgroups:
- Test with 4 cores, 64 GB RAM - 22 h
- Test with 16 cores, 128 GB RAM - I killed it after 22 h of execution
- Drained a node to run without Slurm (entire node, 48 cores, 762 GB RAM) - I killed it after 24 h (it was still running)

To try to mimic the user's laptop, I also ran the job in the cloud:
- r4 instance with 1 core and 32 GB RAM - finished in 5 h
- r4 instance with 2 cores (2 threads per core) - finished in 17.8 h

The runtime differs a lot even between servers with similar configurations. It also seems that increasing the number of cores increases the runtime considerably. Even with an entire server to itself, the job takes much longer than on the user's laptop, although the Intel Xeon Platinum is a better processor than the Core i9.

I am not an expert in R or in joint models. My guess is that something like a cost function varies from step to step of each execution.

The code being run is the following. MCMC summary:
chains: 4
iterations per chain: 27500
burn-in per chain: 2500

pacman::p_load(JMbayes2, splines, nlme, survival, dplyr, install = FALSE)
E607 <- readRDS("objects/E607.rds")
CHHiP <- E607 %>%
  select(trialno, PSAValue_tdc, Tmax, cens_endpt_NonCR0_1, tstart, phase,
         age, arm, ISUP_gleasGG, clint_cat_2, hthintd_2) #%>% print(head(CHHiP))

ID_E607 <- readRDS("objects/ID_E607.rds")
SuC_01 <- readRDS("objects/SuC_01.rds")
LME_full_ME_noPhase_pdDiag_ML <- readRDS("objects/LME_full_ME_noPhase_pdDiag_ML.rds")

(fForms <- readRDS("objects/fForms.rds"))

JMB2_tstart5_v0.23_GitHub <- jm(Surv_object = SuC_01,
                                Mixed_objects = LME_full_ME_noPhase_pdDiag_ML,
                                time_var = "tstart",
                                functional_forms = fForms,
                                n_iter = 27500L, n_burnin = 2500L, n_chains = 4L)

saveRDS(JMB2_tstart5_v0.23_GitHub, "objects/JMB2_tstart5_v0.23_GitHub.rds")

The same code was run on the different servers listed above. The data and code were on fast local NVMe storage, and the script was executed with the command below:
Rscript --vanilla JMB2_tstart5_v0.23_GitHub.r --tmp=./

R and library versions:
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] dplyr_1.0.8        JMbayes2_0.2-0     GLMMadaptive_0.8-5 nlme_3.1-155
[5] survival_3.2-13

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8         magrittr_2.0.3     MASS_7.3-55        tidyselect_1.1.2
 [5] munsell_0.5.0      colorspace_2.0-2   lattice_0.20-45    R6_2.5.1
 [9] rlang_1.0.2        fansi_1.0.3        parallel_4.1.0     grid_4.1.0
[13] gtable_0.3.0       pacman_0.5.1       utf8_1.2.2         cli_3.2.0
[17] coda_0.19-4        ellipsis_0.3.2     matrixStats_0.61.0 tibble_3.1.6
[21] lifecycle_1.0.1    crayon_1.5.1       Matrix_1.4-0       gridExtra_2.3
[25] purrr_0.3.4        ggplot2_3.3.5      vctrs_0.4.0        glue_1.6.2
[29] compiler_4.1.0     pillar_1.7.0       generics_0.1.2     scales_1.1.1
[33] pkgconfig_2.0.3

Any information on why there is such a huge difference in runtime, even with a better CPU and more cores, would be much appreciated. I know some libraries are tuned for specific CPUs, core counts, etc.; I am not sure about jm.

Kind regards,
Jader

drizopoulos commented 2 years ago

Thanks for reporting this. I do not have access to servers with this configuration to test this myself. Nonetheless, a couple of points:

JaderGiacon commented 2 years ago

Hi @drizopoulos,

Thanks for the information. Sorry for not mentioning the clock speeds. The servers I have tested on-premises use the Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz (up to 3.1GHz). The R4 instances in the cloud use the Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (up to 3.0GHz). Both Xeons are server-class processors with a better architecture than the Core i9, so I do not think the job should run slower than on a laptop. For example, the same run on two R4 instances in the cloud had a huge difference in runtime.

It is good to know about n_chains. It explains why there were 4 processes running as:
R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11561 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=parallel

However, the number of threads was completely different in each environment. For example, when running on an entire server with 48 cores, each of the processes listed above had 56 threads (4 processes x 56 threads). When running on a 4-core server, each process had 8 threads.

I will discuss parallel::makeCluster() with the user, but I do not think they will change the jm library code to use another parallelism tool.

My concern in reporting this issue is that the library might have a bug for certain parameterizations and not run efficiently on different numbers of cores. It is possible to reproduce the problem if you have access to a cloud environment.

I will let you know about the further tests I will be doing.

Thanks in advance.
Kind regards,
Jader

drizopoulos commented 2 years ago

It would be good if you tried running it without loading any other packages. Perhaps some of the other packages also set up clusters that interfere with JMbayes2. Perhaps try a very minimal version that just loads the data using readRDS() (i.e., do all data management steps beforehand) and then runs the model, as in the sketch below.
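For example, a minimal version could look something like this (a sketch only: it reuses the object file names from your script above, loads nothing but JMbayes2, and the output file name is just a placeholder):

# minimal reproduction: only JMbayes2 attached, all data management done beforehand
library(JMbayes2)

SuC_01 <- readRDS("objects/SuC_01.rds")
LME_full_ME_noPhase_pdDiag_ML <- readRDS("objects/LME_full_ME_noPhase_pdDiag_ML.rds")
fForms <- readRDS("objects/fForms.rds")

# same model call as before, nothing else loaded in the session
fit <- jm(Surv_object = SuC_01,
          Mixed_objects = LME_full_ME_noPhase_pdDiag_ML,
          time_var = "tstart",
          functional_forms = fForms,
          n_iter = 27500L, n_burnin = 2500L, n_chains = 4L)

saveRDS(fit, "objects/JMB2_minimal_check.rds")  # placeholder output name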

JaderGiacon commented 2 years ago

Thanks,

I can give it a try. My main concern is the difference between a laptop and a server with the same R libraries loaded, although the laptop has many other libraries installed, even if they are not used. At the moment I have installed the doParallel package to check whether it makes any difference.

drizopoulos commented 2 years ago

I do not think loading the doParallel package will make a difference, because my package uses the parallel package. But just to check: are you running things in parallel on top of what the JMbayes2 package is doing on its own? If so, mixing different packages to do the parallelization may cause problems.

JaderGiacon commented 2 years ago

No. The code being run is the example I put in the report; it is the code the user gave me to reproduce the problem. The call that takes a long time is:
JMB2_tstart5_v0.23_GitHub <- jm(Surv_object = SuC_01,
                                Mixed_objects = LME_full_ME_noPhase_pdDiag_ML,
                                time_var = "tstart",
                                functional_forms = fForms,
                                n_iter = 27500L, n_burnin = 2500L, n_chains = 4L)

It runs 4 worker processes like:
R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11616 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=parallel

JaderGiacon commented 2 years ago

Hi @drizopoulos,

Up to now I have no clue why it runs 5x faster on a server with 1 core or on the user's laptop. At the moment I am testing the Minimal Example you have in the README, with exactly the same code.

It has been running for 1:40 h. I will also run it on 1 core in the cloud to compare. In addition, on another node I am running the minimal example with Rprof to profile the jm function.
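For reference, the profiling wrapper I am using is roughly the following (a sketch; the output file name is mine, and Surv_fit / Mixed_fit are placeholders for the Minimal Example's fitted survival and mixed models):

Rprof("jm_profile.out")                     # start R's sampling profiler
fit <- jm(Surv_object = Surv_fit,           # placeholder: the jm() call being profiled
          Mixed_objects = Mixed_fit,
          time_var = "time")
Rprof(NULL)                                 # stop profiling
summaryRprof("jm_profile.out")$by.self      # tables like the ones posted below
summaryRprof("jm_profile.out")$by.total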

I will let you know the results.

Thanks in advance for all the help.

JaderGiacon commented 2 years ago

Hi @drizopoulos,

Just to let you know, the test of the Minimal Example in the cloud with 1 core took:
MCMC summary:
chains: 3
iterations per chain: 12000
burn-in per chain: 2000
thinning: 5
time: 7.9 min

Less than 8 min.

The same code has been running for more than 1 h in another cloud instance with 4 cores (2 x 2), and likewise on a physical server with 4 cores.

Something odd seems to happen in the library when it runs with more than 1 core.

So, the important parts of the Rprof output for this 8-minute run:

$by.self
                self.time self.pct total.time total.pct
"mcmc_cpp"         463.26    96.84     463.26     96.84
"mlogLik_jm"         7.60     1.59       7.60      1.59
"colLogSumExps"      0.54     0.11       0.54      0.11
"FUN"                0.46     0.10     467.20     97.66
"rowSums"            0.04     0.01       0.04      0.01
"jm_fit"             0.02     0.00     477.48     99.81
"lm"                 0.02     0.00       1.20      0.25

$by.total
             total.time total.pct self.time self.pct
"jm"             478.40    100.00      0.00     0.00
"jm_fit"         477.48     99.81      0.02     0.00
"lapply"         468.74     97.98      0.00     0.00
"FUN"            467.20     97.66      0.46     0.10
"mcmc_cpp"       463.26     96.84    463.26    96.84
"mlogLik_jm"       7.60      1.59      7.60     1.59
"doTryCatch"       2.56      0.54      0.00     0.00
"try"              2.56      0.54      0.00     0.00
"tryCatch"         2.56      0.54      0.00     0.00

I hope the Rprof output from the multi-core tests can help identify what is causing the slowness.

Thanks

drizopoulos commented 2 years ago

Are these results from the server that takes too long? It would help to provide the results under both circumstances.


JaderGiacon commented 2 years ago

The results are from the cloud server with 1 core. The Minimal Example on the same cloud server with 2 cores and 2 threads per core is still running (many hours). I will provide the results when it finishes.

JaderGiacon commented 2 years ago

Hi @drizopoulos,

See below the results for the same cloud instance, but with 2 cores and 2 threads per core: almost 9 h to run the Minimal Example that finished in 8 minutes on 1 core.

MCMC summary:
chains: 3
iterations per chain: 12000
burn-in per chain: 2000
thinning: 5
time: 8.6 hours

[1] "Profiling - Only the longest ones"

$by.self
               self.time self.pct total.time total.pct
"mlogLik_jm"     1286.66    98.94    1286.66     98.94
"acf"               0.94     0.07       1.24      0.10
"unserialize"       0.78     0.06       0.78      0.06
"FUN"               0.74     0.06       9.82      0.76
[...]
"sys.function"      0.04     0.00       0.04      0.00
"jm_fit"            0.02     0.00    1299.62     99.94
"spectrum0.ar"      0.02     0.00       6.60      0.51

$by.total
                    total.time total.pct self.time self.pct
"jm"                   1300.42    100.00      0.00     0.00
"jm_fit"               1299.62     99.94      0.02     0.00
"mlogLik_jm"           1286.66     98.94   1286.66    98.94
"lapply"                 10.94      0.84      0.04     0.00
"FUN"                     9.82      0.76      0.74     0.06
"spectrum0.ar"            6.60      0.51      0.02     0.00
"tryCatchList"            5.32      0.41      0.02     0.00
"tryCatch"                5.32      0.41      0.00     0.00
"doTryCatch"              5.30      0.41      0.00     0.00
"try"                     5.30      0.41      0.00     0.00
"tryCatchOne"             5.30      0.41      0.00     0.00
"summary.mcmc.list"       5.06      0.39      0.00     0.00
"safespec0"               4.26      0.33      0.02     0.00

This test used the Minimal Example with only the JMbayes2 library loaded.

JaderGiacon commented 2 years ago

Hi @drizopoulos,

I have been analysing the code and checked the parallel library. There are many known issues with using parallel to detect cores; even the help page for detectCores() mentions it:
"Note: This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, second because it does not give the number of allowed cores, and third because on Sparc Solaris and some Windows boxes it is not reasonable to try to use all the logical CPUs at once."

Many other issues are listed by the parallelly package, which tries to address the problems in parallel: https://cran.r-project.org/web/packages/parallelly/readme/README.html

What intrigues me is that with n_chains = 4 (OK, I understand that jm_fit takes the minimum of n_chains and detectCores()), it creates 4 processes (one per CPU, as expected), but each process has 8 threads. I do not think that is a good strategy for something CPU-intensive: threads competing for CPU slots can severely degrade performance. Also, I do not know whether each chain's computation is done on its own CPU or whether the computation for all chains is being done 4 times.

I ran with n_chains = 1 on 1 core to test, and it created one thread in the process. It also finished in 1 h for the example that took 21 h using n_chains = 4 on a 4-core CPU.

I am not an expert in R. I would like to test with n_chains = 4 but running only one thread per chain; I do not know how to do that yet, beyond the rough idea sketched below.
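What I have in mind is something like the following (an assumption on my side, not something jm() exposes; whether it actually caps the threads depends on which BLAS/OpenMP runtime this R build is linked against):

# Idea: cap the numerical libraries at 1 thread per process, so 4 chains use
# roughly 4 CPUs instead of 4 x N threads. Environment variables set here are
# inherited by the PSOCK worker processes that jm() launches, since those
# workers are new R processes.
Sys.setenv(OMP_NUM_THREADS      = "1",
           OPENBLAS_NUM_THREADS = "1",
           MKL_NUM_THREADS      = "1")

library(JMbayes2)

fit <- jm(Surv_object = SuC_01,
          Mixed_objects = LME_full_ME_noPhase_pdDiag_ML,
          time_var = "tstart",
          functional_forms = fForms,
          n_iter = 27500L, n_burnin = 2500L, n_chains = 4L)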

I will fork the code to do some experiments, but I will need to study R, how to compile it, etc. Any tips for testing the code quickly would be much appreciated. I am not an expert in R development; I am just debugging the library's execution from an infrastructure perspective.

Thanks

JaderGiacon commented 2 years ago

Hi @drizopoulos,

Did you have a chance to look at the code regarding the number of threads per process?

I think there is something odd with parLapply when it is called as parLapply(cl, chains, mcmc_parallel, ...). In theory, it should take the cluster (for example cl of size 4) and the chains (for example 1, 2, 3, 4) and pass each chain to mcmc_parallel to run in its own process.

So it should run one mcmc_parallel thread on each CPU, but it is running 4 threads on each CPU. It is as if all chains were running on each CPU.

[screenshot attachment: JMBayes2_4chains]

Thanks in advance.

drizopoulos commented 2 years ago

I think the way I'm using it is the "official" way to use it. I don't see how it should be done differently.

JaderGiacon commented 2 years ago

Hi @drizopoulos,

I am pretty sure you are using the best way to do it. I was unhappy with my choice of the word "wrong" and have already changed it to "weird". My apologies for that.

Regarding the tests, my lack of knowledge in R does not allow me to go further, and I am not authorised to spend too many more hours on this. However, I was able to test the parallel, parLapply and lapply functions and understand their operation better.

At first I thought the high number of threads per parLapply process was something odd in that function. After some tests to understand its operation, my suspicion shifted to the "mcmc_cpp" function, but I could not identify which part of that function is creating the multiple threads.

Any information about it would be great.

Thanks in advance.
Kind regards,
Jader

drizopoulos commented 2 years ago

The mcmc_fit() function is a C++ function that combines Rcpp and RcppArmadillo. I'm not aware of anything in the interfaces of these packages that causes issues on servers. Note that the parallelization is done in R, not in C++.
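One thing that might be worth checking (just a guess on my part) is which BLAS/LAPACK your R build is linked against: RcppArmadillo delegates dense linear algebra to R's BLAS/LAPACK, and a multi-threaded BLAS (e.g., OpenBLAS, which conda builds of R often use) could spawn extra threads inside the C++ code without any explicit parallelism in the package. A quick check, assuming the RhpcBLASctl package is installed:

# Report which BLAS/LAPACK this R build uses and its default thread counts.
si <- sessionInfo()
cat("BLAS:   ", si$BLAS, "\n")
cat("LAPACK: ", si$LAPACK, "\n")

if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  cat("BLAS procs:        ", RhpcBLASctl::blas_get_num_procs(), "\n")
  cat("OpenMP max threads:", RhpcBLASctl::omp_get_max_threads(), "\n")
}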

JaderGiacon commented 2 years ago

Thanks for the explanation.

Unfortunately, I am not authorised to spend more hours on it. We have found a workaround that forces the library to "see" one CPU inside a Slurm allocation; with that, it runs in a time similar to the user's laptop.

I have described previously how to reproduce the problem. It is easy to reproduce in a cloud environment: just run the code on a Linux instance with 1 CPU and then on one with more CPUs. I would really appreciate it if the community could test this in their HPC or cloud environments.

My skills in R and C++ are very limited; I can contribute infrastructure analysis, as that has been my long career in IT. My own tests in R using parallel, makeCluster and parLapply worked fine (sending different sleep commands to the machine based on a vector of numbers): each one ran on a different CPU and created only one thread, as expected.

I could not identify why mcmc_cpp is creating too many threads.

aejen commented 1 year ago

We have found the same issue on our HPC system, i.e. the package uses 36 cores when only 3 were requested. We believe it stems from the use of the parallel::detectCores() function, which returns the number of cores on the current machine rather than the number of cores allowed/allocated for use. Would it be possible to change parallel::detectCores() to parallelly::availableCores() instead? That function seems to respect resource allocation requests better, as the small comparison below shows. https://www.jottr.org/2022/12/05/avoid-detectcores/
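To illustrate the difference (a sketch assuming a Slurm job that requested 3 cores on a 36-core node, and that the parallelly package is installed; variable names are ours):

library(parallelly)

n_machine <- parallel::detectCores()       # reports the whole machine, e.g. 36
n_allowed <- parallelly::availableCores()  # respects Slurm/cgroup limits, e.g. 3

# a jm()-style cap on the number of chain workers would then be
n_chains <- 3L
cores <- min(n_chains, n_allowed)          # instead of min(n_chains, n_machine)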

drizopoulos commented 1 year ago

I made a change in the GitHub version. Could you give it a try?

aejen commented 1 year ago

Thank you! Yes, I will test it and post the results.

aejen commented 1 year ago

I apologize for the delayed reply. While this change does seem to help somewhat (i.e. not all 36 cores on the node are used), it still somehow uses 5-8 cores, even with cores = 1 and n_chains = 1, regardless of how many cores were requested (either 1 or 3) in the allocation request for the job. I hope to do a bit more testing to see if I can pinpoint the reason for this.