DaliangNing / iCAMP1

Infer Community Assembly Mechanisms by Phylogenetic bin-based null model analysis (Version 1)
GNU General Public License v2.0

Process in "D" state on NFS filesystem in an HPC environment #39

Closed melop closed 1 year ago

melop commented 1 year ago

Dear developer, I am an administrator of an HPC cluster. One of our users is running iCAMP on a cluster running SLURM and Ubuntu 22.04, with an installation of R 4.3.0. We notice that whenever iCAMP is called, it spawns many threads, some of which show a status of "D" (uninterruptible sleep). This has made the NFS filesystem unresponsive, which affects other users. Do iCAMP or any of the packages it depends on place file locks? If so, on which file(s), and what happens if we disable file locking on the NFS partition? Thank you!

DaliangNing commented 1 year ago

iCAMP uses some functions (e.g., big.matrix) from the R package bigmemory and parallel computing functions (e.g., parLapply) from the package parallel. I do not know which of these would cause file locking. Two possible reasons:

(1) Many functions in iCAMP, e.g., icamp.big, let the user set the number of parallel computing threads through the parameter 'nworker'. If the user sets nworker to a very large number (more than the number of CPU cores), it will occupy too many threads and can make the computer completely unresponsive.

(2) The big.matrix function used in some iCAMP functions (e.g., pdist.big) saves a relatively large file with the suffix '.bin' somewhere on the hard disk and records its physical address in a '.desc' file. The data in the '.bin' file are then used as if they were held in memory, so the calculation can go beyond the memory limit. The '.bin' file has to stay where it was created; it will not work if it is physically moved (copy/paste) to another place. But this is unlikely to make the computer unresponsive.

I have used iCAMP a lot on our Linux and Windows servers without problems, as long as I keep nworker (the thread number) below the number of CPU cores. But I have not deployed it on an HPC cluster yet.
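
The advice about keeping nworker below the CPU core count could be enforced in the user's R script with something like the sketch below. It is only an illustration, not part of iCAMP: safe_nworker is a hypothetical helper name, and it assumes the job was submitted with --cpus-per-task so that SLURM sets the SLURM_CPUS_PER_TASK environment variable. If that variable is missing, it falls back to parallel::detectCores(), which reports the cores of the whole node and may be more than the job was allocated.

```r
# Hypothetical helper (not an iCAMP function): pick a worker count that
# never exceeds the CPUs available to the current job.
safe_nworker <- function(requested,
                         slurm_cpus = Sys.getenv("SLURM_CPUS_PER_TASK", unset = NA),
                         detected = parallel::detectCores()) {
  # SLURM_CPUS_PER_TASK is a string; anything non-numeric becomes NA.
  allocated <- suppressWarnings(as.integer(slurm_cpus))

  # Prefer the SLURM allocation; otherwise use the detected core count.
  limit <- if (!is.na(allocated) && allocated > 0L) allocated else detected

  # detectCores() can return NA on some platforms; fall back to one worker.
  if (is.na(limit) || limit < 1L) limit <- 1L

  as.integer(max(1L, min(as.integer(requested), limit)))
}

# Example: the user asks for 64 workers, but the result is capped.
nworker <- safe_nworker(requested = 64)

# The capped value would then be passed on, e.g.
#   icamp.big(<other arguments omitted here>, nworker = nworker)
```

With this in place, a request for 64 workers inside an 8-CPU allocation would yield nworker = 8 rather than 64 threads competing for 8 cores.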

DaliangNing commented 1 year ago

I hope the problem has been solved. If there are no more questions, I will close this issue.