3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

RELION-3.1 Pre-read all particles into RAM #514

Closed: kaoweichun closed this issue 4 years ago

kaoweichun commented 4 years ago

Hello,

I used RELION 3.1-beta (commit a6aaa5) to repeat a previously completed 3D auto-refine from RELION 3.0.7 without changing any settings. I ran it on a single machine (specs below), and I always enabled Pre-read all particles into RAM. The typical behaviour in 3.0.7 is that the master MPI proc uses 250 GB RAM and each slave MPI proc uses some 20 GB RAM. In RELION 3.1, however, upon starting the refinement the master MPI proc consumed all available RAM until mpirun crashed, before the initial noise spectra were estimated (update: each slave MPI proc used only around 4 GB RAM at this stage).

After observing this RAM usage behaviour in RELION 3.1, I simply disabled Pre-read all particles into RAM; that avoided the problem and the refinement proceeded, but I am afraid there is an issue with memory usage.

Thanks,

WCK


Brief computer specs: 48 cores (hyperthreading enabled) / 384 GB RAM / 2 TB SSD / Open MPI 3.0.2 / 4x GTX 1080 Ti / SGE / CentOS 7.6 / CUDA 10.1

biochem-fan commented 4 years ago

Are you sure you used the same Use parallel disc I/O? setting?

kaoweichun commented 4 years ago

Yes, Use parallel disc I/O? is always off. Otherwise, I suppose each slave MPI proc would use the same amount of RAM as the master does?

biochem-fan commented 4 years ago

This is puzzling... We didn't change the code related to pre-reading images into RAM. Does this happen with other datasets as well?

kaoweichun commented 4 years ago

Yes, it happened with other datasets and on another computer as well. That computer has even more RAM (768 GB), and RELION 3.1 still attempted to fill it with the master MPI proc (each slave MPI proc used roughly < 10 GB RAM). It uses Open MPI 2.1.1 and 2x Tesla M60, so the issue appears to be independent of the systems I am using.

ashkumatov commented 4 years ago

Hi, I am having a similar issue. I monitored RAM usage now (RELION 3.1) and before (<3.1 versions). In 3.1-beta the particles are read into RAM, and then at the first step (maximization) RAM suddenly fills to its maximum and there is a long wait. In subsequent steps, after RELION prints Maximization is done in XX seconds, RAM again fills to the maximum (actually more than NCPU x du -hs Extract/jobXXX), and there is a 5-10 min wait until RELION moves to the next iteration. With the same dataset read from disk, there is no waiting. In 3.0-stable there was no such issue.

biochem-fan commented 4 years ago

What are the box size, pixel size and resolution?

Are you familiar with gdb? Can you investigate what RELION is doing during the "5-10 min waiting"? Find the process ID of one of the MPI processes (not the master), attach gdb with gdb -p processID, and run bt (backtrace).

> reading from disk

Is this from the scratch space, or from the original location?

ashkumatov commented 4 years ago

> What are the box size, pixel size and resolution?

It has never been an issue before, but OK: after 3x decimation it's 200 px at 0.7*3 A/px, and this is a 2D classification. Exactly the same happens during the subsequent 3D classification.

> Can you investigate what RELION is doing during "5-10 min waiting"?

```
Attaching to process 38416
[New LWP 38420]
[New LWP 38423]
[New LWP 38575]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5fb204d093 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
```

> Is this from the scratch, or from the original location?

I don't think that's the issue, but essentially I move the extracted job to /ssd and then just create a softlink in the RELION directory; this works fine in RELION 3.0-stable.

biochem-fan commented 4 years ago

Can you do bt (backtrace) in GDB?

ashkumatov commented 4 years ago

As in gdb bt -p XX? I'm not really familiar with gdb.

ashkumatov commented 4 years ago

By the way, I rolled back to RELION 3.0-stable on the same GPU station and it works flawlessly.

biochem-fan commented 4 years ago

After gdb -p XX, it will show a (gdb) prompt. Please type bt there.

ashkumatov commented 4 years ago

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f7e28d06bf9 in __GI___poll (fds=0x55a5b2dc5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007f7e28d06bf9 in __GI___poll (fds=0x55a5b2dc5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f7e29270403 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#2  0x00007f7e2926760b in opal_libevent2022_event_base_loop () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#3  0x000055a5b0e984a3 in ?? ()
#4  0x000055a5b0e96aea in ?? ()
#5  0x00007f7e28c13b97 in __libc_start_main (main=0x55a5b0e96aca, argc=43, argv=0x7ffd6eedc358, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd6eedc348) at ../csu/libc-start.c:310
#6  0x000055a5b0e969ea in ?? ()
```

biochem-fan commented 4 years ago

Thanks. Can you try the same with other MPI processes?

ashkumatov commented 4 years ago

Sure. My command:

```
`which relion_refine_mpi` --o Class2D/job037/run --i Extract/job019/particles.star --dont_combine_weights_via_disc --preread_images --pool 300 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 280 --K 40 --flatten_solvent --zero_mask --strict_highres_exp 8 --oversampling 1 --psi_step 12 --offset_range 20 --offset_step 4 --norm --scale --j 1 --gpu "0:1" --pipeline_control Class2D/job037/
```

After it executes, it gets stuck, so I attached gdb to each process:

```
root@jekyll:/home# for i in `ps -aux | grep emuser | grep relion_refine | awk {'print $2'}`; do echo $i; done
6459
6468
6469
6470
root@jekyll:/home# gdb -p 6459
GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
Attaching to process 6459
[New LWP 6464]
[New LWP 6465]
[New LWP 6466]
[New LWP 6467]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f72ce84dbf9 in __GI___poll (fds=0x5610720c5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007f72ce84dbf9 in __GI___poll (fds=0x5610720c5360, nfds=16, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f72cedb7403 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#2  0x00007f72cedae60b in opal_libevent2022_event_base_loop () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.20
#3  0x00005610706d14a3 in ?? ()
#4  0x00005610706cfaea in ?? ()
#5  0x00007f72ce75ab97 in __libc_start_main (main=0x5610706cfaca, argc=43, argv=0x7ffe718cd4e8, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe718cd4d8) at ../csu/libc-start.c:310
#6  0x00005610706cf9ea in ?? ()
(gdb) quit
Detaching from program: /usr/bin/orterun, process 6459
[Inferior 1 (process 6459) detached]

root@jekyll:/home# gdb -p 6468
GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
Attaching to process 6468
[New LWP 6471]
[New LWP 6472]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f70486024b9 in __brk (addr=0x56499416a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
31      ../sysdeps/unix/sysv/linux/x86_64/brk.c: No such file or directory.
(gdb) bt
#0  0x00007f70486024b9 in __brk (addr=0x56499416a000) at ../sysdeps/unix/sysv/linux/x86_64/brk.c:31
#1  0x00007f7048602591 in __GI___sbrk (increment=159744) at sbrk.c:56
#2  0x00007f7048587199 in __GI___default_morecore (increment=<optimized out>) at morecore.c:47
#3  0x00007f704857fdac in sysmalloc (nb=nb@entry=160016, av=av@entry=0x7f70488d7c40) at malloc.c:2489
#4  0x00007f7048580ff0 in _int_malloc (av=av@entry=0x7f70488d7c40, bytes=bytes@entry=160000) at malloc.c:4125
#5  0x00007f70485832ed in __GI___libc_malloc (bytes=160000) at malloc.c:3065
#6  0x00007f7049155258 in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000563f849532c3 in MultidimArray::resize(long, long, long, long) ()
#8  0x0000563f84953ce3 in ExpImage::ExpImage(ExpImage const&) ()
#9  0x0000563f849548d0 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#10 0x0000563f84960101 in void std::stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#11 0x0000563f8494fd99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#12 0x0000563f848cc14b in MlOptimiserMpi::iterate() ()
#13 0x0000563f848895a7 in main ()
(gdb) quit
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6468
[Inferior 1 (process 6468) detached]

root@jekyll:/home# gdb -p 6469
GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
Attaching to process 6469
[New LWP 6475]
[New LWP 6476]
[New LWP 6498]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
249     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
#1  0x000055d22dfd1a65 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#2  0x000055d22dfdb497 in __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > > std::__move_merge<ExpParticle*, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(ExpParticle*, ExpParticle*, ExpParticle*, ExpParticle*, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#3  0x000055d22dfdc036 in void std::__merge_sort_with_buffer<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#4  0x000055d22dfdcf5b in void std::__stable_sort_adaptive<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#5  0x000055d22dfdd15a in void std::stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#6  0x000055d22dfccd99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#7  0x000055d22df4914b in MlOptimiserMpi::iterate() ()
#8  0x000055d22df065a7 in main ()
(gdb) quit
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6469
[Inferior 1 (process 6469) detached]

root@jekyll:/home# gdb -p 6470
GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
Attaching to process 6470
[New LWP 6473]
[New LWP 6474]
[New LWP 6499]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
249     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:249
#1  0x00005643a715aa65 in std::vector<ExpImage, std::allocator<ExpImage> >::operator=(std::vector<ExpImage, std::allocator<ExpImage> > const&) ()
#2  0x00005643a7164b29 in void std::__merge_sort_with_buffer<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#3  0x00005643a7165f5b in void std::__stable_sort_adaptive<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, ExpParticle*, long, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#4  0x00005643a716615a in void std::stable_sort<__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)> >(__gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__normal_iterator<ExpParticle*, std::vector<ExpParticle, std::allocator<ExpParticle> > >, __gnu_cxx::__ops::_Iter_comp_iter<bool (*)(ExpParticle, ExpParticle)>) ()
#5  0x00005643a7155d99 in Experiment::randomiseParticlesOrder(int, bool, bool) ()
#6  0x00005643a70d214b in MlOptimiserMpi::iterate() ()
#7  0x00005643a708f5a7 in main ()
(gdb) quit
Detaching from program: /home/software/relion/git-relion-3.1_beta/build-relion3.1_beta-20191025_cu92/bin/relion_refine_mpi, process 6470
[Inferior 1 (process 6470) detached]
```

biochem-fan commented 4 years ago

Thanks. This is very useful.

Another question: how many particles do you have?

ashkumatov commented 4 years ago

> Another question: how many particles do you have?

A moderate amount: 148k.

biochem-fan commented 4 years ago

OK, I think I understand what is happening.

How many optics groups do you have? If you have only one: does --random_seed 0 make it faster?

ashkumatov commented 4 years ago
```
# version 30001

data_optics

loop_
_rlnOpticsGroupName #1
_rlnOpticsGroup #2
_rlnMicrographOriginalPixelSize #3
_rlnVoltage #4
_rlnSphericalAberration #5
_rlnAmplitudeContrast #6
_rlnImagePixelSize #7
_rlnImageSize #8
_rlnImageDimensionality #9
opticsGroup1            1     0.784000   300.000000     2.550000     0.100000     2.352000          200            2

# version 30001

data_particles

loop_
```
ashkumatov commented 4 years ago

> --random_seed 0

I will check later. But if the optics groups are the cause, wouldn't I have the same issue when particles are not read into RAM?

biochem-fan commented 4 years ago

The problem seems to be the sorting of particles by optics group: https://github.com/3dem/relion/blob/ver3.1/src/exp_model.cpp#L394. When particles are pre-read into RAM, the ExpImage objects become much larger, so copying them during the sort takes more time and can fragment memory.

This sorting was not present in RELION 3.0; --random_seed 0 prevents calls to this function.
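For illustration, here is a minimal sketch of that cost, using a hypothetical Particle type and sizes (not RELION's actual ExpParticle/ExpImage code): std::stable_sort shuffles whole elements through a temporary buffer, so when each element owns a large pre-read image, every transfer that falls back to a copy duplicates that image in memory.

```cpp
// Minimal sketch, not RELION's code: sorting elements that own large buffers.
// std::stable_sort transfers elements through a temporary buffer; if the
// element type ends up being copied rather than moved (e.g. when built by a
// pre-C++11 compiler), each transfer duplicates the whole image, costing time
// and transient memory.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Particle {
    int optics_group;        // sort key, analogous to the optics group ID
    std::vector<float> img;  // pre-read image data; large when preread is enabled
};

int main() {
    std::vector<Particle> particles;
    for (int i = 0; i < 1000; ++i)
        particles.push_back({i % 3, std::vector<float>(200 * 200)});  // 200 px boxes

    // Group particles by optics group before randomising the order.
    std::stable_sort(particles.begin(), particles.end(),
                     [](const Particle &a, const Particle &b) {
                         return a.optics_group < b.optics_group;
                     });
    std::printf("first optics group after sorting: %d\n", particles.front().optics_group);
    return 0;
}
```

With --random_seed 0 this code path is skipped entirely, which is why the workaround avoids both the wait and the memory spike.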

ashkumatov commented 4 years ago

I see. Thanks!

biochem-fan commented 4 years ago

I made an improvement to the code; can you test it without setting --random_seed 0?

biochem-fan commented 4 years ago

The latest version on the repository should fix this issue. If not, please reopen this issue.

ashkumatov commented 4 years ago

In the latest RELION 3.1 (commit 9d7525), when reading particles into RAM one still has to specify --random_seed 0; otherwise it eats up all RAM and stalls. When reading from disk, the behaviour is normal.

biochem-fan commented 4 years ago

Although commit 76fa3d28 reduced the memory usage, RELION 3.1 still needs more space and more operations than 3.0 because of optics groups. I don't think we can reduce it further.

If you have plenty of RAM, you can create a RAM disk (e.g. a tmpfs mount) and use it as scratch space.

biochem-fan commented 4 years ago

Note to self:

biochem-fan commented 4 years ago

@nym2834610, @ashkumatov What is the number of particles? Which compiler did you use?

nym2834610 commented 4 years ago

~200k particles at 1 A/pix, box size 400. We use the bash shell.

biochem-fan commented 4 years ago

What is the compiler (not the shell)?

ashkumatov commented 4 years ago

Actually I tried with 6k particles, which is about 10 GB, running on 4 GPUs with 5 CPUs in total. If I don't use the --random_seed 0 flag, RAM consumption goes up to 240 GB, and at the peak-consumption steps there is a really long wait. If I use the flag, the behaviour is standard.

I will check the compiler version on Monday.

biochem-fan commented 4 years ago

Is the memory consumption more or less proportional to the number of particles? How much does it use with 3k particles, for example?

ashkumatov commented 4 years ago

It loads a proportional amount into RAM, and then at certain steps it climbs to the maximum RAM available.

biochem-fan commented 4 years ago

Does it use all the RAM and take very long even with, say, 100 particles?

ashkumatov commented 4 years ago

Thanks for your comment! It actually helped me find the problem: I typically compile two versions of RELION, one with CUDA 8.0 (to be able to run Gctf) and one with CUDA 9.2, which require different compiler versions. Basically, I forgot to switch back to the newer compiler when compiling with CUDA 9.2. Now everything works. Thanks for your help!

biochem-fan commented 4 years ago

@ashkumatov Can you comment on which compiler works and which does not?

nym2834610 commented 4 years ago

We use the CUDA 7.5 compiler. I'll try other compiler versions on Monday and let you know whether the problem is gone without --random_seed 0.

biochem-fan commented 4 years ago

What is the version of GCC invoked by your CUDA compiler (nvcc)?

ashkumatov commented 4 years ago

@biochem-fan Actually, I did more tests and the problem is still there. I load 90 GB of particles into RAM for 2D classification, and at some steps RAM fills up to 180 GB, so it essentially doubles.

biochem-fan commented 4 years ago

I think a doubling is reasonable; we need space to move particles around.

But earlier you said 10 GB of particles consumed "up to 240 GB", which is quite unexpected and something I cannot reproduce locally. Does the memory consumption differ between GCC versions (the CUDA version doesn't matter)? Does it still happen with very few particles, say 100?

When particles (ExpParticle) are sorted, they are copied. The old particles are freed, of course, but memory can become fragmented. Depending on the compiler, malloc and/or std::stable_sort may be less efficient and take more memory and time. By using C++11 move semantics, we can explicitly ask the compiler to move objects instead of copying and deleting them, saving both time and space. That would be better, but it takes a huge effort to implement and test. Unless I can reproduce this problem locally, I cannot investigate further.
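To make the copy-versus-move distinction concrete, here is a minimal sketch with a hypothetical Image type (not RELION's ExpImage): a copy allocates a second pixel buffer and duplicates the data, while a move merely transfers ownership of the existing buffer.

```cpp
// Minimal sketch, not RELION's code: copy vs. move of a buffer-owning type.
#include <utility>
#include <vector>

struct Image {
    std::vector<float> pixels;
};

int main() {
    Image a{std::vector<float>(400 * 400)};

    Image copied = a;            // allocates a new buffer and copies 400*400 floats
    Image moved = std::move(a);  // steals a's buffer; no new allocation, 'a' is left empty

    (void)copied;  // silence unused-variable warnings
    (void)moved;
    return 0;
}
```

Compilers without C++11 support (for example, the old GCC pulled in by the CUDA 8.0 toolchain mentioned earlier in this thread) have no move semantics at all, so every such transfer is a copy.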

biochem-fan commented 4 years ago

@ashkumatov @nym2834610 @kaoweichun In the latest commit 6d9a0da, we improved the memory management. In our local tests the huge spike in memory usage was eliminated, and the time between the end of the M step and the next E step was shortened. Could you please test?

eariascib commented 4 years ago

We also had large peaks of RAM usage that stalled our jobs (RELION 3.1 downloaded on March 5). The new version solved these issues. Thanks!