devitocodes / devito

DSL and compiler framework for automated finite-differences and stencil computation
http://www.devitoproject.org
MIT License

Temporary increase in memory when executing pyrevolve.Operator leads to memory error #1374

Open ofmla opened 4 years ago

ofmla commented 4 years ago

A Python script running a TTI RTM for a single shot with pyrevolve (https://sesibahia-my.sharepoint.com/:u:/g/personal/oscar_ladino_fieb_org_br/EWpX_VT4U3lCqdAArLVaGKABe9oysSb0KKRDlqIyL1XpwA?e=TgKdL4) runs fine for a while before crashing with the following error:

  File "/home/oscarm/.conda/envs/devito-v4.2.2/lib/python3.8/site-packages/pytools/prefork.py", line 49, in call_capture_output
    popen = Popen(cmdline, cwd=cwd, stdin=PIPE, stdout=PIPE,
  File "/home/oscarm/.conda/envs/devito-v4.2.2/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/oscarm/.conda/envs/devito-v4.2.2/lib/python3.8/subprocess.py", line 1637, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
OSError: [Errno 12] Cannot allocate memory

It seems that forking to call the compiler when applying the pyrevolve.Operator temporarily doubles the memory footprint of the parent process. If the parent process is using more than half of the system memory, the fork fails with a memory error. A workaround is to precompile the code before instantiating the CheckpointOperator objects, e.g.:

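cp = DevitoCheckpoint([u, v])
op_fwd = solver.op_fwd(save=False)
# Reading .cfunction forces JIT compilation (and hence the compiler fork)
# to happen here, up front, rather than when pyrevolve first applies the operator
op_fwd.cfunction
op_imaging.cfunction
# The checkpointing wrappers are created only once both operators are compiled
wrap_fw = CheckpointOperator(op_fwd, src=geometry.src, u=u, v=v, ... )
wrap_rev = CheckpointOperator(op_imaging, u=u, v=v, ... )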
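The same workaround can be wrapped in a small helper. This is only a sketch: `precompile` is a hypothetical name, and it relies on the assumption, taken from the snippet above, that reading an operator's `cfunction` attribute triggers JIT compilation.

```python
def precompile(*operators):
    """Hypothetical helper: force each operator to be JIT-compiled now.

    Assumes (as in the workaround above) that reading `op.cfunction`
    compiles the generated code if needed, which is when the compiler
    process gets forked.
    """
    for op in operators:
        _ = op.cfunction  # side effect only: triggers compilation and loading
    return operators
```

Calling `precompile(op_fwd, op_imaging)` before creating the `CheckpointOperator` wrappers moves every compiler fork to that earlier point in the script.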

However, there may be another solution, as suggested in this Slack thread: https://devitocodes.slack.com/archives/C7JMLMSG0/p1593739206410500?thread_ts=1593727821.408500&cid=C7JMLMSG0

navjotk commented 4 years ago

I think the current evidence points to the fork before compilation as the culprit.

FabioLuporini commented 3 years ago

@mloubout @tjb900 ever noticed anything like this?

tjb900 commented 3 years ago

Yeah, definitely.

It's worth noting that physical memory usage doesn't actually increase, as after the fork both processes refer to the same physical pages (copy-on-write). However, if anything in place limits the total virtual (committed) memory of a group of processes, or of all processes on the system, the fork can fail (e.g. strict overcommit accounting, controlled by /proc/sys/vm/overcommit_memory and /proc/sys/vm/overcommit_ratio).
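To check whether strict overcommit accounting is what bites, one can compare the remaining commit headroom with the size of the parent process. The sketch below is a rough model, not a kernel-accurate one: it uses the documented strict-mode formula CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100 (ignoring huge pages and vm.overcommit_kbytes), and assumes a fork is charged roughly the parent's private writable memory a second time. All function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def strict_commit_limit_kb(meminfo, overcommit_ratio):
    """CommitLimit under vm.overcommit_memory=2, ignoring huge pages and
    vm.overcommit_kbytes: swap + RAM * ratio / 100."""
    return meminfo["SwapTotal"] + meminfo["MemTotal"] * overcommit_ratio // 100


def fork_would_fit(meminfo, parent_private_kb):
    """Rough check (only meaningful in strict mode 2): the child's private
    writable memory is charged again, so it must fit in the headroom."""
    return meminfo["Committed_AS"] + parent_private_kb <= meminfo["CommitLimit"]


def read_live():
    """Linux only: return (meminfo, overcommit_memory mode, overcommit_ratio)."""
    with open("/proc/meminfo") as f:
        meminfo = parse_meminfo(f.read())
    with open("/proc/sys/vm/overcommit_memory") as f:
        mode = int(f.read())
    with open("/proc/sys/vm/overcommit_ratio") as f:
        ratio = int(f.read())
    return meminfo, mode, ratio
```

A mode of 2 from `read_live()` means strict accounting is active; in that case `CommitLimit` in /proc/meminfo is the limit the kernel actually enforces.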

Possible workarounds are:

Finally, in answering this I noticed that codepy uses pytools.prefork to actually spawn the compiler, and this module is explicitly designed to avoid some of the above issues. It does so by forking a small "fork server" process early, before e.g. MPI is initialised or large memory allocations occur; the compiler processes are then forked from that tiny process rather than from the application process itself.

See https://github.com/inducer/pytools/blob/main/pytools/prefork.py

It looks to me like the intention of this module is that calling pytools.prefork.enable_prefork() very early in your application might sidestep some of the above issues quite neatly.
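The fork-server idea can be illustrated with the standard library alone. This is not pytools' implementation, and `TinyForkServer` is a hypothetical name: a helper process is forked while the parent is still small, and later command lines are executed by that helper, so the (possibly large) parent never forks again. It uses the POSIX-only `fork` start method of `multiprocessing`.

```python
import multiprocessing as mp
import subprocess


def _server_loop(conn):
    """Runs in the small helper process: execute each command line sent by
    the parent and send back (returncode, stdout, stderr)."""
    while True:
        cmdline = conn.recv()
        if cmdline is None:  # shutdown sentinel
            break
        try:
            proc = subprocess.run(cmdline, capture_output=True)
            conn.send((proc.returncode, proc.stdout, proc.stderr))
        except OSError as exc:  # e.g. executable not found
            conn.send(exc)
    conn.close()


class TinyForkServer:
    """Create this early, while the process is still small; subprocesses are
    later forked from the helper instead of from the application itself."""

    def __init__(self):
        ctx = mp.get_context("fork")
        self._conn, child_conn = ctx.Pipe()
        self._proc = ctx.Process(target=_server_loop, args=(child_conn,), daemon=True)
        self._proc.start()
        child_conn.close()

    def call_capture_output(self, cmdline):
        self._conn.send(list(cmdline))
        result = self._conn.recv()
        if isinstance(result, Exception):
            raise result
        return result

    def close(self):
        self._conn.send(None)
        self._proc.join()
```

In practice pytools already provides this: after `pytools.prefork.enable_prefork()`, calls to `pytools.prefork.call_capture_output(cmdline)` (the function visible in the traceback above) go through its fork server.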

FabioLuporini commented 3 years ago

That's a really nice, comprehensive answer. Thanks a lot @tjb900.

ccuetom commented 3 years ago

We have been seeing the same memory errors in Stride when compiling certain operators. After doing some tests, it seems that calling pytools.prefork.enable_prefork() early solves the compilation problem.

However, the problem persists if the compiler or the MPI configuration is changed while memory use is high. That is because in these cases Devito uses subprocess.check_output to sniff the available compilers, which calls subprocess.Popen directly instead of going through pytools.

tjb900 commented 3 years ago

Yeah, I've been running into this lately as well. It seems like it might be a good idea to switch the compiler/MPI/GPU sniffs to use pytools.
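A minimal sketch of that suggestion, assuming nothing about Devito's internals: `sniff` is a hypothetical helper that runs a command line through `pytools.prefork.call_capture_output` when pytools is installed (so it benefits from `enable_prefork()` if that was called), falls back to `subprocess` otherwise, and memoizes the result so repeated sniffs do not fork again.

```python
import functools
import subprocess

try:
    # Routed through the fork server once pytools.prefork.enable_prefork() was called
    from pytools.prefork import call_capture_output
except ImportError:
    def call_capture_output(cmdline, cwd=None, error_on_nonzero=True):
        """Fallback with the same return shape: (returncode, stdout, stderr)."""
        proc = subprocess.run(cmdline, cwd=cwd, capture_output=True)
        if error_on_nonzero and proc.returncode:
            raise subprocess.CalledProcessError(
                proc.returncode, cmdline, proc.stdout, proc.stderr)
        return proc.returncode, proc.stdout, proc.stderr


@functools.lru_cache(maxsize=None)
def sniff(cmdline):
    """Hypothetical helper: run `cmdline` (a tuple, so it is hashable)
    once and cache its decoded, stripped stdout."""
    _, stdout, _ = call_capture_output(list(cmdline))
    return stdout.decode().strip()
```

For example, `sniff(("gcc", "--version"))`. Caching only helps after the first call: the first sniff still spawns a process, so it is best triggered early, before memory use grows.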

Note that, along the same lines, there is also an issue when the allocators are initialised: ctypes.util.find_library uses subprocess to do some pretty hacky stuff (on Linux it runs tools such as ldconfig or gcc to locate libraries). That will be harder to fix, but I guess applications that run into this problem can manually initialise the allocators early.
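One way to do that early initialisation, sketched under assumptions: the helper names are hypothetical, the library names passed in are placeholders (the thread does not say which libraries the allocators look up), and allocator code would have to call `find_library_cached` instead of `ctypes.util.find_library` for the cache to help.

```python
import ctypes.util
import functools


@functools.lru_cache(maxsize=None)
def find_library_cached(name):
    """Memoized wrapper: ctypes.util.find_library may spawn subprocesses
    (e.g. ldconfig on Linux), so each name is looked up at most once."""
    return ctypes.util.find_library(name)


def warm_up_libraries(names):
    """Hypothetical start-up hook: resolve every library name up front,
    while the process is still small, and return {name: path or None}."""
    return {name: find_library_cached(name) for name in names}
```

Calling `warm_up_libraries([...])` at start-up means later lookups hit the cache instead of spawning a subprocess from a large process.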

FabioLuporini commented 1 year ago

These are memoized now; is it still an issue?