charmplusplus / charm4py

Parallel Programming with Python and Charm++
https://charm4py.readthedocs.io
Apache License 2.0
288 stars 21 forks source link

Help with an error message when processing huge VTU file #183

Open Patol75 opened 3 years ago

Patol75 commented 3 years ago

Hi, I have a program written in Python which post-processes an output VTU file from a fluid dynamics framework. The program is massively parallel in the sense that it repeats the exact same calculations at different locations in space, distributed on a grid. I have used Charm4Py to parallelise the workload on a large, university-like cluster, especially the pool functionality as it seemed to be the most appropriate to me. Everything is working properly on "regular" VTU files, and I am obtaining results which compare really well to multiprocessing on a single node and Ray on a multi-node environment. However, I am encountering an issue when I provide a "huge" VTU. What I mean by huge is 3 GB, whereas other files are on the order of a few 100 MB. I am pasting below the output generated by Charm4Py and the traceback for the first PE; the computer I have run this on has 64 GB of RAM. I would be really grateful if anyone could help me with this error and explain to me what it means so that I can attempt to fix it. I am more than happy to provide additional information about the program in itself.

Thank you for any help.

Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.311 seconds.
Charm++> Running in non-SMP mode: 17 processes (PEs)
Converse/Charm++ Commit ID: v6.10.0-beta1-17-ga5b6b3259
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm4py> Running Charm4py version 1.0 on Python 3.8.0 (CPython). Using 'cython' interface to access Charm++
Charm++> Running on 1 hosts (1 sockets x 10 cores x 2 PUs = 20-way SMP)
Charm++> cpu topology info is gathered in 0.001 seconds.
Initializing charm.pool with 16 worker PEs. Warning: charm.pool is experimental (API and performance is subject to change)
----------------- Python Stack Traceback PE 2 -----------------
  File "charm4py/charmlib/charmlib_cython.pyx", line 863, in charm4py.charmlib.charmlib_cython.recvGroupMsg
  File "/home/thomas/.local/lib/python3.8/site-packages/charm4py/charm.py", line 253, in recvGroupMsg
    header, args = self.unpackMsg(msg, dcopy_start, obj)
  File "charm4py/charmlib/charmlib_cython.pyx", line 739, in charm4py.charmlib.charmlib_cython.CharmLib.unpackMsg
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: UnpicklingError: pickle data was truncated
[2] Stack Traceback:
  [2:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x60  [0x7f78509ea920]
  [2:1] +0x260a51  [0x7f78509eaa51]
  [2:2] +0x18e0e  [0x7f7850d0ae0e]
  [2:3] _PyObject_MakeTpCall+0x28f  [0x5fff6f]
  [2:4]   [0x4ffbbf]
  [2:5] _PyEval_EvalFrameDefault+0x53f0  [0x57dbb0]
  [2:6] _PyFunction_Vectorcall+0x19c  [0x602b2c]
  [2:7] _PyEval_EvalFrameDefault+0x88d  [0x57904d]
  [2:8] +0x11cd0  [0x7f7850d03cd0]
  [2:9] +0x13474  [0x7f7850d05474]
  [2:10] +0x16857  [0x7f7850d08857]
  [2:11] +0x40601  [0x7f7850d32601]
  [2:12] _ZN8GroupExt13__entryMethodEPvS0_+0xb5  [0x7f7850895445]
  [2:13] CkDeliverMessageFree+0x21  [0x7f7850898761]
  [2:14] _Z15_processHandlerPvP11CkCoreState+0x88a  [0x7f78508a18ea]
  [2:15] CsdScheduleForever+0x77  [0x7f78509aa717]
  [2:16] CsdScheduler+0x2d  [0x7f78509aa97d]
  [2:17] ConverseInit+0x64a  [0x7f78509f046a]
  [2:18] StartCharmExt+0x255  [0x7f785089edd5]
  [2:19] +0x3d7be  [0x7f7850d2f7be]
  [2:20] _PyObject_MakeTpCall+0x28f  [0x5fff6f]
  [2:21]   [0x4ffbbf]
  [2:22] _PyEval_EvalFrameDefault+0x53f0  [0x57dbb0]
  [2:23] _PyEval_EvalCodeWithName+0x25c  [0x5765ec]
  [2:24] _PyFunction_Vectorcall+0x442  [0x602dd2]
  [2:25] _PyEval_EvalFrameDefault+0x88d  [0x57904d]
  [2:26] _PyEval_EvalCodeWithName+0x25c  [0x5765ec]
  [2:27]   [0x662c2e]
  [2:28] PyRun_FileExFlags+0x97  [0x662d07]
  [2:29] PyRun_SimpleFileExFlags+0x17f  [0x663a1f]
  [2:30] Py_RunMain+0x20e  [0x687dbe]
  [2:31] Py_BytesMain+0x29  [0x688149]
  [2:32] __libc_start_main+0xe7  [0x7f78efe25bf7]
  [2:33] _start+0x2a  [0x607daa]
ZwFink commented 3 years ago

Hello, Thank you for reaching out. Did you install Charm4Py from pip or did you build Charm++ to use with Charm4Py manually? Also, are you distributing the entire 3GB data to each pool worker PE? Also, are you using charm.pool.map or some other functionality of the pool?

Patol75 commented 3 years ago

Thanks for having a look @ZwFink.

The error message provided results from a pip installation of Charm4Py. I have just tried with an MPI build of Charm++ (./build charm4py mpi-linux-x86_64 -j8 --with-production) and it yields a very similar error message:

Running on 17 processors:  /usr/bin/python3.8 script.py HUGE.vtu --charm 
charmrun>  /usr/bin/setarch x86_64 -R  mpirun -np 17  /usr/bin/python3.8 script.py HUGE.vtu --charm 
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: -1 (desired: 0)
Charm++> Running in non-SMP mode: 17 processes (PEs)
Converse/Charm++ Commit ID: v6.11.0-beta1-29-gd35885331
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm4py> Running Charm4py version 1.0 on Python 3.8.0 (CPython). Using 'cython' interface to access Charm++
Charm++> Running on 1 hosts (1 sockets x 10 cores x 2 PUs = 20-way SMP)
Charm++> cpu topology info is gathered in 0.009 seconds.
Initializing charm.pool with 16 worker PEs. Warning: charm.pool is experimental (API and performance is subject to change)
----------------- Python Stack Traceback PE 1 -----------------
  File "charm4py/charmlib/charmlib_cython.pyx", line 863, in charm4py.charmlib.charmlib_cython.recvGroupMsg
  File "/home/thomas/.local/lib/python3.8/site-packages/charm4py/charm.py", line 253, in recvGroupMsg
    header, args = self.unpackMsg(msg, dcopy_start, obj)
  File "charm4py/charmlib/charmlib_cython.pyx", line 739, in charm4py.charmlib.charmlib_cython.CharmLib.unpackMsg
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: UnpicklingError: pickle data was truncated
[1] Stack Traceback:
  [1:0] libcharm.so 0x7fff576fb6ec CmiAbortHelper(char const*, char const*, char const*, int, int)
  [1:1] libcharm.so 0x7fff576fb801 
  [1:2] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a25dfe 
  [1:3] python3.8 0x5fff6f _PyObject_MakeTpCall
  [1:4] python3.8 0x4ffbbf 
  [1:5] python3.8 0x57dbb0 _PyEval_EvalFrameDefault
  [1:6] python3.8 0x602b2c _PyFunction_Vectorcall
  [1:7] python3.8 0x57904d _PyEval_EvalFrameDefault
  [1:8] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a1ecc0 
  [1:9] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a20464 
  [1:10] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a23847 
  [1:11] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a4d5f1 
  [1:12] libcharm.so 0x7fff576ba77b GroupExt::__entryMethod(void*, void*)
  [1:13] libcharm.so 0x7fff576256d4 CkDeliverMessageFree
  [1:14] libcharm.so 0x7fff5762e6d6 _processHandler(void*, CkCoreState*)
  [1:15] libcharm.so 0x7fff576bf491 CsdScheduleForever
  [1:16] libcharm.so 0x7fff576bf6fd CsdScheduler
  [1:17] libcharm.so 0x7fff576fdfaa ConverseInit
  [1:18] libcharm.so 0x7fff5762b91c StartCharmExt
  [1:19] charmlib_cython.cpython-38-x86_64-linux-gnu.so 0x7fff57a4a7ae 
  [1:20] python3.8 0x5fff6f _PyObject_MakeTpCall
  [1:21] python3.8 0x4ffbbf 
  [1:22] python3.8 0x57dbb0 _PyEval_EvalFrameDefault
  [1:23] python3.8 0x5765ec _PyEval_EvalCodeWithName
  [1:24] python3.8 0x602dd2 _PyFunction_Vectorcall
  [1:25] python3.8 0x57904d _PyEval_EvalFrameDefault
  [1:26] python3.8 0x5765ec _PyEval_EvalCodeWithName
  [1:27] python3.8 0x662c2e 
  [1:28] python3.8 0x662d07 PyRun_FileExFlags
  [1:29] python3.8 0x663a1f PyRun_SimpleFileExFlags
  [1:30] python3.8 0x687dbe Py_RunMain
  [1:31] python3.8 0x688149 Py_BytesMain
  [1:32] libc.so.6 0x7ffff7a03bf7 __libc_start_main
  [1:33] python3.8 0x607daa _start

Regarding data distribution, I am not fully sure, and it is highly possible I am not doing something ideal. Data from the VTU is read in the main function (the one given to charm.start()). It is then used to create 3-D Scipy Interpolator objects using, for example, NearestNDInterpolator. These objects are then passed to each function execution through charm.pool.

I am using the multi_future argument of map_async, making sure 60,000 futures at most are created, and I provide a function foo with a partial construct to pass a dictionary that holds variables which will be accessed by each process. These variables are never modified, they are only read or, in the case of the Interpolator objects mentioned above, called. I paste below the relevant snippet.

    if inputArgs.charm:  # Charm4Py
        nBatch = np.ceil(np.sum(varDict['indArr'] == 0) / 6e4).astype(int)
        for batch in range(nBatch):
            nodes2do = np.asarray(varDict['indArr'] == 0).nonzero()
            futures = charm.pool.map_async(
                partial(foo, dictGlobals=dictGlobals),
                list(zip(*[nodes[:60_000] for nodes in nodes2do])),
                multi_future=True)
            for future in charm.iwait(futures):
                output = future.get()
                for i, var in enumerate(outVar):
                    varDict[var][output[0]] = output[i + 1]
                varDict['indArr'][output[0][::-1]] = 1
                varDict['nodesComplete'] += 1