Closed MilkaRadojevic closed 4 weeks ago
preloadQuantities is a const char* and should not be freed by loadHDF5. It's also handled as expected by _pyraveio_load, so that should not cause any problems as far as I can see. The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over, and that code also looks like it handles memory properly.
The filename should not be released by loadHDF either; it's released when the RaveIO_t instance is released, which also seems to be handled as expected by the python wrapper.
I have tried to force a segmentation fault with the file you sent but I can't reproduce it. I would appreciate it if you could provide me with: 1) A simple program using rave & the file you are having problems with so that I can reproduce the segmentation fault. 2) A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.
Thanks Anders for this clarification.
First, some questions related to 'The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over':
- What happens when the reading of one quantity fails? A given list of quantities is then broken.
- As a result, the code should exit the loop, release the memory and close the file, right?
- Somehow I get the impression from this experience that it gets blocked somewhere.
Then, regarding your two requests:
- A simple program using rave & the file you are having problems with so that I can reproduce the segmentation fault.
I use a set of BALTRAD libraries with rave and compile them into the executable odc_toolbox. I can share the compiled version with you, or the toolbox itself if you want to try compiling it yourself.
- A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.
Below is the trace from /var/log/messages-20240818 as provided to me by our IT team. This is when the system first started to freeze (execution date-time: 20240816111500). Further down you will also find a snippet from the corresponding production log. If you look at the timing you will see that the log points to the moment when rave_io.c:486 fails.
Aug 16 11:17:01 sotrtm13-sine systemd[1]: Started Session c33896 of user soprajob.
Aug 16 11:17:02 sotrtm13-sine systemd[1]: session-c33896.scope: Succeeded.
Aug 16 11:17:08 sotrtm13-sine kernel: python3[77990]: segfault at 28ad000 ip 0000149c6eb43ce3 sp 00007ffeb9eff4e8 error 4 in libc-2.28.so[149c6ea74000+1bc000]
Aug 16 11:17:08 sotrtm13-sine kernel: Code: 17 e0 c5 f8 77 c3 48 3b 15 da 48 2f 00 0f 83 25 01 00 00 48 39 f7 72 0f 74 12 4c 8d 0c 16 4c 39 cf 0f 82 c5 01 00 00 48 89 d1 <f3> a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Started Process Core Dump (PID 78001/UID 0).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Resource limits disable core dumping for process 77990 (python3).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Process 77990 (python3) of user 65374 dumped core.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: systemd-coredump@0-78001-0.service: Succeeded.
Aug 16 11:17:12 sotrtm13-sine systemd[1]: Stopping User Manager for UID 65374...
A snippet from the production log:
bin/odc_toolbox -i /disque_local/utmp/sms/E/tache_19682626/toolbox_input -o /disque_local/utmp/sms/E/tache_19682626/composite_input --qc hac-increment,ropo,beamb,satfilter,hac-filter,qi-total --procs 32 --verbose
2024/08/16 11:17:08 : ERROR : Preloading of file failed: /disque_local/utmp/sms/E/tache_19682626/toolbox_input/T_PAGZ54_C_LYBM_20240816111006.hdf (rave_io.c:486)
Thanks Anders for this clarification.
First, some questions related to 'The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over':
- What happens when the reading of one quantity fails? A given list of quantities is then broken.
It depends, but in this situation the actual reading of data is performed inside hlhdf. It will load all nodes until it fails to read some data, and in that case it will return with an error. Since the preload fails, the lazy loader will be released and all data associated with the HDF5 file should be released.
- As a result, the code should exit the loop, release the memory and close the file, right?
Yes. The file will be closed. The filename is, however, associated with the RaveIO instance and will not be released until the instance is destroyed: either by the Python gc when a = _raveio.new(), a.filename=..., a.load() is used, or directly if _raveio.open("...") fails (a small sketch of these two patterns follows below).
- Somehow I get the impression from this experience that it gets blocked somewhere.
Hard to say what it could be; maybe something with the filesystem or some global lock in Python!?
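For reference on the filename lifetime discussed above, here is a minimal sketch of the two usage patterns mentioned (the calls follow the examples quoted in the answer above; the file name is just the one from this thread, and the bare except mirrors the test program quoted later):

import _raveio

# Pattern 1: explicit instance; the filename string is kept until the
# Python gc destroys the RaveIO instance a.
a = _raveio.new()
a.filename = "T_PAGZ54_C_LYBM_20240816111006.hdf"
a.load()

# Pattern 2: one-shot open; the temporary instance (and the filename it
# holds) is released directly if open() fails.
try:
    b = _raveio.open("T_PAGZ54_C_LYBM_20240816111006.hdf")
except:
    pass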
- A simple program using rave & the file you are having problems with so that I can reproduce the segmentation fault.
I use a set of BALTRAD libraries with rave and compile them into the executable odc_toolbox. I can share the compiled version with you, or the toolbox itself if you want to try compiling it yourself.
- A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.
Below is the trace from /var/log/messages-20240818 as provided to me by our IT team. This is when the system first started to freeze (execution date-time: 20240816111500). Further down you will also find a snippet from the corresponding production log. If you look at the timing you will see that the log points to the moment when rave_io.c:486 fails.
Aug 16 11:17:01 sotrtm13-sine systemd[1]: Started Session c33896 of user soprajob.
Aug 16 11:17:02 sotrtm13-sine systemd[1]: session-c33896.scope: Succeeded.
Aug 16 11:17:08 sotrtm13-sine kernel: python3[77990]: segfault at 28ad000 ip 0000149c6eb43ce3 sp 00007ffeb9eff4e8 error 4 in libc-2.28.so[149c6ea74000+1bc000]
Aug 16 11:17:08 sotrtm13-sine kernel: Code: 17 e0 c5 f8 77 c3 48 3b 15 da 48 2f 00 0f 83 25 01 00 00 48 39 f7 72 0f 74 12 4c 8d 0c 16 4c 39 cf 0f 82 c5 01 00 00 48 89 d1 <f3> a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Started Process Core Dump (PID 78001/UID 0).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Resource limits disable core dumping for process 77990 (python3).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Process 77990 (python3) of user 65374 dumped core.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: systemd-coredump@0-78001-0.service: Succeeded.
Aug 16 11:17:12 sotrtm13-sine systemd[1]: Stopping User Manager for UID 65374...
Yes, it looks like it's a python3-binary that is crashing. The core dump is available using, for example, the coredumpctl command. After the core dump has been retrieved it should be possible to run "gdb python3 core." and then "bt" to get the backtrace (or something similar).
A snippet from the production log:
bin/odc_toolbox -i /disque_local/utmp/sms/E/tache_19682626/toolbox_input -o /disque_local/utmp/sms/E/tache_19682626/composite_input --qc hac-increment,ropo,beamb,satfilter,hac-filter,qi-total --procs 32 --verbose
2024/08/16 11:17:08 : ERROR : Preloading of file failed: /disque_local/utmp/sms/E/tache_19682626/toolbox_input/T_PAGZ54_C_LYBM_20240816111006.hdf (rave_io.c:486)
Have you got a running odc_toolbox on the ODE or similar that I can access?
It depends but in this situation, the actual reading of data is performed inside hlhdf.
Indicating that something might have failed inside hlhdf.
it looks like it's a python3-binary that is crashing.
I shall report this to our IT.
I ran tests on my dev server (supposed to be a copy of the production nodes) but I cannot reproduce the segfault; rave_pgf.log shows that the file is deleted.
[]$ date; /home/mdso/radojevicm/dso_cirrus/toolbox3/bin/odc_toolbox -i input_lybm -o output_lybm --qc 'hac-increment,ropo,beamb,satfilter,hac-filter,qi-total' --verbose
ven. août 23 09:51:13 GMT 2024
2024/08/23 09:51:13 : ERROR : Preloading of file failed: input_lybm/T_PAGZ54_C_LYBM_20240821081006.hdf (rave_io.c:486)
Objects created: 1509
Objects deleted: 1509
Objects pending: 0
[]$ more rave_pgf.log
2024-08-23 09:51:13,758 ERROR ForkPoolWorker-1: T_PAGZ54_C_LYBM_20240821081006.hdf: Failed to open file - Deleting file.
2024-08-23 09:51:13,766 INFO No statistics
Have you got a running odc_toolbox on the ODE or similar that I can access?
We don't use ODE; it does not serve us for development purposes.
The source of CIRRUS is in https://github.com/opera-radar/opera-cirrus-operational
The execution of odc_toolbox is in https://github.com/opera-radar/opera-cirrus-operational/blob/master/scripts/lance_cirrus.sh#L228
Thank you Anders.
It depends but in this situation, the actual reading of data is performed inside hlhdf.
Indicating that something might have failed inside hlhdf.
Yes, it can't read the data I mentioned earlier, which h5dump also fails to read. However, there is nothing in hlhdf indicating that the problem you are experiencing originates from hlhdf, and without a way for me to reproduce it, it's almost impossible to find the problem.
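For a quick independent check of that dataset, here is a minimal sketch using h5py (h5py is not part of rave or hlhdf, and the exact path /dataset1/data1/data is an assumption based on the ODIM layout referred to in this issue):

import h5py

# Try to read the payload that hlhdf and h5dump reportedly fail on.
with h5py.File("T_PAGZ54_C_LYBM_20240821081006.hdf", "r") as f:
    try:
        arr = f["/dataset1/data1/data"][...]
        print("read OK, shape:", arr.shape)
    except Exception as e:
        print("reading /dataset1/data1/data failed:", e)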
it looks like it's a python3-binary that is crashing.
I shall report this to our IT. I ran tests on my dev server (supposed to be a copy of the production nodes) but I cannot reproduce the segfault; rave_pgf.log shows that the file is deleted.
[]$ date; /home/mdso/radojevicm/dso_cirrus/toolbox3/bin/odc_toolbox -i input_lybm -o output_lybm --qc 'hac-increment,ropo,beamb,satfilter,hac-filter,qi-total' --verbose
ven. août 23 09:51:13 GMT 2024
2024/08/23 09:51:13 : ERROR : Preloading of file failed: input_lybm/T_PAGZ54_C_LYBM_20240821081006.hdf (rave_io.c:486)
Objects created: 1509
Objects deleted: 1509
Objects pending: 0
[]$ more rave_pgf.log
2024-08-23 09:51:13,758 ERROR ForkPoolWorker-1: T_PAGZ54_C_LYBM_20240821081006.hdf: Failed to open file - Deleting file.
2024-08-23 09:51:13,766 INFO No statistics
Just a speculation, since you mentioned that you are losing memory: could there be something related to multiprocessing causing a problem with hanging processes? This is something we have experienced more than once when a child process dies unexpectedly, leaving the complete tiling waiting forever. I believe I mentioned this as a possible problem a few months ago.
Have you got a running odc_toolbox on the ODE or similar that I can access?
We don't use ODE; it does not serve us for development purposes. The source of CIRRUS is in https://github.com/opera-radar/opera-cirrus-operational The execution of odc_toolbox is in https://github.com/opera-radar/opera-cirrus-operational/blob/master/scripts/lance_cirrus.sh#L228
Thank you Anders.
I wrote a simple test program that only tries to open the file in a loop, to see if there is any kind of memory leak. top didn't show any indication that memory was increasing, which would have been the case if there was an obvious memory problem related to erroneous reads in hlhdf, raveio or the lazy loader. The code didn't crash either, neither during execution nor when the program exited.
import _rave, _raveio

if __name__ == "__main__":
    _rave.setDebugLevel(_rave.Debug_RAVE_SPEWDEBUG)
    for i in range(0, 10000):
        try:
            a = _raveio.open("T_PAGZ54_C_LYBM_20240821081006.hdf").object
        except:
            pass
Thank you very much for all the tests and for sharing your ideas.
Could there be something related to multiprocessing causing a problem with hanging processes? This is something we have experienced more than once when a child process dies unexpectedly, leaving the complete tiling waiting forever.
It's possible.
I noticed before that the function that handles multiprocessing in rave is quite basic and the same as in the Python 2 era. Do you think it would be worth checking that out and making it more up to date, with more try-except conditions?
Yes, we made some modifications to the tiled compositing to prevent problems with crashing processes. It should be possible to make a similar modification to the code used by odc_toolbox (odc_polarQC.py); see the sketch below.
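As an illustration only (this is not the actual odc_polarQC.py or tiled-compositing code; process_file and the timeout value are hypothetical), a minimal sketch of a defensive pattern where each worker result is collected with a timeout, so a child process that dies or hangs cannot block the whole batch forever:

import multiprocessing

def process_file(path):
    # hypothetical placeholder for the per-file QC chain
    return path

if __name__ == "__main__":
    files = ["a.hdf", "b.hdf"]
    with multiprocessing.Pool(processes=4) as pool:
        results = [(f, pool.apply_async(process_file, (f,))) for f in files]
        for f, r in results:
            try:
                out = r.get(timeout=600)  # give up on one file after 10 minutes
            except multiprocessing.TimeoutError:
                print("worker for %s timed out, skipping" % f)
            except Exception as e:
                print("worker for %s failed: %s" % (f, e))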
Greetings @MilkaRadojevic,
Based on the information provided, our ability to diagnose the issue is limited due to the absence of a core dump and a detailed stack trace. Additionally, since this issue involves a Python process encountering a segmentation fault, please ensure that debug information is available. This will allow us to see the relevant names and symbols in the core dump.
Here is what we know so far: the segfault occurred at address 28ad000, triggered by the instruction pointer (IP) at 0000149c6eb43ce3, with the last stack pointer address being 00007ffeb9eff4e8.
To assist you further, please enable core dumps on your system and gather a full Bill of Materials (BoM) for the software involved.
Optional: Ensure that Python and any relevant C extensions are compiled with debugging symbols. This is typically achieved by installing the -debuginfo packages for the relevant libraries and binaries. For example, on RPM-based systems, you might need to install the debug information for Python:
sudo dnf install python3-debuginfo
If you have custom-built Python extensions or other relevant C libraries, ensure they are compiled with the -g option to include debugging information.
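For custom-built extensions that are installed through setuptools, one possible way to pass -g is sketched below (this is not how rave itself is built, which uses its own makefiles; the module and file names are hypothetical):

from setuptools import Extension, setup

# Build the hypothetical extension with debug symbols and no optimisation
# so that names and line numbers survive into the core dump.
setup(
    name="myext",
    ext_modules=[Extension("myext", ["myext.c"],
                           extra_compile_args=["-g", "-O0"])],
)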
Core dump generation is controlled by the core resource limit, which determines the maximum size of a core dump file.
- Check the current limit with ulimit -c.
- Add ulimit -c unlimited to the script you are executing to ensure core dumps are enabled during the execution, or run ulimit -c unlimited in the terminal to apply this setting for your current session.
- Edit the /etc/security/limits.conf file to set the core dump size persistently.
- Edit the /etc/sysctl.conf file to specify where core dumps should be saved:
kernel.core_pattern=/tmp/core_%e.%p.%h.%t
kernel.core_uses_pid = 1
kernel.core_pattern specifies the location and format of the core dump file; kernel.core_uses_pid appends the process ID to the core dump file name. Apply the changes with sudo sysctl -p.
- You can test core dump generation with kill -SIGSEGV $$ (a Python-level variant is sketched below).
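As a possible complement to the shell-level steps above (not part of the original recommendation): from inside a Python process one can raise the soft core-file limit, enable faulthandler to get a Python-level traceback on SIGSEGV, and deliberately crash the process to verify that a core file actually appears for python3. A minimal sketch:

import faulthandler
import os
import resource
import signal

# Print Python tracebacks to stderr on fatal signals such as SIGSEGV.
faulthandler.enable()

# Raise the soft core-file limit as far as the hard limit allows;
# a non-root process cannot go beyond the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# Deliberately crash this process to check that a core file shows up
# under the configured kernel.core_pattern (equivalent to kill -SIGSEGV $$).
os.kill(os.getpid(), signal.SIGSEGV)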
systemd core dump service settings: the systemd-coredump service handles core dumps on systemd-based systems. You can check its status with sudo systemctl status systemd-coredump. Alternatively, edit /etc/systemd/coredump.conf for custom settings.
We hope this information helps in generating a core dump. If the issue persists, please ensure that no additional security mechanisms, such as SELinux or AppArmor, are blocking core dumps.
For more detailed guidance, you can refer to Red Hat's documentation on core dumps: https://access.redhat.com/solutions/4896
To help us diagnose the issue more effectively, please provide a comprehensive list of the software and libraries involved, including their versions.
This should include:
- Operating System: Specify the exact version (cat /etc/os-release).
- Python version: Run python --version to obtain this information.
- RAVE version: Please specify the exact version of the RAVE library being used.
- Other dependencies: List any additional libraries or tools involved. Installed packages can be listed with dnf, yum, or rpm (for example rpm -qa | grep <package_name>) and Python packages with pip freeze (a sketch for gathering part of this automatically follows below).
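To gather part of this information from within the same interpreter that runs odc_toolbox (so the report matches the failing process), here is a minimal sketch; note that pip will only see pip-managed packages, not the hand-built BALTRAD components:

import platform
import subprocess
import sys

# OS, interpreter and pip-managed package versions.
print(platform.platform())
print(sys.version)
print(subprocess.run([sys.executable, "-m", "pip", "freeze"],
                     capture_output=True, text=True).stdout)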
Please ensure that the information is comprehensive and corresponds to the environment where the segmentation fault occurs. Double-check that the process (PID) experiencing the segmentation fault is running the expected code and that all relevant libraries are correctly identified.
Please send this information via email or another secure method of communication if it is considered sensitive.
Do not post the core dump publicly on GitHub or any other public forum. Instead, please email the core dump directly to us or use another secure method of communication.
Yours truly,
I.E. - SMHI.
Thank you @IE-smhi for such a detailed analysis of the presented problem.
I will ask our IT team responsible for the operational environment at Météo-France to provide us with a core dump and a detailed stack trace; if still available, it will be sent to Anders via email. Meanwhile, I will forward your analysis and recommendations to them, since we developers have only limited access to the production platform.
The answers I can provide to you:
Operating System: Specify the exact version (cat /etc/os-release).
Rocky Linux 8.5
Python version: Run python --version to obtain this information.
Python 3.9
RAVE version: Please specify the exact version of the RAVE library being used.
Tag 38 + python script from Anders (python39.patch) to allow for compilation of rave at Météo-France
Other dependencies: List any additional libraries or tools involved.
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
Best regards, Milka
Closing issue until further information is available.
In the case when rave fails to read data from /dataset1/data1, as given in the example below for an HDF5-ODIM file, the error leads to a Python segfault and a disabled core dump. As a result, memory resources are exhausted and odc_baltrad freezes (does not complete).
It seems that the condition if (!LazyNodeListReader_preload(lazyReader)) is not handled well. The goto does not release memory for preloadQuantities nor for filename.
I did not look at rave_io.c in more detail, but I have the impression that the file referenced by filename is opened, and I do not see where it is closed when the reading of data is not successful.
Thank you.