baltrad / rave

Radar product generation framework and toolbox. Injector using ODIM_H5 files.

Preloading of file failed leads to segfault and disabled core dump #55

Closed: MilkaRadojevic closed this issue 4 weeks ago

MilkaRadojevic commented 3 months ago

When rave fails to read data from /dataset1/data1, as in the example below for an HDF5-ODIM file, the error leads to a Python segfault and a disabled core dump. As a result, memory resources are exhausted and odc_baltrad freezes (does not complete).

ERROR : Preloading of file failed: /data2/cirrus_20240821081500/toolbox_input/T_PAGZ54_C_LYBM_20240821081006.hdf (rave_io.c:486)

It seems that the condition if (!LazyNodeListReader_preload(lazyReader)) is not handled well. The goto statement does not release the memory for preloadQuantities nor for filename.

I did not look into rave_io.c in more detail, but I have the impression that filename is opened, and I do not see where it is closed when the reading of data is not successful.

Thank you.

andershenja commented 3 months ago

preloadQuantities is a const char* and should not be freed by loadHDF5. It's also handled as expected by _pyraveio_load, so that should not cause any problems as far as I can see. The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over, and that code also looks to handle memory properly.

The filename should not be released by loadHDF; it's released when the RaveIO_t instance is released, which also seems to be handled as expected by the python wrapper.

I have tried to force a segmentation fault with the file you sent but I can't reproduce it. I would appreciate it if you could provide me with:

  1. A simple program using rave & the file you are having problems with, so that I can reproduce the segmentation fault.
  2. A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.

MilkaRadojevic commented 2 months ago

Thanks Anders for this clarification.

First, some questions related to 'The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over':

  • What happens when the reading of one quantity fails? A given list of quantities is then broken.
  • As a result, the code should exit the loop, release the memory and close a file, right?
  • Somehow I get the impression from this experience that it gets blocked somewhere.

  1. A simple program using rave & the file you are having problems with, so that I can reproduce the segmentation fault.

I use a set of BALTRAD libraries with rave and compile them into the executable odc_toolbox. I can share the compiled version with you, or the toolbox sources if you want to try to compile it yourself.

  2. A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.

Below is the trace from /var/log/messages-20240818 as provided to me by our IT team. This is when the system first started to freeze (execution date-time: 20240816111500). You will also find below a snippet from the corresponding production log. If you look at the timing you will see that the log points to the moment when rave_io.c:486 fails.

Aug 16 11:17:01 sotrtm13-sine systemd[1]: Started Session c33896 of user soprajob.
Aug 16 11:17:02 sotrtm13-sine systemd[1]: session-c33896.scope: Succeeded.
Aug 16 11:17:08 sotrtm13-sine kernel: python3[77990]: segfault at 28ad000 ip 0000149c6eb43ce3 sp 00007ffeb9eff4e8     error 4 in libc-2.28.so[149c6ea74000+1bc000]
Aug 16 11:17:08 sotrtm13-sine kernel: Code: 17 e0 c5 f8 77 c3 48 3b 15 da 48 2f 00 0f 83 25 01 00 00 48 39 f7 72 0f 74 12 4c 8d 0c 16 4c 39 cf 0f 82 c5 01 00 00 48 89 d1 <f3> a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Started Process Core Dump (PID 78001/UID 0).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Resource limits disable core dumping for process 77990 (python3).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Process 77990 (python3) of user 65374 dumped core.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: systemd-coredump@0-78001-0.service: Succeeded.
Aug 16 11:17:12 sotrtm13-sine systemd[1]: Stopping User Manager for UID 65374...

A snippet from the production log:

bin/odc_toolbox -i /disque_local/utmp/sms/E/tache_19682626/toolbox_input -o /disque_local/utmp/sms/E/tache_19682626/composite_input --qc hac-increment,ropo,beamb,satfilter,hac-filter,qi-total --procs 32 --verbose
2024/08/16 11:17:08 :       ERROR : Preloading of file failed: /disque_local/utmp/sms/E/tache_19682626/toolbox_input/T_PAGZ54_C_LYBM_20240816111006.hdf (rave_io.c:486)
andershenja commented 2 months ago

Thanks Anders for this clarification.

First, some questions related to 'The LazyLoader tokenizes preloadQuantities and creates a list of quantities that are looped over':

  • What happens when the reading of one quantity fails? A given list of quantities is then broken.

It depends, but in this situation the actual reading of data is performed inside hlhdf. It will load all nodes until it fails to read some data, and in that case it will return with an error. Since the preload fails, the lazy loader will be released and all data associated with the HDF5 file should be released.

  • As a result, the code should exit the loop, release the memory and close a file, right?

Yes. The file will be closed. The filename is however associated with the RaveIO instance and will not be released until the instance is destroyed: either by the python gc when a = _raveio.new(), a.filename=..., a.load() is used, or else directly if _raveio.open("...") fails.
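
For reference, a minimal sketch of the two usage patterns described above; it assumes (as in the test program further down in this thread) that a failed load surfaces as a Python exception:

import _raveio

# Pattern 1: explicit instance. The filename stays attached to the RaveIO
# instance and is only released when the Python object itself is collected.
a = _raveio.new()
a.filename = "T_PAGZ54_C_LYBM_20240821081006.hdf"
try:
    a.load()
except Exception:
    pass  # preload failed; 'a' (and its filename) is freed later by the gc

# Pattern 2: one-shot open. If it fails, the instance is released directly.
try:
    obj = _raveio.open("T_PAGZ54_C_LYBM_20240821081006.hdf").object
except Exception:
    obj = None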

  • Somehow I get the impression from this experience that it gets blocked somewhere.

Hard to say what it can be, maybe something with the filesystem or some global lock in python !?

  1. A simple program using rave & the file you are having problems with, so that I can reproduce the segmentation fault.

I use a set of BALTRAD libraries with rave and compile them into the executable odc_toolbox. I can share the compiled version with you, or the toolbox sources if you want to try to compile it yourself.

  2. A stack trace from the SIGSEGV core dump so that I get a hint where the problem is.

Below is the trace from /var/log/messages-20240818 as provided to me by our IT team. This is when the system first started to freeze (execution date-time: 20240816111500). You will also find below a snippet from the corresponding production log. If you look at the timing you will see that the log points to the moment when rave_io.c:486 fails.

Aug 16 11:17:01 sotrtm13-sine systemd[1]: Started Session c33896 of user soprajob.
Aug 16 11:17:02 sotrtm13-sine systemd[1]: session-c33896.scope: Succeeded.
Aug 16 11:17:08 sotrtm13-sine kernel: python3[77990]: segfault at 28ad000 ip 0000149c6eb43ce3 sp 00007ffeb9eff4e8     error 4 in libc-2.28.so[149c6ea74000+1bc000]
Aug 16 11:17:08 sotrtm13-sine kernel: Code: 17 e0 c5 f8 77 c3 48 3b 15 da 48 2f 00 0f 83 25 01 00 00 48 39 f7 72 0f 74 12 4c 8d 0c 16 4c 39 cf 0f 82 c5 01 00 00 48 89 d1 <f3> a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: Started Process Core Dump (PID 78001/UID 0).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Resource limits disable core dumping for process 77990 (python3).
Aug 16 11:17:08 sotrtm13-sine systemd-coredump[78002]: Process 77990 (python3) of user 65374 dumped core.
Aug 16 11:17:08 sotrtm13-sine systemd[1]: systemd-coredump@0-78001-0.service: Succeeded.
Aug 16 11:17:12 sotrtm13-sine systemd[1]: Stopping User Manager for UID 65374...

Yes, it looks like it's a python3 binary that is crashing. The core dump is available using, for example, the coredumpctl command. Then, after the core dump has been retrieved, it should be possible to run "gdb python3 core." and after that "bt" to get the backtrace (or something similar).

A snippet from the production log:

bin/odc_toolbox -i /disque_local/utmp/sms/E/tache_19682626/toolbox_input -o /disque_local/utmp/sms/E/tache_19682626/composite_input --qc hac-increment,ropo,beamb,satfilter,hac-filter,qi-total --procs 32 --verbose
2024/08/16 11:17:08 :       ERROR : Preloading of file failed: /disque_local/utmp/sms/E/tache_19682626/toolbox_input/T_PAGZ54_C_LYBM_20240816111006.hdf (rave_io.c:486)

Have you got a running odc_toolbox on the ODE or similar that I can access?

MilkaRadojevic commented 2 months ago

It depends but in this situation, the actual reading of data is performed inside hlhdf.

Indicating that something might have failed inside hlhdf.

it looks like it's a python3-binary that is crashing.

I shall report this to our IT team. I ran tests on my dev server (supposed to be a copy of the production nodes) but I cannot reproduce the segfault; rave_pgf.log shows that the file is deleted.

[]$ date; /home/mdso/radojevicm/dso_cirrus/toolbox3/bin/odc_toolbox -i input_lybm -o output_lybm --qc 'hac-increment,ropo,beamb,satfilter,hac-filter,qi-total' --verbose
ven. août 23 09:51:13 GMT 2024
2024/08/23 09:51:13 :       ERROR : Preloading of file failed: input_lybm/T_PAGZ54_C_LYBM_20240821081006.hdf (rave_io.c:486)
Objects created: 1509
Objects deleted: 1509
Objects pending: 0
[]$ more rave_pgf.log
2024-08-23 09:51:13,758 ERROR    ForkPoolWorker-1: T_PAGZ54_C_LYBM_20240821081006.hdf: Failed to open file - Deleting file.
2024-08-23 09:51:13,766 INFO     No statistics

Have you got a running odc_toolbox on the ODE or similar that I can access?

We don't use the ODE; it does not serve our development purposes. The source of CIRRUS is at https://github.com/opera-radar/opera-cirrus-operational and the execution of odc_toolbox is at https://github.com/opera-radar/opera-cirrus-operational/blob/master/scripts/lance_cirrus.sh#L228

Thank you Anders.


andershenja commented 2 months ago

It depends but in this situation, the actual reading of data is performed inside hlhdf.

Indicating that something might have failed inside hlhdf.

Yes, it can't read the data I mentioned earlier, which h5dump also fails to read. However, there is nothing in hlhdf indicating that the problem you are experiencing originates from hlhdf, and without a way for me to reproduce the problem it's almost impossible to find the cause.
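
As an independent cross-check outside rave and hlhdf, the failing read can also be attempted from Python with h5py (a third-party HDF5 reader); the dataset path below is an assumption based on the usual ODIM_H5 layout, where the array of the group reported in the issue sits under /dataset1/data1/data:

import h5py  # not part of rave; used here only as a cross-check

try:
    with h5py.File("T_PAGZ54_C_LYBM_20240821081006.hdf", "r") as f:
        # /dataset1/data1 is the group mentioned in the issue; the array
        # itself is assumed to live in its 'data' member.
        arr = f["/dataset1/data1/data"][...]
        print("read OK, shape:", arr.shape)
except Exception as e:
    print("read failed:", e)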

it looks like it's a python3-binary that is crashing.

I shall report this to our IT team. I ran tests on my dev server (supposed to be a copy of the production nodes) but I cannot reproduce the segfault; rave_pgf.log shows that the file is deleted.

[]$ date; /home/mdso/radojevicm/dso_cirrus/toolbox3/bin/odc_toolbox -i input_lybm -o output_lybm --qc 'hac-increment,ropo,beamb,satfilter,hac-filter,qi-total' --verbose
ven. août 23 09:51:13 GMT 2024
2024/08/23 09:51:13 :       ERROR : Preloading of file failed: input_lybm/T_PAGZ54_C_LYBM_20240821081006.hdf (rave_io.c:486)
Objects created: 1509
Objects deleted: 1509
Objects pending: 0
[]$ more rave_pgf.log
2024-08-23 09:51:13,758 ERROR    ForkPoolWorker-1: T_PAGZ54_C_LYBM_20240821081006.hdf: Failed to open file - Deleting file.
2024-08-23 09:51:13,766 INFO     No statistics

Just a speculation, since you mentioned that you are losing memory: could there be something related to multiprocessing causing a problem with hanging processes? This is something we have experienced more than once: if a child process dies unexpectedly, the complete tiling is left waiting forever. I believe I have mentioned this as a possible problem a few months ago.

Have you got a running odc_toolbox on the ODE or similar that I can access?

We don't use the ODE; it does not serve our development purposes. The source of CIRRUS is at https://github.com/opera-radar/opera-cirrus-operational and the execution of odc_toolbox is at https://github.com/opera-radar/opera-cirrus-operational/blob/master/scripts/lance_cirrus.sh#L228

Thank you Anders.


andershenja commented 2 months ago

I wrote a simple test program that only tries to open the file in a loop to see if there is any kind of memory leak. top didn't show any indication that memory was increasing, which would have been the case if there was an obvious memory problem related to erroneous reads in hlhdf, raveio or the lazy loader. The code didn't crash either, neither during execution nor when the program exited.

import _rave, _raveio

if __name__ == "__main__":
    _rave.setDebugLevel(_rave.Debug_RAVE_SPEWDEBUG)
    for i in range(0, 10000):
        try:
            a = _raveio.open("T_PAGZ54_C_LYBM_20240821081006.hdf").object
        except:
            pass

MilkaRadojevic commented 2 months ago

Thank you very much for all tests and for sharing the ideas.

Could there be something related to multiprocessing causing a problem with hanging processes? This is something we have experienced more than once: if a child process dies unexpectedly, the complete tiling is left waiting forever.

It's possible. I noticed before that the code that handles multiprocessing in rave is quite basic and the same as in the Python 2 era. Do you think it would be worth checking it out and making it more up to date, with more try-except conditions?

andershenja commented 2 months ago

Yes, we made some modifications to the tiled compositing to prevent problems with crashing processes. It should be possible to make similar modifications to the code used by odc_toolbox (odc_polarQC.py).
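
For illustration only (this is not the actual odc_polarQC.py code), a minimal sketch of the kind of defensive pattern discussed here: each worker wraps its task in try/except so Python-level errors are reported back as a status, and the parent collects results with a timeout so a worker killed by a segfault cannot leave the pool waiting forever. The function name, file list and timeout are hypothetical.

import multiprocessing

def quality_control(path):
    # Hypothetical per-file QC step; Python-level errors are caught so the
    # worker returns a status instead of failing silently. A C-level crash
    # (segfault) cannot be caught here and is handled by the timeout below.
    try:
        # ... run the QC chain on 'path' ...
        return (path, "ok")
    except Exception as e:
        return (path, "failed: %s" % e)

if __name__ == "__main__":
    files = ["T_PAGZ54_C_LYBM_20240821081006.hdf"]
    with multiprocessing.Pool(processes=4) as pool:
        results = [pool.apply_async(quality_control, (f,)) for f in files]
        for f, r in zip(files, results):
            try:
                # A timeout keeps the parent from waiting forever if a child
                # process died before sending its result back.
                print(r.get(timeout=300))
            except multiprocessing.TimeoutError:
                print("no result for %s, worker may have died" % f)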

IE-smhi commented 2 months ago

Greetings @MilkaRadojevic,

Based on the information provided, our ability to diagnose the issue is limited due to the absence of a core dump and a detailed stack trace. Additionally, since this issue involves a Python process encountering a segmentation fault, please ensure that debug information is available. This will allow us to see the relevant names and symbols in the core dump.

Here is what we know so far:

  • A python3 process running odc_toolbox segfaults in libc-2.28.so.
  • systemd-coredump reports that resource limits disable core dumping for that process.
  • The segfault coincides with the "Preloading of file failed" error at rave_io.c:486.
  • The problem has not been reproduced on the development server or with a simple test program.

To assist you further, please enable core dumps on your system and gather a full Bill of Materials (BoM) for the software involved.

Steps to enable core dumps and debuginfo

Optional: Ensure that Python and any relevant C extensions are compiled with debugging symbols. This is typically achieved by installing the -debuginfo packages for the relevant libraries and binaries. For example, on RPM-based systems, you might need to install the debug information package for Python.

If you have custom-built Python extensions or other relevant C libraries, ensure they are compiled with the -g option to include debugging information.

The core dump generation is controlled by the core resource limit, which determines the maximum size of a core dump file.

  1. Check the current core dump limit:
    • Run the following command to check the current core dump limit:
      ulimit -c
  2. Set the core dump size:
    • Add ulimit -c unlimited to the script you are executing to ensure core dumps are enabled during the execution (a Python-level alternative is sketched after this list).
    • Additionally, run ulimit -c unlimited in the terminal to apply this setting for your current session.
    • Note: This command affects only the current shell session.
  3. Alternatively, set the core dump size in /etc/security/limits.conf:
    • Modify the /etc/security/limits.conf file to set the core dump size persistently.
  4. Configure core dump pattern and file name in /etc/sysctl.conf:
    • Edit the /etc/sysctl.conf file to specify where core dumps should be saved:
      kernel.core_pattern=/tmp/core_%e.%p.%h.%t
      kernel.core_uses_pid = 1
      • kernel.core_pattern specifies the location and format of the core dump file.
      • kernel.core_uses_pid appends the process ID to the core dump file name.
    • Apply these changes by running:
      sudo sysctl -p
  5. Verify and test core dump configuration:
    • Log out and log back in, or restart the system to ensure the new limits are applied.
    • Test the core dump configuration by forcing a process to crash:
      kill -SIGSEGV $$
  6. Optional: Check and adjust systemd core dump service settings
    • The systemd-coredump service handles core dumps in systemd-based systems.
    • To view the configuration:
      sudo systemctl status systemd-coredump

      Alternatively, edit /etc/systemd/coredump.conf for custom settings.
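
As a complement to the ulimit approach in step 2, and assuming the script itself can be modified, the soft core-dump limit can also be raised from inside the Python process with the standard resource module. A sketch, not rave code:

import resource

# Raise the soft core-dump limit to whatever the hard limit allows. The hard
# limit itself can only be raised by a privileged user or via
# /etc/security/limits.conf.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("core dump limit: soft=%d hard=%d" % resource.getrlimit(resource.RLIMIT_CORE))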

We hope this information helps in generating a core dump. If the issue persists, please ensure that no additional security mechanisms, such as SELinux or AppArmor, are blocking core dumps.

For more detailed guidance, you can refer to Red Hat's documentation on core dumps: https://access.redhat.com/solutions/4896

Provide a detailed Bill of Materials (BoM)

To help us diagnose the issue more effectively, please provide a comprehensive list of the software and libraries involved, including their versions.

This should include:

  • Operating System: Specify the exact version (cat /etc/os-release).
  • Python version: Run python --version to obtain this information.
  • RAVE version: Please specify the exact version of the RAVE library being used.
  • Other dependencies: List any additional libraries or tools involved.

Please ensure that the information is comprehensive and corresponds to the environment where the segmentation fault occurs. Double-check that the process (PID) experiencing the segmentation fault is running the expected code and that all relevant libraries are correctly identified.

Please send this information via email or another secure method of communication if it is considered sensitive.

Important: core dump handling

Do not post the core dump publicly on GitHub or any other public forum. Instead, please email the core dump directly to us or use another secure method of communication.

Yours truly,

I.E. - SMHI.

MilkaRadojevic commented 2 months ago

Thank you @IE-smhi for such a detailed analysis of the presented problem.

I will ask our IT team responsible for the operational environment at Météo-France to provide us with a core dump and a detailed stack trace, and if it is still available it will be sent to Anders via email. Meanwhile, I will forward your analysis and recommendations to them, since access to the production platform is limited for us developers.

The answers I can provide to you:

Operating System: Specify the exact version (cat /etc/os-release).

Rocky Linux 8.5

Python version: Run python --version to obtain this information.

Python 3.9

RAVE version: Please specify the exact version of the RAVE library being used.

Tag 38 + python script from Anders (python39.patch) to allow for compilation of rave at Météo-France

Other dependencies: List any additional libraries or tools involved.

gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)

Best regards, Milka

andershenja commented 4 weeks ago

Closing issue until further information is available.