RascalSoftware / python-RAT

Python interface for RAT
1 stars 5 forks source link

run crashes on Linux if run outside of main thread #70

Closed alexhroom closed 2 months ago

alexhroom commented 3 months ago

On Linux (including IDAaaS), attempting to run the following script will produce a segfault:

from concurrent.futures import ThreadPoolExecutor

import RATapi as RAT

executor = ThreadPoolExecutor()
future = executor.submit(RAT.examples.DSPC_standard_layers)
project, results = future.result()

This is an issue because RasCAL-2 needs long-running code (e.g. RAT.run()) to run outside the main loop, as the main loop is used for handling and updating the GUI itself. It seems to run successfully on Windows (@DrPaulSharp managed to run it okay). If anyone else would like to run the script on whatever devices they have to hand that'd be useful!

Various observations:

DrPaulSharp commented 3 months ago

Thanks Alex, I can confirm that I can run the test ok on windows, but I get a seg fault on linux (specifically through using the Windows Subsystem for Linux).

alexhroom commented 3 months ago

some debugging: running in gdb python3: for custom XY:

Thread 80 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4de00640 (LWP 11047)]
0x00007ffff38780d3 in RAT::SLDFunction (x=2.8571428571428572, SLD=..., sldVal=...) at cpp/RAT/coder_array.h:164
164     T& operator[](SZ _index) {

for standard layers:

Thread 80 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4de00640 (LWP 11879)]
0x00007ffff387fc56 in RAT::allocateLayersForContrast (contrastLayers=..., outParameterisedLayers=..., useImaginary=useImaginary@entry=0 '\000', thisContrastLayers_data=thisContrastLayers_data@entry=0x2af80, thisContrastLayers_size=thisContrastLayers_size@entry=0x7fff4ddbe568) at cpp/RAT/allocateLayersForContrast.cpp:56
56            thisContrastLayers_data[i1 + n * i] = 0.0;
alexhroom commented 3 months ago

it seems that there are various issues around the GIL when it comes to calling Python from inside C threads? https://github.com/python/cpython/issues/111034 https://docs.python.org/3/c-api/init.html#non-python-created-threads unsure if the problem may be that

StephenNneji commented 3 months ago

Irrespective of this issue, I will recommend using a Process instead of a python thread as the latter will affect the responsiveness of the GUI. The main problem with the python process is that it uses pickling to copy inputs i.e. Project and Controls but because of some issue the Project class cannot be pickled (Apologies, for some reason we did not open an issue on GitHub).

We could either try to fix the pickling issue so we can use the run function

RAT.run(project, control)

or we could just pickle the inputs to RATMain

RATapi.rat_core.RATMain(problem_definition, cells, limits, cpp_controls, priors )

alexhroom commented 3 months ago

@StephenNneji in the GUI code i was using a QtCore.QThread, which seems to be the recommended way to keep long-running things 'out of the way' without affecting the GUI. the issue above uses a regular Python thread in order to be a smaller reproduceable example (and to confirm that the segfault is on our end and not Qt's!)

StephenNneji commented 3 months ago

As far as I know, QThread are just a wrapper on the python thread https://stackoverflow.com/questions/1595649/threading-in-a-pyqt-application-use-qt-threads-or-python-threads except if something has been changed in PyQt6

alexhroom commented 3 months ago

yeah, with the main benefits of the wrapper being that it handles signals/slots and so on to avoid the thread choking up the GUI! as in the post:

The main difference is that QThreads are better integrated with Qt (asynchrnous signals/slots, event loop, etc.).

will try it with a ProcessPoolExecutor when we can get the thing pickled correctly anyway and see if it's more performant

DrPaulSharp commented 3 months ago

Memory checking shows an Invalid write of size 8 in the following code from allocateLayersForContrast.cpp:

      n = coder::internal::intlength(contrastLayers.size(0), contrastLayers.size
        (1));
      thisContrastLayers_size[0] = n;
      thisContrastLayers_size[1] = 5;
      for (i = 0; i < 5; i++) {
        for (i1 = 0; i1 < n; i1++) {
          thisContrastLayers_data[i1 + n * i] = 0.0;
        }
      }

Specifically, the line thisContrastLayers_data[i1 + n * i] = 0.0;. This line is merely the initialisation of the array, so does not read in any RAT data. As far as I can tell, the array indices are correct, and the array is of type double, so I'm not sure what the cause of the invalid memory access is.

DrPaulSharp commented 3 months ago

We also find a memory issue in SLDFunction.cpp. In this code, we construct arrays for all values less than and greater than a particular function. It seems to me that there is a possibility that in the way this is done in the C++ code there is the possibility of an empty array, but this doesn't occur in the matlab code.

alexhroom commented 2 months ago

fascinatingly, it works completely fine if RAT is imported inside the thread:

from concurrent.futures import ThreadPoolExecutor

def thread_func():
    import RATapi as RAT

    project, results = RAT.examples.DSPC_standard_layers()
    return project, results

executor = ThreadPoolExecutor()
future = executor.submit(thread_func)
project, results = future.result()

doesn't segfault. even stranger is that if you import it outside the thread and then again inside it does segfault, but i may just be missing some detail of python import implementation here:

from concurrent.futures import ThreadPoolExecutor

import RATapi as RAT

def thread_func():
    import RATapi as THREADRAT

    project, results = THREADRAT.examples.DSPC_standard_layers()
    return project, results

executor = ThreadPoolExecutor()
future = executor.submit(thread_func)
project, results = future.result()
DrPaulSharp commented 2 months ago

I've had a fresh look at allocateLayersForContrast.m in the MATLAB source:

image

@arwelHughes Can you please confirm whether the code on line 35 should instead read: thisContrastLayers(i, :) = []; instead of thisContrastLayers(1, :) = [];. I can't guarantee this bug is causing the memory issues, but we should look to fix it urgently regardless.

arwelHughes commented 2 months ago

Hi Paul, Well, firstly I have a different version to you – did you put the ‘disp’ in for debugging?

@.***

It probably should be ‘thisContrastLayers(I,:)…. as you suggest, but in practice it makes no difference. That line is never actually reached in normal usage (because there will never be an empty layer), but I think it was necessary because otherwise Coder throws a ‘thisContrastLayers is not fully defined on all execution paths’ error (I think!). Cheers, Arwel

From: Paul Sharp @.> Date: Monday, 2 September 2024 at 17:31 To: RascalSoftware/python-RAT @.> Cc: Hughes, Arwel (STFC,RAL,ISIS) @.>, Mention @.> Subject: Re: [RascalSoftware/python-RAT] run crashes on Linux if run outside of main thread (Issue #70)

I've had a fresh look at allocateLayersForContrast.m in the MATLAB source:

image.png (view on web)https://github.com/user-attachments/assets/31d9b618-85a5-4d22-8caf-c6688f8b3dee

@arwelHugheshttps://github.com/arwelHughes Can you please confirm whether the code on line 35 should instead read: thisContrastLayers(i, :) = [];. I can't guarantee this bug is causing the memory issues, but we should look to fix it urgently regardless.

— Reply to this email directly, view it on GitHubhttps://github.com/RascalSoftware/python-RAT/issues/70#issuecomment-2325074999, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGMNVXUMIHUG5SAJXSNJDM3ZUSHEZAVCNFSM6AAAAABM4FBUK2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRVGA3TIOJZHE. You are receiving this because you were mentioned.Message ID: @.***>

DrPaulSharp commented 2 months ago

Thanks Arwel. Yes, the disp lines are for debugging.

alexhroom commented 2 months ago

since we've merged #74 i've found that RATMain does not produce a segfault in a separate process (only a separate thread). so this issue is a lot less urgent since we can use processes for multiprocessing in the GUI!

DrPaulSharp commented 2 months ago

closed by #78