EXCEPTION_ACCESS_VIOLATION when using multiple engines at once

Venorcis commented 3 years ago

As discussed by email as well. Two different traces seem to occur when using two (or more) SWI engines at the same time.

Current thread (0x000001ce99292830):  JavaThread "maker" [_thread_in_native, id=21932, stack(0x00000029cef00000,0x00000029cf000000)]

Stack: [0x00000029cef00000,0x00000029cf000000],  sp=0x00000029ceffdcc0, free space=1015k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
code)
C  [libswipl.dll+0x2d2a]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j  org.jpl7.fli.Prolog.next_solution(Lorg/jpl7/fli/qid_t;)Z+0
j  org.jpl7.Query.fetchNextSolution()Z+4
j  org.jpl7.Query.hasMoreSolutions()Z+20
j  org.jpl7.Query.hasNext()Z+1
j  org.jpl7.Query.allSolutions()[Ljava/util/Map;+26

siginfo: EXCEPTION_ACCESS_VIOLATION (0xc0000005), reading address 0xffffffffffffffff

and

Current thread (0x155afc00):  JavaThread "grinder" [_thread_in_native, id=19580, stack(0x16200000,0x16250000)]

siginfo: ExceptionCode=0xc0000005, reading address 0x00000004

Stack: [0x16200000,0x16250000],  sp=0x1624f350,  free space=316k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C  [libswipl.dll+0xa4587]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j  org.jpl7.fli.Prolog.destroy_engine()I+0

The annoying thing is that this doesn't happen every time. The chance seems to be 1/50 or something from my tests, either with 32-bit Java 8 or with 64-bit Java 15 (using SWI 8.3.8).

I've created an example that at least somewhat reproduces this: https://mega.nz/file/591kVZBb#u4ROK2pwSsAHODv-eqBua-UowXBZrqlovEoSsYDShTY The BAT file runs the same GOAL agent over and over, and you'll see hs_err_pid####.log files appearing in the directory sometimes with the traces given here. Note that this packages comes with a bundled SWI; on Windows this is extracted to e.g. C:\Users\Vince\AppData\Local\GOAL\swilibs\1.4.0-SNAPSHOT. Once extracted you could replace one of the DLLs there (e.g. for adding debug print statements).

JanWielemaker commented 3 years ago

Thanks. Any clue what I would need to do to get this running under Linux? One never knows of course, but most of these things are platform independent and my experience with the diagnostic toolbox on Linux is a lot more advanced.

Venorcis commented 3 years ago

Thanks. Any clue what I would need to do to get this running under Linux? One never knows of course, but most of these things are platform independent and my experience with the diagnostic toolbox on Linux is a lot more advanced.

Hi Jan, though I haven't tested it at all yet, the example (i.e. GOAL) should actually work fine on Linux as well, though I'm of course not 100% sure if the same threading bug will occur there (there are platform-specific differences there AFAIK). The BAT file simply loops a java command every 2 seconds, which should be easily reproducible in an SH file too I assume 😁

JanWielemaker commented 3 years ago

Yes. I was really expecting trouble, but apparently it finds the Prolog components from my installation. I get this:

janw (GOAL; master) 20_> timeout 2s java -jar runtime-2.2.0-SNAPSHOT.jar coffee.mas2g -v
GOAL 2.2.0-SNAPSHOT build 202009291920 on OpenJDK Runtime Environment 11.0.8+10-post-Ubuntu-0ubuntu120.04 amd64
launching '/home/janw/Bugs/GOAL/coffee.mas2g'.
start-up complete.
waiting for the first agent to be launched...
starting agent 'maker'.
starting agent 'grinder'.
started agent 'maker' (took 10ms).
started agent 'grinder' (took 2ms).
agent 'grinder' terminated successfully.

If I do not set a timeout or set the timeout to 4s, I get a bit more output:

all agents have stopped running.
shutting down the multi-agent system.
agent 'maker' terminated successfully.
ran for 0 seconds.

Does this look good? Should I use 2s or 4s. Killing the stuff forcefully while not yet complete may easily cause problems. Could it be we are looking into a close-down issue in the first place?

Venorcis commented 3 years ago

AFAIK it automatically extracts and uses the embedded SWI version (we supply it for all platforms), never something from your installation.

Yeah there is some problem with one of the two agents not always shutting down that I perhaps should track down and fix first. The second trace with the destroy_engine might actually be related to shutting down indeed (though I still believe such a C-level exception should never happen). The one on fetchNextSolution happened in regular runs as well (before termination).

JanWielemaker commented 3 years ago

AFAIK it automatically extracts and uses the embedded SWI version (we supply it for all platforms), never something from your installation.

That seems rather unlikely as this now runs happily on Linux :smile: So far it seems running happily. The timeout is indeed rather irrelevant. Somtimes it just hangs after agent 'grinder' terminated successfully. I have a telco now, so I just leave it running ....

JanWielemaker commented 3 years ago

Good. The mystery is solved, your snapshot also contains the Linux SWI-Prolog binaries, so I am in fact running your code. Good thing is that the Linux version also crashes. Kind of similar rate: using a 4s timeout it crashes about every 3-4 minutes or, indeed, about 1 in 50. The stack is a little different. Almost all (have to check) crash here:

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.jpl7.fli.Prolog.destroy_engine()I+0
j  swiprolog.SwiPrologInterface.removeDatabase(Lswiprolog/database/PrologDatabase;)V+66
j  swiprolog.database.PrologDatabase.destroy()V+16
j  swiPrologMentalState.SwiPrologMentalBase.destroy()V+4
j  mentalState.MentalModel.cleanUp()V+53
j  mentalState.MentalState.cleanUp()V+47
j  mentalState.MentalStateWithEvents.cleanUp()V+4
j  goal.core.runtime.service.agent.RunState.dispose()V+21
j  goal.core.agent.GOALInterpreter.dispose()V+20
j  goal.core.agent.Controller.setTerminated()V+32
j  goal.core.agent.GOALInterpreter$2.run()V+553
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@11.0.8
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@11.0.8
j  java.lang.Thread.run()V+11 java.base@11.0.8
v  ~StubRoutines::call_stub

It gives some more info about what is going on in the C side than the Windows version. Will try to attach a debugger. I'll first try to fix this on Linux :smile:

Oh, also one with this, so the issue seems cross-platform.

j  org.jpl7.fli.Prolog.next_solution(Lorg/jpl7/fli/qid_t;)Z+0
j  org.jpl7.Query.fetchNextSolution()Z+4
j  org.jpl7.Query.hasMoreSolutions()Z+20
j  org.jpl7.Query.hasNext()Z+1
j  org.jpl7.Query.allSolutions()[Ljava/util/Map;+26
j  swiprolog.database.PrologDatabase.rawquery(Lswiprolog/language/PrologQuery;)Ljava/util/Set;+1
j  swiprolog.database.PrologDatabase.flushWriteCache()V+15

Venorcis commented 3 years ago

With the daily Windows build I no longer get these errors, thanks very much Jan :)

Would be great if we can get updated Mac & Linux builds too so we can do the new GOAL release 😁

Venorcis commented 3 years ago

Hi @JanWielemaker,

We got a new report that EXCEPTION_ACCESS_VIOLATIONs are being thrown in multi-threaded execution. We're actually on SWI 8.3.17 already right now.

They are happening in weird/different code locations this time, seemingly when creating terms (get_float, get_integer_big, get_string_chars, etc.), but again only when using multiple agents (and thus SWI engines) at once.

The reporter attached these logs: hs_err_pid13368.log hs_err_pid12696.log hs_err_pid15684.log hs_err_pid1264.log

I've asked him to provide a minimal example that reproduces this, so hopefully I can provide you with that soon.

JanWielemaker commented 3 years ago

Looks like similar crashes. Just a hinge, could it be some Java thread is querying Prolog without having created a Prolog engine? Example typically doesn't have to be minimal. Only feasible to install and reproducible.

Venorcis commented 3 years ago

Looks like similar crashes.

To me it looks different because earlier it always crashed on next_solution and now it seems to happen when creating terms.

Just a hinge, could it be some Java thread is querying Prolog without having created a Prolog engine?

No, all crashes originate from within the GOALInterpreter somewhere, which is the agent running in it's own thread. Also, according to the reporter, it happens at different times during execution (but seemingly quite consistently). It does seem from the logs like he is running about 50 agents.

Example typically doesn't have to be minimal. Only feasible to install and reproducible.

I haven't received any code yet, but he promised to share something. I might try reproducing it myself tomorrow.

JanWielemaker commented 3 years ago

To me it looks different because earlier it always crashed on next_solution and now it seems to happen when creating terms.

I meant the 4 logs look similar. Yes, the problem seems very different than what triggered the original issue. We should move this to a new issue.

SWI-Prolog / packages-jpl

EXCEPTION_ACCESS_VIOLATION when using multiple engines at once #85