antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

[Cpp] Separate parser/lexer instances not threadsafe #1435

Closed jrobsonchase closed 7 years ago

jrobsonchase commented 7 years ago

I'm aware that a single instance of the parser/lexer objects cannot be shared across threads, but I'm seeing behavior where instances created/destroyed/never leaving their thread interfere with completely separate instances on other threads.

It appears that something in the ATN prediction cache is to blame, at least to one not terribly familiar with the internals. Calling the parser once in the main thread before spawning child threads seems to fix the problem, but only if the inputs are all the same.

Repository with example code: https://github.com/Pursuit92/antlr_cpp_threads

sharwell commented 7 years ago

It looks like the DFA in the C++ target did not implement a concurrent set for the states field. There is a _mutex, but it is defined on the recognizer, whereas it actually needs to be defined on the DFA, and thus offers no protection when multiple recognizer instances are involved.

:memo: This analysis only covers one clear point of failure. Other code may exist which precludes the use of this target in multithreaded scenarios.
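The fix described above can be sketched roughly as follows. This is a hypothetical, simplified illustration (the real DFA class has many more fields), showing the mutex living on the DFA rather than on the recognizer:

```cpp
#include <mutex>
#include <set>

// Hypothetical sketch: the mutex guarding the states set is a member of the
// DFA itself, so two parser instances that share the same static DFA
// serialize their updates through the same lock. (Not the real antlr4 API.)
struct DFA {
  std::mutex statesMutex;  // per-DFA, not per-recognizer
  std::set<int> states;    // stand-in for the set of DFA states

  // Returns true if the state was newly added.
  bool addState(int id) {
    std::lock_guard<std::mutex> lock(statesMutex);
    return states.insert(id).second;
  }
};
```

Because the lock is a member of the shared DFA, every recognizer that touches this DFA contends on the same mutex; a lock on the recognizer would protect nothing once two recognizers share one DFA.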

jrobsonchase commented 7 years ago

Yeah, I noticed a lot of similar scenarios across the codebase where the wrong thing was getting locked. Pretty much everywhere the Java runtime uses synchronized(thing), the C++ runtime locks only this rather than the thing. I tried fixing those where I could, but still had the same problems.
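The mismatch can be illustrated with a minimal, hypothetical sketch: Java's synchronized(thing) locks the monitor of thing, so the literal C++ equivalent must lock a mutex owned by thing, not one owned by this:

```cpp
#include <mutex>

// Hypothetical illustration of the bug pattern described above.
struct SharedCache {
  std::mutex mtx;  // plays the role of `thing`'s monitor in Java
  int hits = 0;
};

struct Recognizer {
  std::mutex ownMutex;  // locking this is like locking `this` in Java
  SharedCache *cache;

  explicit Recognizer(SharedCache *c) : cache(c) {}

  // Wrong: two Recognizer instances sharing one cache each lock their own
  // mutex, so their increments of the shared counter can still race.
  void touchWrong() {
    std::lock_guard<std::mutex> lock(ownMutex);
    ++cache->hits;
  }

  // Right: the equivalent of Java's synchronized (cache) { ... }.
  void touchRight() {
    std::lock_guard<std::mutex> lock(cache->mtx);
    ++cache->hits;
  }
};
```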

sharwell commented 7 years ago

@Pursuit92 The hard part isn't the locks. The hard part is the many places where I omitted locks altogether and relied on the Java Memory Model to provide correct lock-free behavior in concurrent scenarios, especially on the fast path.

jrobsonchase commented 7 years ago

I figured that was probably the case. Making all of the static things also thread_local fixes the race condition for obvious reasons, but might hurt performance (or at least prevent potential speedups). Potentially a short-term fix until someone can take the time to fix the actual races?

sharwell commented 7 years ago

Making all of the static things also thread_local fixes the race condition for obvious reasons

This would also force parser instances to have thread affinity. In addition to the fact that parser instances are already not allowed to be used concurrently from multiple threads, you would now be in a situation where a parser instance is only ever allowed to be used from the thread on which it was created. The reason is that the static variable holding the ATN for a parser is copied into an instance variable when the parser is created; after that point you'd no longer be accessing it through a thread_local.
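A minimal sketch of that thread-affinity problem (hypothetical names; the runtime's real ATN handling is more involved): once a reference to a thread_local is captured into an instance, the instance is bound to the constructing thread's copy:

```cpp
#include <thread>

// Hypothetical sketch. With a thread_local "shared" ATN, each thread has
// its own copy, and a parser captures the copy belonging to whichever
// thread constructed it.
struct ATN { int version = 0; };

thread_local ATN perThreadATN;  // one independent instance per thread

struct Parser {
  ATN &atn;                      // bound at construction time
  Parser() : atn(perThreadATN) {}
};
```

A Parser created on thread A and later handed to thread B would still read thread A's ATN copy, which dangles once thread A exits; hence the thread affinity.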

jrobsonchase commented 7 years ago

This would also force parser instances to have thread affinity.

I'd take that over being unable to use separate instances of the parser concurrently.

You would now be in a situation where a parser instance is only allowed to ever be used from the thread on which it was created.

How common is the scenario where a parser is instantiated on one thread and used non-concurrently on others? And would that somehow perform better than having multiple concurrent parsers tied to their threads?

parrt commented 7 years ago

@mike-lischke any thoughts?

mike-lischke commented 7 years ago

Yes, this is a known problem. I'm hit by it myself, and others have mentioned it already. But I have a few loose ends at the moment that I need to finish before I can come back to this problem. Hence the delay.

About the synchronization: in C++ you don't lock an object, but a path. It doesn't matter where the mutex is placed as long as it is used consistently to lock all concurrent accesses. So my assumption is that there are paths which are not synchronized, leading to the described behavior.

sharwell commented 7 years ago

About the synchronization: in C++ you don't lock an object, but a path. It doesn't matter where the mutex is placed as long as it is used consistently to lock all concurrent accesses.

I don't think this is an accurate characterization. A mutex in C++ appears to work the same way as synchronized in Java. The only difference as it applies to us is that you need a mutex instance associated with each object you might want to lock (the language/runtime won't create it for you as Java does).

Also, as I mentioned in my earlier posts there are paths in the Java runtime which are intentionally not synchronized even though they use shared (and sometimes mutating) state. These paths rely on types and operations which are known to produce correct results even in the face of concurrent accesses.

mike-lischke commented 7 years ago

In the end they all use a mutex. What differs between languages is which mutex gets used. In Java you don't specify a mutex manually, but "lock an object", which is nothing more than saying: lock the execution path using the mutex of this object. In C++ we do that manually and hence have full control over which mutex to use. So we don't look at an object to sync execution, but take a mutex as gatekeeper. It doesn't really matter which mutex you use (one stored in each object or a global one), as long as all execution paths that must be synchronized use the same one.

For the C++ runtime I decided to have a central mutex for all ATN/DFA changes in the ATNSimulator class, instead of one for each DFA instance. For complex grammars we could otherwise end up with hundreds of mutexes - a huge waste of system resources. I believe the synchronization of access to the shared static ATN is correct with respect to what is obvious from the Java implementation (the synchronized calls). However, it could well be that explicit sync is missing where Java uses implicit sync. The question is: how do I find out where this applies? To me it looks as if I now have to check every access to see if it changes the shared state and add locks there. Better ideas?
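The central-mutex design described here could look roughly like this. A hedged sketch, not the actual ATNSimulator code: one static mutex guards all mutations of the shared ATN/DFA data, regardless of which simulator instance performs them:

```cpp
#include <mutex>
#include <vector>

// Hypothetical sketch of a single central mutex shared by all simulator
// instances, instead of one mutex per DFA.
struct ATNSimulator {
  static std::mutex atnMutex;            // one lock for every instance
  static std::vector<int> sharedStates;  // stand-in for shared ATN/DFA data

  void addState(int id) {
    std::lock_guard<std::mutex> lock(atnMutex);
    sharedStates.push_back(id);
  }
};

std::mutex ATNSimulator::atnMutex;
std::vector<int> ATNSimulator::sharedStates;
```

This keeps the mutex count constant, but updates to unrelated DFAs now serialize on the same lock.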

sharwell commented 7 years ago

It doesn't really matter which mutex you use (one stored in each object or a global one), as long as all execution paths that must be synchronized use this one. [...] For the C++ runtime I decided to have a central mutex for all ATN/DFA changes in the ATNSimulator class, [...]

The mutexes used have substantial impact on both correctness and performance. This is an area where you do not want to deviate from the semantics of the Java target. As a first step, you'll want to go through each use of synchronized in the Java target and ensure that you are using an equivalent mutex in the C++ target with no changes to lock granularity.

The question is: how to find out where this applies?

These are not marked in code. That said, it looks like there might not be as many of them - perhaps none - in the reference Java code as there are in my optimized fork. I can tell you that one case where the Java target has concurrent readers/writers is DFAState.edges. In Java, a fixed array of size N can be treated as N distinct variables that won't move, and thus don't need synchronization to modify position m while some other position n is being read. Any change from an array to a sparse map will potentially violate this condition in a target.

:warning: It's performance-critical that you not use a mutex for accessing DFAState.edges except in the case where it is actually being modified. The optimized fork uses a sparse array and my use of a lock to keep it safe was the cause of an early bug report for that target (the lock has since been removed there).
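In C++, the Java pattern of unsynchronized concurrent reads and writes to distinct slots of a fixed array is a data race unless the slots are atomic. A hedged sketch of a lock-free edge table (hypothetical; the real DFAState differs):

```cpp
#include <array>
#include <atomic>

struct DFAState { int id = 0; };  // simplified stand-in

// Hypothetical lock-free edge table: readers never take a lock, and a
// writer publishing edges[m] does not disturb a concurrent reader of
// edges[n], mirroring the Java fixed-array behavior described above.
struct EdgeTable {
  static constexpr int kMaxSymbols = 128;
  std::array<std::atomic<DFAState *>, kMaxSymbols> edges;

  EdgeTable() {
    for (auto &e : edges) e.store(nullptr, std::memory_order_relaxed);
  }

  DFAState *get(int symbol) const {
    return edges[symbol].load(std::memory_order_acquire);  // fast path
  }

  void set(int symbol, DFAState *target) {
    edges[symbol].store(target, std::memory_order_release);
  }
};
```

Any change from a plain array to a sparse map loses this property, which is why the map-based variants discussed in this thread need extra synchronization.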

mike-lischke commented 7 years ago

Thanks for all the advice, Sam. In fact I checked the synch calls already multiple times and believe they are correct. There is one case where the same mutex is locked twice (unnecessarily) and forces me to use recursive locks. But I think I can get rid of that.

About the switch from array to map: I see you have done the same in your branch, realizing that an array consumes a lot of memory, especially for large grammars. I had a test case where memory consumption dropped from ~600MB to ~400MB just from this switch. So I definitely don't want to go back, and would rather recommend that all targets make this switch. However, synchronizing access is of course trickier then.

You suggest not locking the edges list except when writing; however, that doesn't work out that way (at least not in C++). If you want to lock access, you have to lock both read and write access, otherwise you could read a field while it is being written (or, in the case of a map, search for an entry while the bucket list is being modified internally). If I were on C++17 I could use a shared_mutex (multiple-reader, single-writer lock), so maybe I'll implement this myself for the ANTLR4 C++ runtime.

Another thought that came up is whether I should give up on the shared mutex for the DFAStates and go back to per-instance mutexes for that class. The point here is that a shared lock blocks all access, even accesses that concern different DFAStates, which very likely affects performance. However, having hundreds of mutexes isn't fun either.

KvanTTT commented 7 years ago

The ANTLR C# tunnelvisionlabs runtime has a multithreading issue, at least for cache clearing: GetAllTokens and ClearDFA sometimes throw an NRE in multithreaded mode.

sharwell commented 7 years ago

@KvanTTT That message is a bit misleading, specifically because:

  1. Clearing the DFA cache is the only known race condition in that runtime.
  2. Clearing the DFA cache is a generally discouraged practice. The reason the issue isn't resolved yet is that it doesn't make sense to correct it in any way that could negatively affect the recommended use cases.

The C# target has been used for years in some scary-big installations (1000+ threads per machine) without finding other threading bugs.

sharwell commented 7 years ago

About the switch array -> map: I see in your branch you have done the same, realizing that an array is consuming a lot of memory, especially for large grammars.

It took a bit of work to get this right: https://github.com/sharwell/antlr4/compare/6274685f6ad9a2e4b53d794859a5dda70bfe84ca...5f70f34b6173ca667d19d9de914aab96836cec6b

Keep in mind that the performance tweaks at the end that removed locks from the fast path were essential for some of the scenarios described in my previous comment.

parrt commented 7 years ago

@mike-lischke note that "fixes issue #foo" doesn't seem to work; it must be "fixes #foo"

jrobsonchase commented 7 years ago

It appears that this is still a problem. I updated the antlr submodule to the 4.6 tag in my repo, but I'm still getting the double free/segfault/abort/etc. errors when running separate parsers concurrently.

mike-lischke commented 7 years ago

Weird, my syntax checkers now run nicely in their background threads without any trouble. Can you make a test case showing the problem? Best would be to open a new issue for that.

rossy62 commented 4 years ago

I'm having an issue building the ANTLR4 runtime in Visual Studio 2013.

Line 37 of StringUtils.h contains thread_local UTF32Converter converter; Apparently the "thread_local" syntax, being standard C++, is something Microsoft deems too complicated to parse, and clearly the __declspec(thread) syntax is much clearer since the word "local" is not present.

However, when I change the code to use __declspec(thread), the compiler complains that this syntax is only allowed on data of static extent.
I'm currently poking at this (aka hacking), and will likely remove the thread_local and/or add static to see if I can work around it. I don't plan on having multiple instances of the parser at the moment (that could change).

It's very possible that I have a local configuration/CMake issue, but I thought I would post my experience. It's been about two years since I worked with ANTLR4, so I'm diving back in, this time trying the C++ output generator.

Thanks for an otherwise awesome toolset. This is a minor issue.