BLLIP / bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser) See http://pypi.python.org/pypi/bllipparser/ for Python module.
http://bllip.cs.brown.edu/
227 stars 53 forks source link

parseIt is not threadsafe (was: Probabilities printed out of order for multi-threaded LM parser) #16

Open syllog1sm opened 10 years ago

syllog1sm commented 10 years ago

in first-stage/PARSE/parseIt.C , the probabilities are written out directly, instead of being pushed to the print stack. This means that they appear out of order in multi-threaded LM parsing.

dmcc commented 10 years ago

Thanks for the report! That's certainly an issue, but I'm afraid parseIt may have much larger issues.

At least in parsing mode, I don't think that parseIt is actually thread safe -- I've gotten different outputs (or crashes) when using multithreaded vs. single threaded so I don't recommend it. This is a long time overdue, so I'm finally updating the README and help menu to discourage people from using multiple threads with parseIt for now (b05059009b0ef5d578a1e7a424056f8307d56f6e)

Of course, if anyone wants to fix this printing issue or the larger multithreading problems, I wouldn't object :)

KantanLabs commented 7 years ago

Dear dmcc,

I don't know if there is any progress on the bllip parser since 2013, but currently I am using it for a large project -- where a lot of sentences are needed to be processed in parallel -- and I was wondering if the multi-threading issue is resolved.

Have there been some solutions to allow the bllip-parser to run multi-threaded and being thread-safe?

Thanks, Dimitar.

dmcc commented 7 years ago

Unfortunately, I'm not aware of any progress on BLLIP in this area. In the past, I've run multiple instances of the parser as a workaround. I still do not recommend using the -t option.

KantanLabs commented 7 years ago

Hello dmcc,

Thank you for the quick response. That's pity. I have built a server/wrapper around the parser and it works great on a single core. However, it is rather slow when comparing to , e.g., Stanford which runs multiple threads. On a single thread the BLLIP is actually faster and give a better accuracy.

I am willing to take a look at the code. Do you have any suggestions what needs to be fixed?

Cheers, Dimitar.

dmcc commented 7 years ago

Hi Dimitar,

That would be great if you're willing to take a look at it. It's been a while so my memory is pretty fuzzy on this, but I believe the issue is a race condition in the first stage parser which we never really narrowed down. If you repeatedly parse a small set of long enough sentences with -t > 1, you should be able to see some cases where you get different parses for the same sentence. Most of the code uses arrays to separate structures used in different threads (https://github.com/BLLIP/bllip-parser/search?utf8=%E2%9C%93&q=MAXNUMTHREADS&type=), but my guess is that there's some global structure that should have got this treatment, but didn't.

Good luck! David

KantanLabs commented 7 years ago

Hi David,

So, I did some look into the multithreading today. Something is definitely fishy. So what I did is just run some tests and some diagnostics. Here is a summary. I will continue digging and will keep posting.

Cheers, Dimitar.

dmcc commented 7 years ago

Thanks for the update -- please keep them coming! Yeah, that sounds (vaguely) familiar. I think I ran helgrind and it found issues, but I forget how much deeper I dug than that.

KantanLabs commented 7 years ago

OK, So, two things.

First, in the parseIt/mainLoop there is a print loop:

    while(!printStack.empty())
    {
        sleep(SLEEPTIME);
        workOnPrintStack(&printStack);
    }

which is outside of the reading/parsing loop. This loop and the sleep delay the printing process, even if the parse is finished. To be honest, I don't get why is this loop there since there is a print workOnPrintStack(&printStack);call inside the main loop (see line 289 in parseIt.C).

Second, helgrind detects a racing condition somewhere in MeChart::bestParse. It complains about complains about functions from fnSubFns,C which make access to theFullHist*. I assume that the issue arrises there - reading and writing something to FullHist in an uncontrolled way.

I haven't looked more in details today. If I have time next week I will dig more into it.

Cheers, Dimitar.

KantanLabs commented 7 years ago

Hi David,

I did some look around and to be honest didn't figure out what exactly causes helgrade to complain. I modified some things in the parse method.

First, switched the global sentence count from int to unsigned long long int. If I am not wrong, for more than 32,767 sentences there would be errors.

Second, the printing function I took out of the main loop - this way they should work on parallel (added one more semaphore). If you want, I will send the code for review.

But I wanted to ask - how do I set the option for multithreading (the -t option) with the python wrapper?

Thanks, Dimitar.

dmcc commented 7 years ago

switched the global sentence count from int to unsigned long long int

Yup, this makes sense. Please feel free to submit this as a separate patch.

how do I set the option for multithreading (the -t option) with the python wrapper

There used to be an option to pass a thread ID to the first-stage parser in the Python wrapper, but I removed it ( https://github.com/BLLIP/bllip-parser/commit/1858d751fc969182946252f808a559d4c1414e88) since it didn't seem like threads were going to get fixed. The main pieces were "ThreadManager" which let you sign out a ThreadSlot (thread ID) for exclusive use and a way to pass ThreadSlots to parse ( https://github.com/BLLIP/bllip-parser/commit/1858d751fc969182946252f808a559d4c1414e88#diff-491cfc4441b3f94d768196cd1252263cL201 ).

From your previous message:

Second, helgrind detects a racing condition somewhere in MeChart::bestParse. It complains about complains about functions from fnSubFns,C which make access to the FullHist*. I assume that the issue arrises there - reading and writing something to FullHist in an uncontrolled way.

This could definitely be an issue and explain why parsing was non-deterministic with multiple threads -- would be interested to hear if you were able to dig any deeper.

Hope this helps!

On Tue, Jun 13, 2017 at 7:51 AM, KantanLabs notifications@github.com wrote:

Hi David,

I did some look around and to be honest didn't figure out what exactly causes helgrade to complain. I modified some things in the parse method.

First, switched the global sentence count from int to unsigned long long int. If I am not wrong, for more than 32,767 sentences there would be errors.

Second, the printing function I took out of the main loop - this way they should work on parallel (added one more semaphore). If you want, I will send the code for review.

But I wanted to ask - how do I set the option for multithreading (the -t option) with the python wrapper?

Thanks, Dimitar.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BLLIP/bllip-parser/issues/16#issuecomment-308091846, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm5ZXatlUQ_XLMZ7b_3s3TExK_N329bks5sDnergaJpZM4BOS-a .

KantanLabs commented 7 years ago

Hello David,

Sorry for my late reply - being busy with lots of stuff.

Once I have a bit of free time I will submit the change for the long long int.

Regarding the race condition, it is possible that the history (FullHist) is not thread safe... but not yet verified.

I have, however, two questions:

  1. If I submit more code changes, will you review it? (maybe that sounds like a silly question, but to be honest I haven't worked on a github project like this before.)
  2. The whole idea for fixing the multithreading arose from the fact that I have a python rest-server which I want to connect to the parser and have it running as fast and as distributed as possible. This, however, will not be solved with having the parser multithreaded, because the python wrapper by itself parses the sentences one-by-one (sending and waiting for a response). So, having a python rest server will not help. So, my second question - can parseIt.C be extended with REST capabilities? to listen to a port, receive a request and then send the parse back (in, e.g., json format). I have been looking into pistache for such a solution, but still need to work on it. If you have any idea/suggestion/comment on this, that would be great.

Thanks, Kind regards, Dimitar.

dmcc commented 7 years ago

Hi Dimitar,

Sure, I'd be happy to review changes.

If you're planning to make a new REST server frontend, I'd recommend making it a separate file from parseIt.C (maybe parseREST.C?). Ideally, it would only need to call things in first-stage/PARSE/SimpleAPI.h though you will need to extend the parse() calls to use thread IDs again.

Hope this helps, David

On Tue, Jun 20, 2017 at 4:49 PM, KantanLabs notifications@github.com wrote:

Hello David,

Sorry for my late reply - being busy with lots of stuff.

Once I have a bit of free time I will submit the change for the long long int.

Regarding the race condition, it is possible that the history (FullHist) is not thread safe... but not yet verified.

I have, however, two questions:

  1. If I submit more code changes, will you review it? (maybe that sounds like a silly question, but to be honest I haven't worked on a github project like this before.)
  2. The whole idea for fixing the multithreading arose from the fact that I have a python rest-server which I want to connect to the parser and have it running as fast and as distributed as possible. This, however, will not be solved with having the parser multithreaded, because the python wrapper by itself parses the sentences one-by-one (sending and waiting for a response). So, having a python rest server will not help. So, my second question - can parseIt.C be extended with REST capabilities? to listen to a port, receive a request and then send the parse back (in, e.g., json format). I have been looking into pistache for such a solution, but still need to work on it. If you have any idea/suggestion/comment on this, that would be great.

Thanks, Kind regards, Dimitar.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BLLIP/bllip-parser/issues/16#issuecomment-309887110, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm5ZRdWED0fjdNQehnM7lu2levb3PfAks5sGDBFgaJpZM4BOS-a .

KantanLabs commented 7 years ago

Hi David,

After long search through the BLLIP parser code I am sort of giving up. I reached to a point where I think (one) problem comes from the "edgeSubFns". There is this array of functions that is accessed by each thread. I tried to work on this, but couldn't make it work.

Anyway, I tried a lot of other things (following gdb and valgrind's info) but coudln't make it work properly for multiple threads.

I would like to share my last version and maybe you or somebody can take over. Maybe I am on the right track and just missing something small (or maybe I am shoooting tooo far away).

Cheers, Dimitar.

dmcc commented 7 years ago

Hi Dimitar,

Thanks for the report and your work on this problem. It is definitely a challenging codebase. If you have time, it would be awesome if you checked in your changes to your fork of the parser for others who might want to take this on (I'm afraid I don't have the bandwidth at this point).

Take care, David

On Wed, Aug 2, 2017 at 3:13 PM, KantanLabs notifications@github.com wrote:

Hi David,

After long search through the BLLIP parser code I am sort of giving up. I reached to a point where I think (one) problem comes from the "edgeSubFns". There is this array of functions that is accessed by each thread. I tried to work on this, but couldn't make it work.

Anyway, I tried a lot of other things (following gdb and valgrind's info) but coudln't make it work properly for multiple threads.

I would like to share my last version and maybe you or somebody can take over. Maybe I am on the right track and just missing something small (or maybe I am shoooting tooo far away).

Cheers, Dimitar.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/BLLIP/bllip-parser/issues/16#issuecomment-319813693, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm5ZbFPT5sCTnwKZ3-bmB6AEYYz6_M5ks5sUPRzgaJpZM4BOS-a .