brendano / stanford_corenlp_pywrapper

SockWrap: "unpack requires a string argument of length 8" #28

Open ayrtonmassey opened 9 years ago

ayrtonmassey commented 9 years ago

I've noticed this error a couple of times since switching to the new branch. Here's the server log of the exception:

INFO:__main__:Processing http://www.telegraph.co.uk/sport/rugbyunion/international/newzealand/10249335/Australia-29-New-Zealand-47-report.html...
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/whosaidwhat/whosaidwhat/analytics/__main__.py", line 67, in <module>
    jdoc = ss.parse_doc(text, raw=False)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 222, in parse_doc
    return self.send_command_and_parse_result(cmd, timeout, raw=raw)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 242, in send_command_and_parse_result
    data = self.send_command_and_get_string_result(cmd, timeout)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 275, in send_command_and_get_string_result
    size_info = struct.unpack('>Q', size_info_str)[0]
struct.error: unpack requires a string argument of length 8
INFO:CoreNLP_PyWrapper:Subprocess seems to be stopped, exit code -9
INFO:CoreNLP_PyWrapper:Subprocess seems to be stopped, exit code -9

I've taken a look at the code and I'm guessing the server isn't sending enough bytes in response? I'm not sure if that's caused by me throwing too much at it, or something to do with the new release.
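
For context, `struct.unpack('>Q', ...)` decodes an 8-byte big-endian unsigned integer, so it raises this error whenever the preceding read returns fewer than 8 bytes, for example because the other end went away mid-response. The following is a minimal sketch of that failure mode, not the wrapper's actual code: `read_exactly` and `read_size_header` are hypothetical helper names, and an in-memory stream stands in for the real connection.

```python
import io
import struct

HEADER_FORMAT = '>Q'  # big-endian unsigned 64-bit length prefix, as in the traceback
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 8 bytes


def read_exactly(stream, n):
    """Read exactly n bytes from stream, or raise EOFError on a short read.

    Hypothetical helper: it loops because a single read() may return
    fewer bytes than requested.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # an empty read means the stream has ended
            raise EOFError('expected %d bytes, got %d' % (n, n - remaining))
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


def read_size_header(stream):
    """Decode the 8-byte length prefix, failing with a clearer error."""
    return struct.unpack(HEADER_FORMAT, read_exactly(stream, HEADER_SIZE))[0]


# A complete header decodes normally.
full = io.BytesIO(struct.pack(HEADER_FORMAT, 1197))
size = read_size_header(full)

# A truncated header is reported as EOFError instead of the less
# informative struct.error.
truncated = io.BytesIO(b'\x00\x00\x00')
try:
    read_size_header(truncated)
    short_read_detected = False
except EOFError:
    short_read_detected = True
```

Raising a dedicated error on a short read would at least separate "the server stopped talking" from "the server sent a malformed header".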

brendano commented 9 years ago

something like that. i wonder if the java process is just crashing. could you save the data file being processed, and the script you're running, that reproduces the error?

ayrtonmassey commented 9 years ago

Unfortunately the script I'm running doesn't actually reproduce the error on the same input. The trimmed down version (which replicates the parts that use the wrapper) is as follows:

from stanford_corenlp_pywrapper import CoreNLP
filenames = ['input']

ss = CoreNLP(mode='coref', corenlp_jars=['/home/ayrton/corenlp/stanford-corenlp-full-2015-04-20/*'])

for filename in filenames:
    with open(filename) as f:
        text = f.read()
    print "Processing {filename}".format(filename=filename)
    jdoc = ss.parse_doc(text, raw=False)  # This is where it crashes

The input it crashed on is:

 It was the All Blacks' 15th win in their last 19 encounters against the Wallabies to maintain their dominance in the Bledisloe Cup, which they have held since 2003. .
 The optimism of a new start under McKenzie quickly died out, with dynamic winger Israel Folau hardly seeing the ball and debut fly-half Matt Toomua replaced by Quade Cooper on the hour. .
 The All Blacks hit the ground running and the Australians had to defend their line before Cruden put winger Ben Smith over in the right corner after James O'Connor left his wing to create the overlap in the third minute. .
 Christian Lealiifano reduced the margin with an eighth-minute penalty after Kieran Read was penalised for barging into Rob Simmons. .
 Kiwi skipper McCaw conceded a ruck penalty and Lealiifano kicked his second penalty to reduce the All Blacks' lead to one point. .
 McCaw redeemed himself when he won a ruck penalty for Cruden to kick the All Blacks to a 10-6 lead after 20 minutes. .
 Lealiifano kicked the Wallabies in front for the first time 12-10 with two penalties off ruck infringements. .
 But the All Blacks scored against the run of play when Cruden charged down Lealiifano's clearing kick to score and convert in the 29th minute for a 17-12 lead. .
 Worse was to come for the Wallabies minutes later when McCaw plunged over wide out for his ninth career Test try against Australia and push the All Blacks out to a 22-12 advantage. .
 Will Genia threw the Wallabies a lifeline minutes before the break with a sensational 60-metre try off a Michael Hooper break. Lealiifano converted and Australia trailed 22-19. .
 But Cruden kicked the All Blacks three points further ahead at halftime after the Wallabies were caught offside inside their own quarter. .
 The Wallabies began the second half on the front foot and forced a ruck penalty in front of the All Blacks post for Lealiifano's fifth straight penalty and trail 25-22. .
 But the All Blacks swiftly hit back for their bonus-point fourth try with centre Conrad Smith dashing over off Aaron Smith to score near the posts for a 32-22 lead after 51 minutes. .
 The Wallabies were pushed off the ball in a scrum and Steven Luatua took play close to the Australian line before they spread the ball out to right-winger Ben Smith for his second try as the rampant All Blacks swept to a 37-22 lead midway through the second half. .
 Cruden kicked his third penalty as New Zealand hit 40 points into the final quarter of the match. .
 But they weren't finished and Ben Smith became the first All Black to score a hat-trick of tries since Doug Howlett eight years ago with his third touchdown eight minutes from the end. .
 O'Connor scored Australia's consolation second try in the final minute. .

The full program takes its input from a redis queue and has processed at most ~600 documents of varying lengths before crashing with struct.error. I can dodge this and continue if I except struct.error but that's obviously not a long-term solution. I think if you want to reproduce the error you'll have to throw documents at it until it crashes, because it seems that it's caused by running the server for a long period of time rather than by specific input.

brendano commented 9 years ago

ok. state dependence is scary. if you restart the server on the struct.error (easy way is create a new CoreNLP object), does the doc process ok?
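
One way to run that experiment is sketched below, under stated assumptions: `parse_with_restart` and `make_parser` are hypothetical names, and `FlakyParser` is a stand-in for the real wrapper so the example runs without CoreNLP installed. In real use `make_parser` would be something like a lambda wrapping the `CoreNLP(...)` constructor call from the script above.

```python
import struct


def parse_with_restart(make_parser, parser, text, max_restarts=1):
    """Parse text; on struct.error, build a fresh parser and try again.

    make_parser: zero-argument callable returning a new parser object
                 (hypothetical; e.g. a lambda wrapping CoreNLP(...)).
    Returns (result, parser) so the caller keeps using the live parser.
    """
    for attempt in range(max_restarts + 1):
        try:
            return parser.parse_doc(text), parser
        except struct.error:
            if attempt == max_restarts:
                raise  # still failing after the allowed restarts
            parser = make_parser()  # a new object means a new server process


class FlakyParser(object):
    """Stand-in for the real wrapper: the first instance always fails."""
    instances = 0

    def __init__(self):
        FlakyParser.instances += 1
        self.healthy = FlakyParser.instances > 1

    def parse_doc(self, text):
        if not self.healthy:
            raise struct.error('unpack requires a string argument of length 8')
        return {'text': text}


first = FlakyParser()
result, live_parser = parse_with_restart(FlakyParser, first, 'Some document.')
```

If the same document parses on the fresh object, that points toward accumulated server state rather than the input itself.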

ayrtonmassey commented 9 years ago

Right now I'm doing this:

import struct
...
try:
    jdoc = ss.parse_doc(text,raw=False)
except struct.error:
    continue

This bypasses the struct.error, but I haven't left it running long enough to reproduce the bug.

As far as I know, this just causes me to lose the data from the document that caused the exception and subsequent documents process fine - but as I said, I'd need to leave it running and watch the logs to confirm that it's happened & the process has recovered.
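
Since skipping the document loses it, a variant that keeps the failing input around for later reproduction might look like the sketch below. The names `save_failed_document` and `parse_or_save` are hypothetical, the stub `always_fails` stands in for `ss.parse_doc`, and the text is assumed to be a unicode string.

```python
import hashlib
import io
import os
import struct
import tempfile


def save_failed_document(text, directory):
    """Write text to directory under a content-hash name; return the path."""
    digest = hashlib.sha1(text.encode('utf-8')).hexdigest()[:12]
    path = os.path.join(directory, 'failed_%s.txt' % digest)
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    return path


def parse_or_save(parse, text, directory):
    """Return parse(text), or None after saving text if struct.error is raised."""
    try:
        return parse(text)
    except struct.error:
        save_failed_document(text, directory)
        return None


def always_fails(text):
    # Stub that mimics the failure seen in the traceback.
    raise struct.error('unpack requires a string argument of length 8')


failed_dir = tempfile.mkdtemp()
outcome = parse_or_save(always_fails, u'Cruden kicked his third penalty.', failed_dir)
saved_files = os.listdir(failed_dir)
```

The saved files can then be replayed against a fresh server to check whether the failure depends on the input at all.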

brendano commented 9 years ago

ok. would be good to know if the python wrapper decides to restart the process after the bad one happens. another idea, if there's weird state dependence going on, you could try just reparsing it and maybe it will work. but maybe it won't. sigh.

also apart from the root cause we need the following improvements

  @SuppressWarnings("deprecation")
  public static boolean annotateTimed(Annotation ann) {
    // Don't time the initial classifier setup.
    nlp.get();

    Thread annotateThread = new Thread(() -> {
      nlp.get().annotate(ann);
    });
    annotateThread.start();
    // Spend at most 4 seconds annotating any document; after this it might as well be hanging!
    try {
      annotateThread.join(MaxAnnotationTime);
    } catch (InterruptedException e) {
    }
    if (!annotateThread.isAlive()) {
      // Excellent, it finished.
      return true;
    }

    // Yes, I know.
    annotateThread.stop();
    return false;
  }

ayrtonmassey commented 9 years ago

Left the process running overnight, and here's a relevant portion of the log:

2015-07-08 02:38:50 [__main__] INFO: Processing http://www.dailytelegraph.com.au/news/nsw/nsw-government-bans-smoking-in-national-parks-to-reduce-bushfire-risk/story-fnpn118l-1227123243509...
2015-07-08 02:39:00 [__main__] ERROR: unpack requires a string argument of length 8
Traceback (most recent call last):
  File "/home/whosaidwhat/whosaidwhat/analytics/__main__.py", line 75, in <module>
    jdoc = ss.parse_doc(text, raw=False)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 222, in parse_doc
    return self.send_command_and_parse_result(cmd, timeout, raw=raw)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 242, in send_command_and_parse_result
    data = self.send_command_and_get_string_result(cmd, timeout)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 275, in send_command_and_get_string_result
    size_info = struct.unpack('>Q', size_info_str)[0]
error: unpack requires a string argument of length 8
2015-07-08 02:39:00 [__main__] ERROR: Ignored struct.error, continuing...
2015-07-08 02:39:00 [__main__] INFO: Processing http://www.londonwelsh.org/whats-on/rugby-2/...
INFO:CoreNLP_PyWrapper:Subprocess seems to be stopped, exit code -9
INFO:CoreNLP_PyWrapper:mode given as 'coref' so setting annotators: tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref
INFO:CoreNLP_PyWrapper:Starting java subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -XX:ParallelGCThreads=1 -cp '/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/lib/*:/home/whosaidwhat/corenlp/stanford-corenlp-full-2015-04-20/*'      corenlp.SocketServer --outpipe /tmp/corenlp_pywrap_pipe_pypid=11738_time=1436279677.29  --configdict '{"annotators": "tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref"}'
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [3.5 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [8.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [4.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [7.3 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator entitymentions
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [3.5 sec].
Adding annotator dcoref
INFO:CoreNLP_JavaServer: CoreNLP pipeline initialized.
INFO:CoreNLP_JavaServer: Waiting for commands on stdin
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 1197 characters, 260 tokens, 1197.0 char/doc, 260.0 tok/doc RATES: 0.110 doc/sec, 28.7 tok/sec

2015-07-08 02:39:50 [__main__] INFO: Processing http://www.dailypost.co.uk/sport/...
INFO:CoreNLP_JavaServer: INPUT: 2 documents, 1664 characters, 355 tokens, 832.0 char/doc, 177.5 tok/doc RATES: 0.187 doc/sec, 33.2 tok/sec

What you're seeing there is that it tries to process a document, fails due to struct.error, then the CoreNLP server restarts and the process continues parsing documents.

I haven't made any changes to the wrapper, so that happens by default.
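
A note on the `exit code -9` lines: on POSIX, Python's `subprocess` module reports a negative return code `-N` when the child was terminated by signal `N`, so `-9` means the Java process received SIGKILL. The log doesn't show what sent the signal; memory pressure or a timeout-triggered kill are possibilities, not confirmed causes. A small POSIX-only sketch of how that exit code arises, with a sleeping Python child standing in for the Java server:

```python
import signal
import subprocess
import sys

# Start a child that would sleep for a minute, standing in for the Java server.
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])

# Kill it the way an external killer would, with SIGKILL.
child.send_signal(signal.SIGKILL)
child.wait()

# On POSIX, a negative returncode -N means "terminated by signal N".
exit_code = child.returncode
killed_by = -exit_code if exit_code < 0 else None
```

Here `exit_code` ends up negative and `killed_by` equals `signal.SIGKILL`, matching what the wrapper logged.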