Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

CoreNLP error in corenlp package #99

Closed: keien closed this issue 10 years ago

keien commented 10 years ago

Error:

{'input': "O, then, I see Queen Mab hath been with you. She is the fairies' midwife, and she comes In shape no bigger than an agate-stone On the fore-finger of an alderman, Drawn with a team of little atomies Athwart men's noses as they lie asleep; Her wagon-spokes made of long spiders' legs, The cover of the wings of grasshoppers, The traces of the smallest spider's web, The collars of the moonshine's watery beams, Her whip of cricket's bone, the lash of film, Her wagoner a small grey-coated gnat, Not so big as a round little worm Prick'd from the lazy finger of a maid; Her chariot is an empty hazel-nut Made by the joiner squirrel or old grub, Time out o' mind the fairies' coachmakers. And in this state she gallops night by night Through lovers' brains, and then they dream of love; O'er courtiers' knees, that dream on court'sies straight, O'er lawyers' fingers, who straight dream on fees, O'er ladies ' lips, who straight on kisses dream, Which oft the angry Mab with blisters plagues, Because their breaths with sweetmeats tainted are: Sometime she gallops o'er a courtier's nose, And then dreams he of smelling out a suit; And sometime comes she with a tithe-pig's tail Tickling a parson's nose as a' lies asleep, Then dreams, he of another benefice: Sometime she driveth o'er a soldier's neck, And then dreams he of cutting foreign throats, Of breaches, ambuscadoes, Spanish blades, Of healths five-fathom deep; and then anon Drums in his ear, at which he starts and wakes, And being thus frighted swears a prayer or two And sleeps again. This is that very Mab That plats the manes of horses in the night, And bakes the elflocks in foul sluttish hairs, Which once untangled, much misfortune bodes: This is the hag, when maids lie on their backs, That presses them and learns them first to bear, Making them women of good carriage: This is she-- ", 'output': "O, then, I see Queen Mab hath been with you. 
She is the fairies' midwife, and she comes In shape no bigger than an agate-stone On the fore-finger of an alderman, Drawn with a team of little atomies Athwart men's noses as they lie asleep; Her wagon-spokes made of long spiders' legs, The cover of the wings of grasshoppers, The traces of the smallest spider's web, The collars of the moonshine's watery beams, Her whip of cricket's bone, the lash of film, Her wagoner a small grey-coated gnat, Not so big as a round little worm Prick'd from the lazy finger of a maid; Her chariot is an empty hazel-nut Made by the joiner squirrel or old grub, Time out o' mind the fairies' coachmakers. And in this state she gallops night by night Through lovers' brains, and then they dream of love; O'er courtiers' knees, that dream on court'sies straight, O'er lawyers' fingers, who straight dream on fees, O'er ladies ' lips, who straight on kisses dream, Which oft the angry Mab with blisters plagues, Because their breaths with sweetmeats tainted are: Sometime she gallops o'er a courtier's nose, And then dreams he of smelling out a suit; And sometime comes she with a tithe-pig's tail Tickling a parson's nose as a' lies asleep, Then dreams, he of another benefice: Sometime she driveth o'er a soldier's neck, And then dreams he of cutting foreign throats, Of breaches, ambuscadoes, Spanish blades, Of healths five-fathom deep; and then anon Drums in his ear, at which he starts and wakes, And being thus frighted swears a prayer or two And sleeps again. This is that very Mab That plats the manes of horses in the night, And bakes the elflocks in foul sluttish hairs, Which once untangled, much misfortune bodes: This is the hag, when maids lie on their backs, That presses them and learns them first to bear, Making them women of good carriage: This is she-- \r\n", 'error': 'CoreNLP terminates abnormally while parsing'}
'CoreNLP process terminates abnormally while parsing'
Traceback (most recent call last):
  File "run_pipeline.py", line 18, in <module>
    collection_processor.process(collection_dir, structure_file, extension, False)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 52, in process
    docstruc_filename, filename_extension)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 132, in extract_record_metadata
    filename))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 42, in extract
    units = self.extract_unit_information(self.document_structure, doc)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 103, in extract_unit_information
    True)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 133, in get_sentences_from_text
    return self.str_proc.tokenize(text)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/stringprocessor.py", line 35, in tokenize
    parsed_text = self.parser.raw_parse(txt)
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/corenlp/corenlp.py", line 443, in raw_parse
    raise e
corenlp.corenlp.ProcessError: 'CoreNLP process terminates abnormally while parsing'

keien commented 10 years ago

Email Aditi about this error.

abendebury commented 10 years ago

Email sent.

silverasm commented 10 years ago

Ah, probably one of the sentences is too long, and the parser runs out of memory. If I remember correctly, parsing is something like an N^3 memory operation, where N is the length of the sentence in words.

I also notice that you're parsing and sentence-tokenizing in the same pass.

What you could do is two separate passes over the input text. One to split into sentences, and the next to parse. Then, you can fail more gracefully on the sentences that are too long.
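A minimal sketch of the two-pass idea, under some loud assumptions: `split_sentences` is a naive regex stand-in for a real sentence tokenizer, `parse_in_two_passes` and `max_words` are hypothetical names, and `parse` is whatever callable wraps the parser (none of this is the corenlp package's actual API). The point is only that each sentence is parsed on its own, so one bad sentence is recorded and skipped instead of aborting the whole run.

```python
import re


def split_sentences(text):
    """Pass 1 (naive stand-in): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def parse_in_two_passes(text, parse, max_words=40):
    """Pass 2: parse each sentence separately and fail gracefully.

    `parse` is any callable that takes one sentence string (hypothetical
    wrapper around the real parser). Sentences longer than `max_words`,
    or sentences on which `parse` raises, are collected in `skipped`.
    """
    parsed, skipped = [], []
    for index, sentence in enumerate(split_sentences(text)):
        if len(sentence.split()) > max_words:
            skipped.append((index, sentence, 'too long'))
            continue
        try:
            parsed.append((index, parse(sentence)))
        except Exception as error:  # keep going if the parser dies on one sentence
            skipped.append((index, sentence, str(error)))
    return parsed, skipped
```

The caller gets back both the parsed sentences and a list of the ones that were skipped, so the over-long sentences can be logged or handled separately.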

In the old pipeline, that's how I used to do it. The Stanford parser in Java used to have a flag you could set that would make it fail gracefully for sentences like this. I used to set the maximum length to something like 40 words (which is pretty generous, as the average in English is something closer to 15 words). In this case, I would try the following (once you have split the text into sentences):

  1. Set a flag in the processing pipeline for the max number of words in sentences sent to the parser.
  2. If a sentence is longer, split it either
     a. by brute force, at the limit, or
     b. on reasonable punctuation marks, like commas, falling back to the brute-force limit if there aren't any.
  3. In the output, reconcile the indexes to be relative to the original sentences.

There is no great solution, though, because this misses long-range dependencies.
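The three steps above could be sketched roughly as follows. This is an illustration, not the old pipeline's code: `split_long_sentence`, `to_original_index`, `max_words`, and the choice of punctuation marks are all hypothetical, and the sentence is assumed to be already tokenized into whitespace-separated words.

```python
def split_long_sentence(words, max_words=40, punctuation=(',', ';', ':')):
    """Split a list of word tokens into chunks of at most `max_words`.

    Returns (start, chunk) pairs, where `start` is the chunk's offset in
    the original sentence, kept so parser output can be mapped back.
    """
    chunks = []
    start = 0
    while len(words) - start > max_words:
        end = start + max_words  # brute-force cut point (fallback)
        # Prefer the last word inside the window that ends with punctuation.
        for i in range(end - 1, start, -1):
            if words[i].endswith(punctuation):
                end = i + 1
                break
        chunks.append((start, words[start:end]))
        start = end
    chunks.append((start, words[start:]))
    return chunks


def to_original_index(chunk_start, index_in_chunk):
    """Reconcile a word index inside a chunk with the original sentence."""
    return chunk_start + index_in_chunk
```

Each chunk would then be sent to the parser on its own, and any word index the parser reports for a chunk is shifted by that chunk's `start` to recover its position in the original sentence.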

abendebury commented 10 years ago

> What you could do is two separate passes over the input text. One to split into sentences, and the next to parse. Then, you can fail more gracefully on the sentences that are too long.

I'm not sure that the Python package we're using to interface with the Java library would let us do that.

Suggestions one and two seem suitable, though... we'll check with Professor Hearst to see what she thinks.

keien commented 10 years ago

@silverasm how would we do two passes over the text where we split the sentence in one pass and parse in the other?

abendebury commented 10 years ago

Marking this as closed since we've found a solution.