Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

CoreNLP error after combine sentence integration #86

Closed keien closed 10 years ago

keien commented 10 years ago

I just merged in the sentence combining into my branch and I'm getting this error:

Extracting document text and metadata
        {'input': "Even or odd, of all days in the year, Come Lammas-eve at night shall she be fourteen. Susan and she--God rest all Christian souls!-- Were of an age: well, Susan is with God; She was too good for me: but, as I said, On Lammas-eve at night shall she be fourteen; That shall she, marry; I remember it well. 'Tis since the earthquake now eleven years; And she was wean'd,--I never shall forget it,-- Of all the days of the year, upon that day: For I had then laid wormwood to my dug, Sitting in the sun under the dove-house wall; My lord and you were then at Mantua:-- Nay, I do bear a brain:--but, as I said, When it did taste the wormwood on the nipple Of my dug and felt it bitter, pretty fool, To see it tetchy and fall out with the dug! Shake quoth the dove-house: 'twas no need, I trow, To bid me trudge: And since that time it is eleven years; For then she could stand alone; nay, by the rood, She could have run and waddled all about; For even the day before, she broke her brow: And then my husband--God be with his soul! A' was a merry man--took up the child: 'Yea,' quoth he, 'dost thou fall upon thy face? Thou wilt fall backward when thou hast more wit; Wilt thou not, Jule?' and, by my holidame, The pretty wretch left crying and said 'Ay.' To see, now, how a jest shall come about! I warrant, an I should live a thousand years, I never should forget it: 'Wilt thou not, Jule?' quoth he; And, pretty fool, it stinted and said 'Ay.' ", 'output': "Even or odd, of all days in the year, Come Lammas-eve at night shall she be fourteen. Susan and she--God rest all Christian souls!-- Were of an age: well, Susan is with God; She was too good for me: but, as I said, On Lammas-eve at night shall she be fourteen; That shall she, marry; I remember it well. 
'Tis since the earthquake now eleven years; And she was wean'd,--I never shall forget it,-- Of all the days of the year, upon that day: For I had then laid wormwood to my dug, Sitting in the sun under the dove-house wall; My lord and you were then at Mantua:-- Nay, I do bear a brain:--but, as I said, When it did taste the wormwood on the nipple Of my dug and felt it bitter, pretty fool, To see it tetchy and fall out with the dug! Shake quoth the dove-house: 'twas no need, I trow, To bid me trudge: And since that time it is eleven years; For then she could stand alone; nay, by the rood, She could have run and waddled all about; For even the day before, she broke her brow: And then my husband--God be with his soul! A' was a merry man--took up the child: 'Yea,' quoth he, 'dost thou fall upon thy face? Thou wilt fall backward when thou hast more wit; Wilt thou not, Jule?' and, by my holidame, The pretty wretch left crying and said 'Ay.' To see, now, how a jest shall come about! I warrant, an I should live a thousand years, I never should forget it: 'Wilt thou not, Jule?' quoth he; And, pretty fool, it stinted and said 'Ay.' \r\n", 'error': 'CoreNLP terminates abnormally while parsing'}
'CoreNLP process terminates abnormally while parsing'
Traceback (most recent call last):
  File "run_pipeline.py", line 18, in <module>
    collection_processor.process(collection_dir, structure_file, extension, False)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 52, in process
    docstruc_filename, filename_extension)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/collectionprocessor.py", line 132, in extract_record_metadata
    filename))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 42, in extract
    units = self.extract_unit_information(self.document_structure, doc)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 85, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 103, in extract_unit_information
    True)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/structureextractor.py", line 133, in get_sentences_from_text
    return self.str_proc.tokenize(text)
  File "/home/keien/dev/wordseer_flask/lib/wordseerbackend/wordseerbackend/stringprocessor.py", line 35, in tokenize
    parsed_text = self.parser.raw_parse(txt)
  File "/home/keien/dev/wordseer_flask/venv/local/lib/python2.7/site-packages/corenlp-1.0-py2.7.egg/corenlp/corenlp.py", line 443, in raw_parse
    raise e
corenlp.corenlp.ProcessError: 'CoreNLP process terminates abnormally while parsing'

abendebury commented 10 years ago

what command gives you this error?

keien commented 10 years ago

run_pipeline on r_and_j

keien commented 10 years ago

Do you know if CoreNLP has a max sentence length or something? Or maybe it's unhappy with the newline characters at the end?

abendebury commented 10 years ago

I don't think it has a max sentence length... I'm going to take a look at this right now.

keien commented 10 years ago

hang on I think I need to push my branch up first

keien commented 10 years ago

Never mind, it's fine; the branch is removing-readerwriter, by the way.

keien commented 10 years ago

There are also a lot of unit tests not passing, some of which don't make sense to me because the branch has already been merged from master.

abendebury commented 10 years ago

Oh, heh, I see what's happening here.

The original code had a limitation in place for length of a sentence, probably to conserve resources.

https://github.com/Wordseer/wordseer_flask/blob/master/lib/wordseerbackend/wordseerbackend/stringprocessor.py#L48

The sentence it chokes on is this:

you men, you beasts, That quench the fire of your pernicious rage With purple fountains issuing from your veins, On pain of torture, from those bloody hands Throw your mistemper'd weapons to the ground, And hear the sentence of your moved prince.

Found on line 388.

Why would the unit tests be passing? They were failing when you branched this branch from handling-duplicates. For the most part the fixes shouldn't be too bad.

abendebury commented 10 years ago

We could increase the limit or remove it entirely and hope that nobody feeds in a sentence so long that it breaks something.

keien commented 10 years ago

Most of them are actually just these:

DetachedInstanceError: Instance <User at 0x7f0a1b601710> is not bound to a Session; attribute refresh operation cannot proceed

I don't even know what this means.

We should remove the limit but add a memory check or something so that if a sentence is too long and can't be processed, we throw an error and let the user know.
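
A rough sketch of the "throw an error and let the user know" part, written for Python 3. The names `SentenceParseError` and `parse_or_report` are hypothetical and not part of the WordSeer code; `parse` stands in for whatever callable actually invokes the parser. It does not measure memory, it only makes sure a failure names the sentence that caused it.

```python
class SentenceParseError(Exception):
    """Hypothetical error type that carries the sentence that failed."""

    def __init__(self, sentence, cause):
        self.sentence = sentence
        self.cause = cause
        # Show only the start of long sentences so the message stays readable.
        preview = sentence[:60] + ("..." if len(sentence) > 60 else "")
        super().__init__(
            "Could not parse a %d-character sentence (%r): %s"
            % (len(sentence), preview, cause)
        )


def parse_or_report(parse, sentence):
    """Call parse(sentence); re-raise any failure with the sentence attached.

    `parse` is an assumption: any callable taking the sentence text.
    """
    try:
        return parse(sentence)
    except Exception as exc:  # real code would catch the parser's own error type
        raise SentenceParseError(sentence, exc) from exc
```

The caller can then catch `SentenceParseError` at the top of the pipeline and print its message instead of a bare traceback.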

keien commented 10 years ago

This would be much easier if I understood how mock works...do you have something I can refer to to figure it out?

abendebury commented 10 years ago

Those crop up when another unit test is breaking; it's a sort of domino effect. It's best to ignore those and fix the other problems, and they'll fix themselves.

Well, the limit is pretty much a memory check; perhaps instead of just failing the whole pipeline it could just go to the next sentence.
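
Something like the following is what "go to the next sentence" could look like. This is only a sketch: `parse_all` is a hypothetical helper, `parse` stands in for the real parser call, and catching a bare `Exception` is an assumption (the real code would presumably catch only the parser's `ProcessError`).

```python
import logging

logger = logging.getLogger(__name__)


def parse_all(parse, sentences):
    """Parse each sentence, skipping (and logging) the ones that fail.

    Returns (results, skipped): the parsed outputs in input order, and
    the sentences that could not be parsed.
    """
    results, skipped = [], []
    for sentence in sentences:
        try:
            results.append(parse(sentence))
        except Exception as exc:  # assumption: narrow this to the parser's error
            logger.warning(
                "Skipping sentence (%d chars): %s", len(sentence), exc
            )
            skipped.append(sentence)
    return results, skipped
```

Returning the skipped sentences lets the pipeline report them at the end rather than dropping them silently.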

Here's some documentation for mock.

http://www.voidspace.org.uk/python/mock/
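
As a quick illustration of what mock does, here is a minimal sketch using `unittest.mock`, the standard-library version of the same library in Python 3. The `StringProcessor` below is a cut-down stand-in, not the real class, and the shape of the value returned by `raw_parse` is an assumption made for the example.

```python
from unittest import mock


class StringProcessor(object):
    """Stand-in for the real class: only what the example needs."""

    def __init__(self, parser):
        self.parser = parser

    def tokenize(self, txt):
        # Assumed result shape: a dict with a "sentences" list.
        return self.parser.raw_parse(txt)["sentences"]


# Replace the parser with a Mock so no CoreNLP process is started.
fake_parser = mock.Mock()
fake_parser.raw_parse.return_value = {"sentences": ["one", "two"]}

processor = StringProcessor(fake_parser)
result = processor.tokenize("one. two.")

# The mock records how it was called, so a test can check both the
# return value and the interaction with the parser.
assert result == ["one", "two"]
fake_parser.raw_parse.assert_called_once_with("one. two.")

# side_effect makes the next call raise, to exercise error handling.
fake_parser.raw_parse.side_effect = RuntimeError("parser died")
```

The point is that the test controls what the dependency returns or raises, so the code under test can be checked without running the dependency itself.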

keien commented 10 years ago

The problem is that because we intend this to be run locally, we can't predict everyone's memory limits. This is also something the duplication handler can potentially have an issue with; the dictionary used for look-ups refreshes every 50 sentences, but it could break if there's not enough memory to store all of that.

keien commented 10 years ago

Also, maybe we should update the tests (and method names) because they don't all do the same thing as when the tests were written. For example, the return value of StringProcessor.parse doesn't really matter because we write everything to the database on the fly now; same with StringProcessor.tokenize_from_raw, though for that one it's probably convenient to return the list of sentences.

abendebury commented 10 years ago

How would your solution address not knowing everyone's memory limits?

I think you're right that parse no longer needs to return anything; however, tokenize_from_raw definitely does need to return things. It and tokenize are used in lots of places that need to get sentences and process them rather than writing to the database straightaway.

keien commented 10 years ago

I'm not sure, really. I'd have to look into how Python handles memory issues. Is there a temporary solution we can put in place for now?

abendebury commented 10 years ago

Yes, for now let's just remove the check since we won't be handling any unreasonable data. I'll do that and make an issue to make a better check.

keien commented 10 years ago

kk sounds good

abendebury commented 10 years ago

#89

I'll go ahead and close this issue.

abendebury commented 10 years ago

"Fix" has been pushed to removing-readerwriter.