CoreNLP Issue? - Githubissues

keien commented 10 years ago

This just happened and I'm not sure why. Did the recent corenlp update break something?

Traceback (most recent call last):
  File "runpreprocessor.py", line 27, in <module>
    cp_run(collection_dir, structure_file, extension, project)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/collectionprocessor.py", line 213, in cp_run
    False)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/collectionprocessor.py", line 61, in process
    docstruc_filename, filename_extension)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/collectionprocessor.py", line 135, in extract_record_metadata
    filename))
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 60, in extract
    doc)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 121, in extract_unit_information
    node))
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 130, in extract_unit_information
    node, True)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 211, in get_sentences
    sentences = self.get_sentences_from_text(sentence_text, tokenize)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/structureextractor.py", line 169, in get_sentences_from_text
    return self.str_proc.tokenize(text)
  File "/home/keien/dev/wordseer_flask/app/preprocessor/stringprocessor.py", line 39, in tokenize
    sentences.extend(tokenize_from_raw(sentence, sentence_text))
  File "/home/keien/dev/wordseer_flask/app/preprocessor/stringprocessor.py", line 336, in tokenize_from_raw
    part_of_speech = word_data[1]["PartOfSpeech"]
KeyError: 'PartOfSpeech'

abendebury commented 10 years ago

Does this happen for any input?

abendebury commented 10 years ago

And on which branch?

abendebury commented 10 years ago

Happens on this input: "Friends describe me as fun, loyal, excellent sense of humour with a hint of sarcasm, full of life and energy, up for great laughs and enjoy most things in life [usually sensibly LOL!]."

The ] breaks something.

abendebury commented 10 years ago

Hm, well, the problem is that the result from the Java library comes back like this:

[Text=My CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=PRP$ Lemma=my] [Text=name CharacterOffsetBegin=3 CharacterOffsetEnd=7 PartOfSpeech=NN Lemma=name] [Text=is CharacterOffsetBegin=8 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=be] [Text=Frank CharacterOffsetBegin=11 CharacterOffsetEnd=16 PartOfSpeech=NNP Lemma=Frank] [Text=[ CharacterOffsetBegin=17 CharacterOffsetEnd=18 PartOfSpeech=NNP Lemma=[] [Text=yes CharacterOffsetBegin=18 CharacterOffsetEnd=21 PartOfSpeech=RB Lemma=yes] [Text=it CharacterOffsetBegin=22 CharacterOffsetEnd=24 PartOfSpeech=PRP Lemma=it] [Text=is CharacterOffsetBegin=25 CharacterOffsetEnd=27 PartOfSpeech=VBZ Lemma=be] [Text=] CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=CD Lemma=]] [Text=. CharacterOffsetBegin=28 CharacterOffsetEnd=29 PartOfSpeech=. Lemma=.]

So I'm not sure what kind of regular expression can separate all the different [...] blocks even when they contain a ] character.
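One possible pattern, as a rough sketch rather than a tested fix: instead of splitting on the brackets, anchor on the field names, which always appear in the same order in the output above. This assumes every block has exactly these five fields and that no field value contains whitespace. `TOKEN_RE` and `parse_tokens` are hypothetical names, not anything in the current parser.

```python
import re

# Anchor on the fixed field names rather than on the surrounding brackets.
# Assumption: every block has these five fields in this order, and no value
# contains whitespace.
TOKEN_RE = re.compile(
    r"\[Text=(?P<Text>\S+) "
    r"CharacterOffsetBegin=(?P<CharacterOffsetBegin>\d+) "
    r"CharacterOffsetEnd=(?P<CharacterOffsetEnd>\d+) "
    r"PartOfSpeech=(?P<PartOfSpeech>\S+) "
    r"Lemma=(?P<Lemma>\S+)\]"
)


def parse_tokens(line):
    """Return one dict of field name -> value per [...] block in line."""
    return [match.groupdict() for match in TOKEN_RE.finditer(line)]


# Part of the output quoted above, including the bracket tokens.
sample = (
    "[Text=Frank CharacterOffsetBegin=11 CharacterOffsetEnd=16 "
    "PartOfSpeech=NNP Lemma=Frank] "
    "[Text=[ CharacterOffsetBegin=17 CharacterOffsetEnd=18 "
    "PartOfSpeech=NNP Lemma=[] "
    "[Text=yes CharacterOffsetBegin=18 CharacterOffsetEnd=21 "
    "PartOfSpeech=RB Lemma=yes] "
    "[Text=] CharacterOffsetBegin=27 CharacterOffsetEnd=28 "
    "PartOfSpeech=CD Lemma=]] "
    "[Text=. CharacterOffsetBegin=28 CharacterOffsetEnd=29 "
    "PartOfSpeech=. Lemma=.]"
)

tokens = parse_tokens(sample)
```

The final `\S+` backtracks by one character so that the closing `\]` of each block can still match, which is why `Lemma=]]` and `Lemma=[]` come out as `]` and `[`. It would break if a token ever contained a space.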

We might have to un-escape the words in the preprocessor.

keien commented 10 years ago

Yeah, I'm fine with that. Revert the parser, then add a check in tokenize_from_raw, where the words are read in, that translates the escaped brackets back to the right characters.
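Roughly what I have in mind, as a sketch only. It assumes the escaped words arrive as the Penn Treebank style tokens CoreNLP uses for brackets (-LRB-, -RRB-, -LSB-, -RSB-, -LCB-, -RCB-). `PTB_BRACKETS` and `unescape_word` are hypothetical names, and how this hooks into tokenize_from_raw depends on how the words are read there.

```python
# Penn Treebank style bracket escapes mapped back to the literal characters.
PTB_BRACKETS = {
    "-LRB-": "(",
    "-RRB-": ")",
    "-LSB-": "[",
    "-RSB-": "]",
    "-LCB-": "{",
    "-RCB-": "}",
}


def unescape_word(word):
    """Return the literal bracket for an escaped token, else the word as-is."""
    return PTB_BRACKETS.get(word, word)


# Hypothetical usage on a list of words as they are read in.
words = ["usually", "sensibly", "LOL", "!", "-RSB-", "."]
restored = [unescape_word(w) for w in words]
```

Only whole tokens are translated, so a word that merely contains one of these strings is left alone.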

keien commented 10 years ago

We can close this now, right?

Wordseer / wordseer

CoreNLP Issue? #155