Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Even though we split sentences, the parse still sometimes gets double sentences #134

keien closed this issue 10 years ago

keien commented 10 years ago

The common factor seems to be an "etc." followed by a period. Here are the anomalies:

[{'parsetree': '(ROOT (SINV (S (NP (PRP I)) (VP (VBP am) (VP (VBG looking) (ADVP (RB specifically)) (PP (IN for) (NP (JJ american) (NNS guys)))))) (, ,) (VP (VBP love)) (NP (NP (DT the) (NN accent)) (, ,) (NP (FW etc.))) (. .)))', 'text': 'I am looking specifically for american guys, love the accent, etc.', 'dependencies': [('nsubj', 'looking', '3', 'I', '1'), ('aux', 'looking', '3', 'am', '2'), ('ccomp', 'love', '9', 'looking', '3'), ('advmod', 'looking', '3', 'specifically', '4'), ('amod', 'guys', '7', 'american', '6'), ('prep_for', 'looking', '3', 'guys', '7'), ('root', 'ROOT', '0', 'love', '9'), ('det', 'accent', '11', 'the', '10'), ('nsubj', 'love', '9', 'accent', '11'), ('appos', 'accent', '11', 'etc.', '13')], 'words': [('I', {'CharacterOffsetEnd': '1', 'Lemma': 'I', 'PartOfSpeech': 'PRP', 'CharacterOffsetBegin': '0'}), ('am', {'CharacterOffsetEnd': '4', 'Lemma': 'be', 'PartOfSpeech': 'VBP', 'CharacterOffsetBegin': '2'}), ('looking', {'CharacterOffsetEnd': '12', 'Lemma': 'look', 'PartOfSpeech': 'VBG', 'CharacterOffsetBegin': '5'}), ('specifically', {'CharacterOffsetEnd': '25', 'Lemma': 'specifically', 'PartOfSpeech': 'RB', 'CharacterOffsetBegin': '13'}), ('for', {'CharacterOffsetEnd': '29', 'Lemma': 'for', 'PartOfSpeech': 'IN', 'CharacterOffsetBegin': '26'}), ('american', {'CharacterOffsetEnd': '38', 'Lemma': 'american', 'PartOfSpeech': 'JJ', 'CharacterOffsetBegin': '30'}), ('guys', {'CharacterOffsetEnd': '43', 'Lemma': 'guy', 'PartOfSpeech': 'NNS', 'CharacterOffsetBegin': '39'}), (',', {'CharacterOffsetEnd': '44', 'Lemma': ',', 'PartOfSpeech': ',', 'CharacterOffsetBegin': '43'}), ('love', {'CharacterOffsetEnd': '49', 'Lemma': 'love', 'PartOfSpeech': 'VBP', 'CharacterOffsetBegin': '45'}), ('the', {'CharacterOffsetEnd': '53', 'Lemma': 'the', 'PartOfSpeech': 'DT', 'CharacterOffsetBegin': '50'}), ('accent', {'CharacterOffsetEnd': '60', 'Lemma': 'accent', 'PartOfSpeech': 'NN', 'CharacterOffsetBegin': '54'}), (',', {'CharacterOffsetEnd': '61', 'Lemma': ',', 'PartOfSpeech': ',', 'CharacterOffsetBegin': '60'}), ('etc.', {'CharacterOffsetEnd': '65', 'Lemma': 'etc.', 'PartOfSpeech': 'FW', 'CharacterOffsetBegin': '62'}), ('.', {'CharacterOffsetEnd': '66', 'Lemma': '.', 'PartOfSpeech': '.', 'CharacterOffsetBegin': '65'})]}, {'parsetree': '(ROOT (NP (. .)))', 'text': '.', 'dependencies': [], 'words': [('.', {'CharacterOffsetEnd': '67', 'Lemma': '.', 'PartOfSpeech': '.', 'CharacterOffsetBegin': '66'})]}]
[{'parsetree': '(ROOT (NP (NP (NNP WE) (NNP COULD) (NNP DO) (NNP DADDY\\/DAUGHTER)) (, ,) (NP (NP (NNP STEP) (NNP DAUGHTER)) (, ,) (NP (NNP NEICE) (, ,) (NNP ETC.))) (. .)))', 'text': 'WE COULD DO DADDY/DAUGHTER, STEP DAUGHTER, NEICE, ETC.', 'dependencies': [('nn', 'DADDY\\/DAUGHTER', '4', 'WE', '1'), ('nn', 'DADDY\\/DAUGHTER', '4', 'COULD', '2'), ('nn', 'DADDY\\/DAUGHTER', '4', 'DO', '3'), ('root', 'ROOT', '0', 'DADDY\\/DAUGHTER', '4'), ('nn', 'DAUGHTER', '7', 'STEP', '6'), ('appos', 'DADDY\\/DAUGHTER', '4', 'DAUGHTER', '7'), ('nn', 'ETC.', '11', 'NEICE', '9'), ('appos', 'DAUGHTER', '7', 'ETC.', '11')], 'words': [('WE', {'CharacterOffsetEnd': '2', 'Lemma': 'WE', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '0'}), ('COULD', {'CharacterOffsetEnd': '8', 'Lemma': 'COULD', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '3'}), ('DO', {'CharacterOffsetEnd': '11', 'Lemma': 'DO', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '9'}), ('DADDY\\/DAUGHTER', {'CharacterOffsetEnd': '26', 'Lemma': 'DADDY\\/DAUGHTER', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '12'}), (',', {'CharacterOffsetEnd': '27', 'Lemma': ',', 'PartOfSpeech': ',', 'CharacterOffsetBegin': '26'}), ('STEP', {'CharacterOffsetEnd': '32', 'Lemma': 'STEP', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '28'}), ('DAUGHTER', {'CharacterOffsetEnd': '41', 'Lemma': 'DAUGHTER', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '33'}), (',', {'CharacterOffsetEnd': '42', 'Lemma': ',', 'PartOfSpeech': ',', 'CharacterOffsetBegin': '41'}), ('NEICE', {'CharacterOffsetEnd': '48', 'Lemma': 'NEICE', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '43'}), (',', {'CharacterOffsetEnd': '49', 'Lemma': ',', 'PartOfSpeech': ',', 'CharacterOffsetBegin': '48'}), ('ETC.', {'CharacterOffsetEnd': '53', 'Lemma': 'ETC.', 'PartOfSpeech': 'NNP', 'CharacterOffsetBegin': '50'}), ('.', {'CharacterOffsetEnd': '54', 'Lemma': '.', 'PartOfSpeech': '.', 'CharacterOffsetBegin': '53'})]}, {'parsetree': '(ROOT (NP (. .)))', 'text': '.', 'dependencies': [], 'words': [('.', {'CharacterOffsetEnd': '55', 'Lemma': '.', 'PartOfSpeech': '.', 'CharacterOffsetBegin': '54'})]}]
abendebury commented 10 years ago

What exactly do you mean?

keien commented 10 years ago

this error

It technically shouldn't happen because we split sentences beforehand, so it's probably NLTK and CoreNLP splitting sentences differently.
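
For illustration, here is a minimal sketch of how such a mismatch could be surfaced: run each pre-split sentence through the parser and flag any that come back as more than one sentence. The names `find_double_sentences`, `parse`, and `fake_parse` are hypothetical and not taken from the WordSeer code; `parse` stands in for whatever wrapper calls CoreNLP, and it is assumed to return a list of dicts shaped like the output above.

```python
def find_double_sentences(sentences, parse):
    """Return (sentence, parsed) pairs where the parser saw more than one sentence.

    `sentences` are strings that were already split upstream (e.g. by NLTK).
    `parse` is a hypothetical callable standing in for the CoreNLP wrapper;
    it takes one string and returns a list of dicts, one per sentence.
    """
    anomalies = []
    for sentence in sentences:
        parsed = parse(sentence)
        if len(parsed) > 1:
            anomalies.append((sentence, parsed))
    return anomalies


def fake_parse(text):
    """Stand-in parser for the demo: a doubled trailing period becomes its own sentence."""
    if text.endswith('..'):
        return [{'text': text[:-1]}, {'text': '.'}]
    return [{'text': text}]


if __name__ == '__main__':
    flagged = find_double_sentences(['NEICE, ETC..', 'Hello there.'], fake_parse)
    for sentence, parsed in flagged:
        print(sentence, '->', [p['text'] for p in parsed])
```

Passing the parser in as a callable keeps the check independent of how CoreNLP is actually invoked, which this thread does not show.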

abendebury commented 10 years ago

What input text does it happen with?

keien commented 10 years ago

First one is: "I am looking specifically for american guys, love the accent, etc..."

Second one is: "WE COULD DO DADDY/DAUGHTER, STEP DAUGHTER, NEICE, ETC.."

abendebury commented 10 years ago

Yes, NLTK treats each of those as a single sentence, whereas CoreNLP splits each into two. We could simply concatenate the sentences whenever CoreNLP returns several for one NLTK sentence, unless we would rather stick to CoreNLP's sentence division.
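
The concatenation idea could look roughly like the sketch below. It is only an illustration under stated assumptions, not necessarily what the project ended up doing: `merge_parsed_sentences` is a hypothetical helper, the input dicts are assumed to have the 'parsetree', 'text', 'dependencies', and 'words' keys seen in the output above, and the parse tree and dependencies of the first sentence are kept as-is because the extra sentences in the reported cases are a lone "." with no dependencies.

```python
def merge_parsed_sentences(original_text, parsed):
    """Collapse several parser sentences into one record.

    `original_text` is the sentence as split upstream (assumed to be NLTK's
    version), used as the merged 'text'. `parsed` is the list of dicts the
    parser returned for that sentence.
    """
    if not parsed:
        raise ValueError("expected at least one parsed sentence")

    first = parsed[0]
    merged = {
        # Assumption: the first sentence carries the meaningful parse.
        'parsetree': first['parsetree'],
        'dependencies': list(first['dependencies']),
        'text': original_text,
        # Copy the list so the parser's own output is not mutated.
        'words': list(first['words']),
    }
    # Append the tokens of every extra sentence so no word is lost.
    for extra in parsed[1:]:
        merged['words'].extend(extra['words'])
    return merged
```

For the second example above, this would yield one record whose 'words' list ends with the two "." tokens and whose 'text' is the original NLTK sentence. A merge that also has to combine real parse trees or renumber dependency indices would need more work than this sketch does.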

keien commented 10 years ago

Fixed in 1db753c691d856e6b01ed47c471b06208249db0b, but we should also add it to the error message screen in the UI.

abendebury commented 10 years ago

That's what line 71 is doing.

keien commented 10 years ago

Okay, then it'll be fine.