Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Issues with sequence lemmatization #153

Closed: keien closed this issue 10 years ago

keien commented 10 years ago

We don't seem to properly lemmatize all the words in the sequences.

For example, take the sentence, "If you're also looking for a serious relationship and would like to meet a real character, you won't be dissappointed meeting me."

My accuracy checker shows that the SQL dump contains some sequences that we don't have, including:

...while we have some sequences that the SQL dump doesn't, including:

...which seems to result from the words "dissappointed" and "meeting" not being lemmatized. When I looked up the sequences directly, I saw that it had these sequences:

I'm not sure what's going on with some of the sequences that skip words, but for sequences like "be dissappointed meeting I", it's clear that the word "me" got lemmatized, while the others didn't.
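For readers following along, the comparison described above amounts to a two-way set difference between the sequences in the SQL dump and the sequences in the new database. The following is a minimal sketch of that idea, not the actual accuracy checker: the function name `diff_sequences` and the sample strings are hypothetical and not taken from either database.

```python
def diff_sequences(dump_sequences, our_sequences):
    """Return (missing_from_ours, missing_from_dump) as sorted lists.

    Both arguments are iterables of sequence strings. This helper is
    hypothetical and only illustrates the two-way comparison.
    """
    dump_set = set(dump_sequences)
    our_set = set(our_sequences)
    # Sequences the dump has that we lack, and vice versa.
    missing_from_ours = sorted(dump_set - our_set)
    missing_from_dump = sorted(our_set - dump_set)
    return missing_from_ours, missing_from_dump


# Made-up data: one side lemmatized "running", the other did not.
dump = ["run fast", "fast car"]
ours = ["running fast", "fast car"]
missing_from_ours, missing_from_dump = diff_sequences(dump, ours)
print(missing_from_ours)
print(missing_from_dump)
```

A word that is lemmatized on one side only shows up as a pair of mismatches, one in each list, which matches the pattern reported above.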

abendebury commented 10 years ago

> which seems to result from the words "dissappointed" and "meeting" not being lemmatized. When I looked up the sequences directly, I saw that it had these sequences:

Try the online demo with the sentence you wrote. As you can see, "meeting" is not lemmatized (I guess it's the weird grammar?) and neither is "dissappointed" (it's misspelled).

The obvious question is how Aditi lemmatized those words. I have no idea.

> I'm not sure what's going on with some of the sequences that skip words

Some sequences are stripped of stop words, could that explain it?
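To illustrate how stop-word stripping could produce sequences that appear to skip words, here is a small sketch. It assumes sequences are plain n-grams built once from the full word list and once from the word list with stop words removed; the `STOP_WORDS` set and both helper functions are hypothetical and are not WordSeer's actual implementation.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "the", "for", "and", "to", "be"}


def ngrams(words, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def sequences_with_and_without_stops(words, n):
    """Build n-grams from the full word list and from the list
    with stop words filtered out."""
    content_words = [w for w in words if w.lower() not in STOP_WORDS]
    return ngrams(words, n), ngrams(content_words, n)


words = ["meet", "a", "real", "character"]
full, stripped = sequences_with_and_without_stops(words, 2)
print(full)      # ['meet a', 'a real', 'real character']
print(stripped)  # ['meet real', 'real character']
```

In the stripped list, "meet real" looks like it skips a word relative to the original sentence, even though it is contiguous in the filtered word list.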

keien commented 10 years ago

Well, the sequences with the skipped words occur in both the old and new databases, so they're not a problem.

It's possible that Aditi's SQL dump is inconsistent. In one run, a sentence containing the word "theater" was an exact match in both databases, yet the word accuracy check flagged it because one database contained the word "theater" and the other contained the word "theatre". It made no sense to me.
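A per-sentence word check of the kind described here can be sketched with `collections.Counter`, which keeps repeated words distinct. This is only an assumption about how such a check might work; `word_mismatches` is a hypothetical name, the word lists are invented, and which database held which spelling is arbitrary in the example.

```python
from collections import Counter


def word_mismatches(dump_words, our_words):
    """Return the words (with counts) found only on each side.

    Counter subtraction drops zero and negative counts, so each
    result holds just the surplus words for that side.
    """
    dump_counts = Counter(dump_words)
    our_counts = Counter(our_words)
    only_in_dump = dict(dump_counts - our_counts)
    only_in_ours = dict(our_counts - dump_counts)
    return only_in_dump, only_in_ours


# Invented word lists for a sentence whose text matches in both databases.
only_dump, only_ours = word_mismatches(
    ["the", "theatre", "be", "open"],
    ["the", "theater", "be", "open"],
)
print(only_dump)
print(only_ours)
```

A spelling variant in the stored words surfaces as one surplus word on each side, even when the sentence text itself is identical.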

abendebury commented 10 years ago

Very strange. Is our code consistent?

keien commented 10 years ago

I have no reason to believe that our code occasionally changes words to British English.

Our code really just moves the output of the parser into the database while dealing with errors that pop up along the way. We don't process or modify the output strings, so it's more likely that inconsistent parsing comes from the parser itself, which has been the case for the inconsistencies we've dealt with so far.

abendebury commented 10 years ago

Heh. You're right about that, but the Python interface probably doesn't Britishize words, and Aditi's code used the same Java library as we do, so it's a mystery why "dissappointed" was lemmatized for her but nowhere else.

keien commented 10 years ago

Well, I'm going to conclude that the issue is with Aditi's SQL dump, not with our code. I don't believe that our sequences are incorrect, but just so we're sure, you should double-check get_sequence to make sure that the sequence creation process has no issues. If you think it's fine, go ahead and close this.

abendebury commented 10 years ago

It looks the same as the original Java code, so it should be fine.