Closed keien closed 10 years ago
Try the online demo with the sentence you wrote. As you can see, "meeting" is not lemmatized (I guess it's the weird grammar?) and neither is "dissappointed" (it's misspelled).
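For illustration, a dictionary-based lemma lookup leaves out-of-vocabulary words untouched, which would explain why a misspelling never gets lemmatized. A minimal sketch, assuming a simple lemma table (the `LEMMAS` dict here is made up, not the parser's actual data):

```python
# Hypothetical lemma table; real parsers use far richer morphology.
LEMMAS = {
    "me": "I",
    "meet": "meet",
    "disappointed": "disappointed",
}

def lemmatize(word):
    # Out-of-vocabulary words pass through unchanged, so a misspelling
    # like "dissappointed" is never mapped to a lemma.
    return LEMMAS.get(word, word)
```

Under this model, `lemmatize("me")` gives `"I"` while `lemmatize("dissappointed")` just echoes the misspelling back.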
The obvious question is how Aditi lemmatized those words. I have no idea.
I'm not sure what's going on with some of the sequences that skip words.
Some sequences are stripped of stop words, could that explain it?
Well, the sequences with the skipped words occur in both the old and new databases, so they're not a problem.
It's possible that Aditi's SQL dump is inconsistent. In one run, there was a sentence that contained the word "theater" and was an exact match in both databases, yet the word accuracy check flagged an inaccuracy because one database contained "theater" and the other contained "theatre". That made no sense to me.
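For context, a word-by-word comparison like the accuracy check described would flag exactly that kind of spelling drift. A rough sketch (the function name and shape are my own, not the actual checker):

```python
def word_accuracy(old_words, new_words):
    """Compare two renderings of a sentence word by word and
    return (index, old, new) tuples for every mismatch."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(old_words, new_words))
        if a != b
    ]

# Two otherwise-identical sentences differing only in one spelling:
mismatches = word_accuracy(
    "we went to the theater".split(),
    "we went to the theatre".split(),
)
```

Here `mismatches` ends up holding the single "theater"/"theatre" pair, even though everything else matches exactly.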
Very strange. Is our code consistent?
I have no reason to believe that our code occasionally changes words to British English.
Our code really just moves the parser's output into the database while handling errors that pop up along the way; we don't do any processing or modification of the output strings, so inconsistent parsing more likely comes from the parser itself, which has been the case for the inconsistencies we've dealt with so far.
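To illustrate the pass-through pattern I mean, here's a minimal sketch; the table name, columns, and sqlite backend are assumptions for the example, not our actual schema:

```python
import sqlite3

def store_parses(conn, parses):
    # Parser output is stored verbatim: only error handling,
    # no modification of the output strings.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parses (sentence TEXT, output TEXT)"
    )
    for sentence, output in parses:
        try:
            # The string goes in exactly as the parser produced it.
            conn.execute(
                "INSERT INTO parses VALUES (?, ?)", (sentence, output)
            )
        except sqlite3.Error as exc:
            print(f"skipping {sentence!r}: {exc}")
    conn.commit()

conn = sqlite3.connect(":memory:")
store_parses(conn, [("a sentence", "(ROOT (NP (DT a) (NN sentence)))")])
```

Since nothing rewrites the strings in transit, any "theater"/"theatre" divergence would have to originate upstream of this step.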
Heh. You're right about that, but the Python interface probably doesn't Britishize words, and Aditi's code used the same Java library we do, so it's a mystery why "dissappointed" was lemmatized for her but nowhere else.
Well, I'm going to conclude that the issue is with Aditi's SQL dump, not with our code. I don't believe that our sequences are incorrect, but just so we're sure, you should double-check get_sequence
to make sure that the sequence creation process has no issues. If you think it's fine, go ahead and close this.
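For reference while double-checking, here's a hypothetical sketch of what a sequence-creation step like get_sequence might boil down to, assuming fixed-length word windows over the lemmatized tokens (the n-gram shape is my guess, not the project's actual code):

```python
def get_sequences(lemmas, n=4):
    # Slide a window of n lemmas across the sentence and join
    # each window into one sequence string.
    return [" ".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]

seqs = get_sequences(["be", "dissappointed", "meeting", "I"], n=4)
```

The point being: if the lemmas coming in are already wrong, the sequences will faithfully reproduce the error, no matter how correct this step is.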
Looks the same as the original Java code; should be fine.
We don't seem to properly lemmatize all the words in the sequences.
For example, take the sentence, "If you're also looking for a serious relationship and would like to meet a real character, you won't be dissappointed meeting me."
My accuracy checker shows that some of the sequences that were in the SQL dump but not in ours include:
...while some of the sequences that we have but the SQL dump doesn't include:
...which seems to result from the words "dissappointed" and "meeting" not being lemmatized. When I looked up the sequences directly, I saw that it had these sequences:
I'm not sure what's going on with some of the sequences that skip words, but for sequences like "be dissappointed meeting I", it's clear that the word "me" got lemmatized, while the others didn't.
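The comparison itself falls out of a plain set difference; a sketch of the idea, with illustrative sequence strings (the lemmatized forms shown are examples, not taken from the actual dump):

```python
# Sequences from Aditi's SQL dump vs. the ones we generated
# (contents are illustrative, not real data).
dump_seqs = {"be disappoint meet I", "will not be disappoint"}
our_seqs = {"be dissappointed meeting I", "will not be disappoint"}

# Sequences in the dump that we don't have, and vice versa:
missing_from_ours = dump_seqs - our_seqs
extra_in_ours = our_seqs - dump_seqs
```

Each un-lemmatized word shows up twice in the report: once as a missing sequence and once as an extra one, which matches the pattern above.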