So far, the only failing test in the pipeline is test_parse for the StringProcessor, since you've made a lot of changes. I'm working on that.
There are two other errors in the application due to Hassan's branch, so once he fixes those we can merge in the changes and those issues will be resolved.
That's good, I was afraid there were tons of errors. Let me know if you want to change the way parse behaves or split it up or something to make unit testing easier.
Making the default behavior of save to actually write to the database fixed a lot of the errors; we just have to make sure the pipeline passes in the correct argument.
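Roughly what I mean, as a sketch - the model fields, the module-level session, and the commit flag here are made up for illustration, not the actual pipeline code:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
engine = create_engine("sqlite:///:memory:")
Session = sessionmaker(bind=engine)
session = Session()

class Word(Base):
    __tablename__ = "word"
    id = Column(Integer, primary_key=True)
    word = Column(String)
    lemma = Column(String)
    part_of_speech = Column(String)

    def save(self, commit=True):
        # Default behavior: actually write to the database.
        session.add(self)
        if commit:
            session.commit()

Base.metadata.create_all(engine)

w = Word(word="we", lemma="we", part_of_speech="PRP")
w.save()       # writes immediately when the pipeline passes (or defaults to) commit=True
print(w.id)    # the word gets a real ID instead of a placeholder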
Splitting up parse might be a good idea; it's quite long. I'll look at that.
Here's something I noticed: if there is more than one word that's not found in the database, the program will fail, since it tries to set 0 as an ID for the Words.
The main problem is outlined here: https://github.com/Wordseer/wordseer_flask/blob/removing-readerwriter/lib/wordseerbackend/wordseerbackend/stringprocessor.py#L104
The one case I know of is apostrophes - words like "we'll" split into either "we" and "ll" or "we" and "'ll" (the latter with the apostrophe included). I don't know what other cases there are, but if we can sort this out, we can revert to the old filter_by, and there should never be a case where filter_by returns anything but one result, since Word is inherently unique on word, lemma, and part_of_speech.
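To make the uniqueness argument concrete, here's a rough sketch (the model and setup code are made up, since I don't have the real schema in front of me) of how a unique constraint on word, lemma, and part_of_speech lets a plain filter_by expect exactly one row:

from sqlalchemy import Column, Integer, String, UniqueConstraint, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Word(Base):
    __tablename__ = "word"
    __table_args__ = (UniqueConstraint("word", "lemma", "part_of_speech"),)
    id = Column(Integer, primary_key=True)
    word = Column(String)
    lemma = Column(String)
    part_of_speech = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Word(word="we", lemma="we", part_of_speech="PRP"))
session.commit()

# If the tokenizer behaves consistently, this lookup is never ambiguous:
# .one() raises if zero rows or more than one row match.
match = session.query(Word).filter_by(
    word="we", lemma="we", part_of_speech="PRP").one()
print(match.id)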
The other option is to figure out how to split sentences and parse separately. Then, during structure extractions, we can just handle sentences and not save words, and then save words and dependencies at the same time, avoiding the problem altogether. I think this is the better solution, but we need to figure out how to split up the parsing process.
"The other option is to figure out how to split sentences and parse separately."
I think the parser splits sentences before it parses, so why do we need to do the splitting?
Well, we want the parser to output the split sentences before it parses them, because we want to do the splitting and parsing in different parts of the pipeline. Hence:
"Then, during structure extractions, we can just handle sentences and not save words, and then save words and dependencies at the same time"
Well, you can't save sentences without saving their words too, unless you want to save sentences first and then add in their words.
I don't understand why there would be a mismatch, or why we'd be getting "we" and "'ll" sometimes and "we" and "ll" other times... unless it's not an actual ' character and instead some sort of look-alike. I recently consolidated two lists of punctuation and there were some weird characters there. I think we should figure out why we're getting weird things like that.
That's what I mean: save sentences and then save words later. If we only do sentence splitting when reading in the documents, we won't even have words anyway.
I don't really understand why the parser behaves weirdly sometimes... honestly, I'm fine with whatever solution will work.
We can see about splitting sentences only... maybe we could even modify the python package we're using to interface with the java library?
The parser does behave weirdly, but I worry about these weird edge cases becoming actual problems at some point.
I'm not sure we want to go down the rabbit hole of messing with the parser.
However, NLTK does provide a tool which only splits sentences (no tokenization involved):
http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt
Perhaps we could use that to split sentences and then the Stanford parser for the rest?
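Something like this should do it, I think - just a sketch using the stock Punkt model; sent_tokenize splits text into sentences without tokenizing the words:

import nltk

nltk.download("punkt", quiet=True)  # one-time download of the Punkt sentence model
from nltk.tokenize import sent_tokenize

text = "We'll split here. Dr. Smith's parser handles the rest. It's worth a try."
for sentence in sent_tokenize(text):
    print(sentence)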
I think it's something to consult Aditi and the professor about, especially since it was Aditi's suggestion to do the splitting and parsing separately - maybe she knows how to do it or can help us modify the parser to do it.
Sure. The Stanford parser does support doing only sentence splitting, so maybe modifying our Python module won't be so difficult.
I've pushed a new version to GitHub. Can you test it out and tell me if something is breaking? No change in the unit tests and run_pipeline, it seems. Here are the steps.
pip uninstall stanford-corenlp-python
git clone git@github.com:Wordseer/stanford-corenlp-python.git
cd stanford-corenlp-python
python setup.py install
That should install the new version to your system. I removed the weird lambda that was stripping away apostrophes. If this doesn't break anything, we can get rid of the workaround code (in stringprocessor) and simplify the code, which should make it easier to write a useful unit test.
Once we have that we'll be close to merging to master.
Still having the same error as last time, it looks like. Have you given it a try?
Well, yes, it's not going to fix the error. What I mean is, this change should mean that there is no more mismatch between the words in the sentences and the dependencies.
If that's true, we can probably remove a decent chunk of code that you wrote to catch those cases, right?
Yes, it looks like that's fixed. I'll work on updating the code tomorrow.
Great, let me know when it's finished. I'll upload a new version to pypi tomorrow.
I pushed the changes to removing-readerwriter, but r_and_j still won't run because of the CoreNLP abnormal termination error. I think the sentence_error_handling branch will not error out, at the very least, but it'll skip sentences that give errors.
Okay, that's a good start. I'll write a better unit test for stringprocessor.py. Once we can split sentences we can handle errors better. Do you make a log message when a sentence gives an error?
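Roughly the shape I have in mind for the test - the parse interface here (sentences with words and dependencies, governor/dependent attributes) is assumed, so adjust it to whatever the real objects look like:

import unittest

from wordseerbackend.stringprocessor import StringProcessor

class TestStringProcessor(unittest.TestCase):
    def test_parse_words_match_dependencies(self):
        # Assumed interface: parse() returns sentences, each with .words and
        # .dependencies; the point is that every dependency should refer to a
        # word that actually appears in its sentence.
        processor = StringProcessor()
        sentences = processor.parse("We'll test this. It shouldn't mismatch.")
        for sentence in sentences:
            words = {w.word for w in sentence.words}
            for dep in sentence.dependencies:
                self.assertIn(dep.governor, words)
                self.assertIn(dep.dependent, words)

if __name__ == "__main__":
    unittest.main()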
They're still print statements, but we can just change them to logs.
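The swap is basically just this (the helper name is made up for illustration):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wordseerbackend.stringprocessor")

def record_sentence_error(sentence_text, error):
    # Hypothetical helper: instead of print(), keep the failure in the log
    # with enough context to track down which sentences were skipped.
    logger.warning("Skipping sentence %r: %s", sentence_text, error)

record_sentence_error("We'll test this.", "CoreNLP abnormal termination")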
The unit test has been pushed; we should start getting passing tests now.
I realized we didn't open an issue for it. What are the problems with the unit tests right now, and what steps do we need to take to fix them?