Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Revise and pass unit tests #102

Closed keien closed 10 years ago

keien commented 10 years ago

I realized we didn't open an issue for this. What are the problems with the unit tests right now, and what steps do we need to take to fix them?

abendebury commented 10 years ago

So far, the only failing test in the pipeline is test_parse for the StringProcessor, since you've made a lot of changes. I'm working on that.

There are two other errors in the application due to Hassan's branch, so once he fixes those we can merge in the changes and those issues will be resolved.

keien commented 10 years ago

That's good; I was afraid there were tons of errors. Let me know if you want to change the way parse behaves, or split it up, or something else to make unit testing easier.

abendebury commented 10 years ago

Making the default behavior of save actually write to the database fixed a lot of the errors; we just have to make sure the pipeline passes in the correct argument.
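
For reference, a minimal sketch of that pattern, assuming a Flask-SQLAlchemy setup (the commit flag and the db handle are illustrative names, not necessarily the actual WordSeer API):

    from app import db  # assumed Flask-SQLAlchemy handle

    class SaveMixin(object):
        """Mixin giving models a save() that writes to the database by
        default; pass commit=False to defer the write."""

        def save(self, commit=True):
            db.session.add(self)
            if commit:
                db.session.commit()
            return self

The pipeline would then just have to pass the right flag at each call site.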

Splitting up parse might be a good idea; it's quite long. I'll look at that.

abendebury commented 10 years ago

Here's something I noticed: if there is more than one word that isn't found in the database, the program will fail, since it tries to set 0 as the ID for each of those Words.

https://github.com/Wordseer/wordseer_flask/blob/removing-readerwriter/lib/wordseerbackend/wordseerbackend/stringprocessor.py#L124
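
Roughly, the failure and one possible fix look like this (a sketch with guessed names, based on the linked line):

    from app import db           # assumed Flask-SQLAlchemy handle
    from app.models import Word  # assumed import path for the model

    word_ids = []
    for token in tokens:         # tokens: output of the tokenizer
        word = Word.query.filter_by(word=token).first()
        if word is None:
            word = Word(word=token)
            db.session.add(word)
            # Without a flush, word.id is still unset, so every new Word
            # reports the same placeholder ID (the 0 described above).
            db.session.flush()   # lets the database assign a real key
        word_ids.append(word.id)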

keien commented 10 years ago

So the main problem here is outlined here: https://github.com/Wordseer/wordseer_flask/blob/removing-readerwriter/lib/wordseerbackend/wordseerbackend/stringprocessor.py#L104

The one case I know of is apostrophes: a word like "we'll" splits into either "we" and "ll" or "we" and "'ll" (the latter with the apostrophe included). I don't know what other cases there are, but if we can sort this out, we can revert to the old filter_by, and there should never be a case where filter_by returns anything but one result, since Word is inherently unique on word, lemma, and part_of_speech.
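
Concretely, the old lookup would be something like this (a sketch; surface, lemma, and pos are placeholder variables):

    # Word is unique on (word, lemma, part_of_speech), so once the
    # tokenizer is consistent this should match exactly one row;
    # .one() raises if it matches zero or several, surfacing the bug.
    word = Word.query.filter_by(
        word=surface,
        lemma=lemma,
        part_of_speech=pos,
    ).one()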

The other option is to figure out how to split sentences and parse separately. Then, during structure extraction, we can just handle sentences and not save words, and then save words and dependencies at the same time, avoiding the problem altogether. I think this is the better solution, but we need to figure out how to split up the parsing process.
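
A rough sketch of that two-phase flow (every name here is hypothetical, not the current pipeline API):

    def extract_structure(document):
        """Phase 1: split into sentences and save them, with no words yet."""
        for text in split_sentences(document.text):  # splitter only, no parse
            Sentence(text=text, document=document).save()

    def parse_sentences():
        """Phase 2: parse each saved sentence, saving its words and
        dependencies together so they can never fall out of sync."""
        for sentence in Sentence.query.all():
            parsed = parse(sentence.text)
            save_words_and_dependencies(sentence, parsed)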

abendebury commented 10 years ago

> The other option is to figure out how to split sentences and parse separately.

I think the parser already splits sentences before it parses, so why do we need to do the splitting ourselves?

keien commented 10 years ago

Well, we want the parser to output the split sentences before it parses, because we want to do the splitting and the parsing in different parts of the pipeline. Hence:

> Then, during structure extraction, we can just handle sentences and not save words, and then save words and dependencies at the same time

abendebury commented 10 years ago

Well, you can't save sentences without saving their words too, unless you want to save the sentences first and then add in their words later.

I don't understand why there would be a mismatch, or why we'd be getting "we" and "'ll" sometimes and "we" and "ll" other times... unless it's not an actual ' character but some sort of look-alike. I recently consolidated two lists of punctuation, and there were some weird characters in there. I think we should figure out why we're getting weird things like that.
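
For what it's worth, the usual look-alike is the Unicode right single quotation mark (U+2019), which renders almost identically to the ASCII apostrophe; a quick check like this would reveal it:

    # The curly quote in "we’ll" is U+2019, not the ASCII apostrophe U+0027.
    for ch in u"we'll we’ll":
        if not ch.isalnum() and not ch.isspace():
            print(repr(ch), hex(ord(ch)))  # 0x27 vs. 0x2019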

keien commented 10 years ago

That's what I mean: save the sentences, and then save the words later. If we only do sentence splitting when reading in the documents, we won't even have words at that point anyway.

I don't really understand why the parser behaves weirdly sometimes... honestly, I'm fine with whatever solution works.

abendebury commented 10 years ago

We can see about splitting sentences only... maybe we could even modify the Python package we're using to interface with the Java library?

The parser does behave weirdly, but I worry about these weird edge cases becoming actual problems at some point.

abendebury commented 10 years ago

I'm not sure we want to go down the rabbit hole of messing with the parser.

However, NLTK does provide a tool that only splits sentences (no tokenization involved):

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt

Perhaps we could use that to split sentences, and then use the Stanford parser for the rest?
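
For example (a minimal sketch, assuming the Punkt model data has been downloaded):

    import nltk
    nltk.download("punkt")  # one-time download of the Punkt model

    from nltk.tokenize import sent_tokenize

    text = "We'll try this out. Punkt returns whole sentences, untokenized."
    print(sent_tokenize(text))
    # ["We'll try this out.", 'Punkt returns whole sentences, untokenized.']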

keien commented 10 years ago

I think it's something to consult Aditi and the professor about, especially since it was Aditi's suggestion to do the splitting and parsing separately; maybe she knows how to do it, or can help us modify the parser to do it.

abendebury commented 10 years ago

Sure. The Stanford parser does support splitting sentences only, so maybe modifying our Python module won't be so difficult.

abendebury commented 10 years ago

I've pushed a new version to GitHub. Can you test it out and tell me if anything breaks? It seems there's no change needed in the unit tests or run_pipeline. Here are the steps:

  1. pip uninstall stanford-corenlp-python
  2. git clone git@github.com:Wordseer/stanford-corenlp-python.git
  3. cd stanford-corenlp-python
  4. python setup.py install

That should install the new version on your system. I removed the weird lambda that was stripping away apostrophes. If this doesn't break anything, we can get rid of the workaround code (in stringprocessor) and simplify it, which should make it easier to write a useful unit test.

Once we have that, we'll be close to merging to master.

keien commented 10 years ago

Still having the same error as last time, it looks like. Have you given it a try?

abendebury commented 10 years ago

Well, yes, it's not going to fix that error. What I mean is that this change should eliminate the mismatch between the words in the sentences and the words in the dependencies.

If that's true, we can probably remove a decent chunk of code that you wrote to catch those cases, right?

keien commented 10 years ago

Yes, it looks like that's fixed. I'll work on updating the code tomorrow.

abendebury commented 10 years ago

Great, let me know when it's finished. I'll upload a new version to pypi tomorrow.

keien commented 10 years ago

I pushed the changes to removing-readerwriter, but r_and_j still won't run because of the CoreNLP abnormal termination error. The sentence_error_handling branch at least shouldn't error out, but it will skip sentences that produce errors.
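
Presumably that branch wraps the per-sentence work in something like this (a guess at the pattern, not the actual branch code):

    for sentence in sentences:
        try:
            parsed = parse(sentence)
        except Exception as err:  # e.g. CoreNLP terminating abnormally
            print("Skipping sentence that errored: %r (%s)" % (sentence, err))
            continue
        save_words_and_dependencies(sentence, parsed)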

abendebury commented 10 years ago

Okay, that's a good start. I'll write a better unit test for stringprocessor.py. Once we can split sentences, we can handle errors better. Do you log a message when a sentence produces an error?

keien commented 10 years ago

They're still print statements, but we can just change them to log calls.
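
Swapping them for the standard logging module would be a one-line change per call site, e.g.:

    import logging

    logger = logging.getLogger(__name__)

    # before: print("Skipping sentence that errored: %r" % sentence)
    logger.warning("Skipping sentence that errored: %r", sentence)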

abendebury commented 10 years ago

The unit test has been pushed; we should start getting passing tests now.