allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Enable evaluation of models on test set in Python code #181

Closed nelson-liu closed 7 years ago

nelson-liu commented 7 years ago

This PR lets you pass in a test_file to evaluate on after training is completed.

TODO:
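To make the behavior concrete, here is a minimal Keras-style sketch of "train, then evaluate on a test set"; the toy model and data arguments are assumptions for illustration, not the actual deep_qa solver code.

```python
# Minimal sketch (not the deep_qa implementation): train a model, then
# evaluate it on a held-out test set once training has finished.
from keras.models import Sequential
from keras.layers import Dense

def train_then_evaluate(x_train, y_train, x_test, y_test):
    # Hypothetical toy model; deep_qa builds its real models from parameter files.
    model = Sequential([Dense(1, activation='sigmoid', input_dim=x_train.shape[1])])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(x_train, y_train, verbose=0)
    # After training completes, run evaluation on the test data.
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return loss, accuracy
```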

nelson-liu commented 7 years ago

This was tested by training and validating the attention sum reader on SciQ, and then evaluating on a test set.

nelson-liu commented 7 years ago

This PR is ready to be merged as-is, but it'd be nice to be able to do test-set evaluation on the best epoch, as well as to add this to the Scala experiment code.

nelson-liu commented 7 years ago

I think this can be merged as-is. Travis pylint is failing, but it looks to be spurious and I can't figure out why it's complaining about that file.

matt-gardner commented 7 years ago

I was wondering why I didn't get notified of this, and why it didn't get merged earlier - I guess you never added me as a reviewer! I'll look over it now.

nelson-liu commented 7 years ago

looks like it still complains

I think the reason this didn't show up before is that we cache the conda environment, so we weren't pulling this new version. I don't think this behavior is desirable, so we can:

  1. remove the conda environments from the cache (this is annoying since we need to reinstall packages every time)
  2. cache the conda environment, but do pip install -U. I'm not sure how this interacts with the requirements.txt file, as I'd like to pull the latest versions of the packages that we haven't pinned, but not upgrade the packages that we have pinned.

Or we could say never mind to the new spacy version: pin it to the old version and not install the data. I'm personally partial to 2; thoughts?

nelson-liu commented 7 years ago

Looks like -U only upgrades a package when the requirement in requirements.txt isn't already fulfilled, so I'll go ahead and do that.
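For reference, a small illustration of that behavior with a mixed requirements.txt; the package names and versions here are just examples, not the repo's actual pins:

```
# requirements.txt (example entries)
keras==1.2.2   # pinned: stays at this exact version even with pip install -U -r requirements.txt
spacy          # unpinned: eligible to be upgraded to the latest release
```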

matt-gardner commented 7 years ago

This looks good to me. It does let us evaluate on the best model: run training without test files, then run testing without training files, which will load the best model. Feel free to merge; I'll merge this soon if you haven't, unless there's something you still want to change.
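A sketch of how that two-step flow looks in plain Keras terms, assuming the best epoch is saved via a checkpoint callback during the training run and loaded back for the test run; the file path and function names are illustrative, not deep_qa's actual run scripts.

```python
from keras.callbacks import ModelCheckpoint
from keras.models import load_model

def training_run(model, x_train, y_train, x_val, y_val):
    # Step 1: train without any test files, keeping only the best epoch
    # (by validation loss) at an illustrative path.
    checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                                 save_best_only=True)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              callbacks=[checkpoint], verbose=0)

def test_run(x_test, y_test):
    # Step 2: a separate run with no training files; load the saved best
    # model and evaluate it on the test set.
    best_model = load_model('best_model.h5')
    return best_model.evaluate(x_test, y_test, verbose=0)
```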

nelson-liu commented 7 years ago

That's true, but it'd be nice to do it all without manual intervention :)