allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Add a flag to output predictions during evaluation #4626

Closed: schmmd closed 4 years ago

schmmd commented 4 years ago

We sometimes get requests from users who would like to store predictions during evaluation (for an example, see https://github.com/allenai/allennlp/issues/4620). We are interested in adding a flag to evaluate that outputs predictions alongside the final metrics.

schmmd commented 4 years ago

@dirkgr @epwalsh @AkshitaB @matt-gardner in triage we mentioned merging evaluate and predict, but this seems confusing to me. Do you actually think we should merge these commands, or simply add an option to evaluate that outputs predictions alongside the final metrics?

schmmd commented 4 years ago

After some discussion, we realized we don't actually fully understand what predict does, so we probably need to investigate the current behavior and make a recommendation before jumping into a change.

schmmd commented 4 years ago

I use predict to test the behavior of different versions of AllenNLP. Our demo includes a predict example for each model, for example:

echo '{"passage": "The Matrix is a 1999 science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano.", "question": "Who stars in The Matrix?"}' | \
allennlp predict https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2020.03.19.tar.gz -

schmmd commented 4 years ago

Related: https://github.com/allenai/allennlp/issues/2576

dirkgr commented 4 years ago

Let's settle this issue.

It seems to me that predict is for processing one instance at a time, while evaluate runs over whole datasets. That seems like a useful distinction that we should keep. Can we make it clearer in the names? I can't think of better names, but we could easily spruce up the help text for those two commands.

With that in mind, we should resolve this issue by making evaluate output predictions by default, with a flag to turn that off. Yay?

matt-gardner commented 4 years ago

The difference between these two commands was originally that one of them used a DatasetReader and one of them used a Predictor (hence predict). The Predictor means that you take text/JSON inputs and produce text/JSON outputs, while evaluate has no Predictor and thus produces no text/JSON outputs; it only computes metrics. Nothing in the difference between these commands says anything about the size of the data being input.
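
For concreteness, a minimal sketch of that contract using the public Predictor API; the archive URL is the demo one from earlier in this thread, and the exact output keys depend on the model:

from allennlp.predictors import Predictor

# Load a predictor from a model archive. This may require the
# allennlp-models package to be installed; a predictor_name can be
# passed explicitly if the archive does not specify a default.
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2020.03.19.tar.gz"
)

# JSON in, JSON out: the Predictor turns this dict into an Instance,
# runs the model, and sanitizes the output back to plain JSON types.
output = predictor.predict_json({
    "passage": "The Matrix is a 1999 science fiction action film ...",
    "question": "Who stars in The Matrix?",
})
print(sorted(output.keys()))  # a plain dict, ready for json.dumps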

dirkgr commented 4 years ago

I can drive to work in a truck, and I can bring bricks to a building site with my bicycle.

It makes sense that we'd have one command that runs readers, and one command that expects JSON. The difference between them is pretty big. Readers need configuration, and they don't even consistently take filenames. Meanwhile, predict can read from stdin. That's a pretty wide gap to bridge in a single command.
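
To make that gap concrete, here is a rough sketch of what the reader side involves, assuming the 1.x Python API (the paths are illustrative):

from allennlp.data import DatasetReader
from allennlp.models.archival import load_archive

# evaluate has to rebuild the DatasetReader from the archived training
# config before it can read anything.
archive = load_archive("model.tar.gz")  # illustrative path
config = archive.config.duplicate()
reader = DatasetReader.from_params(config.pop("dataset_reader"))

# What read() accepts (a filename, a directory, ...) depends entirely
# on the reader; predict, by contrast, can take raw JSON on stdin.
instances = reader.read("/path/to/eval_data")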

I had not considered the fact that we need a Predictor to write output predictions to a file. I guess that means producing them can't be the default, because we might not have a Predictor.

matt-gardner commented 4 years ago

In practice the only thing we need the Predictor for is to get inputs into the right format; the Predictor (typically) doesn't do much on the output side, and it'd be pretty easy to duplicate that inside evaluate if that's what we want to do (I think it's just one call to sanitize).
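
A tiny illustration of what sanitize does, for anyone following along (the values are illustrative):

import torch
from allennlp.common.util import sanitize

# sanitize recursively converts tensors (and numpy types) into plain
# Python lists and scalars, so the dict can be serialized as JSON.
output_dict = {"logits": torch.tensor([0.5, 2.0]), "label": "positive"}
print(sanitize(output_dict))  # {'logits': [0.5, 2.0], 'label': 'positive'}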

I think the issue is that historically there were two differences between these commands: predict took JSON inputs and gave instance-level outputs, while evaluate took DatasetReader inputs and gave metrics as output. People wanted instance-level outputs with DatasetReader inputs, so we modified predict to do that. In hindsight, it probably would have been better to go the other way and add instance-level outputs to evaluate, which is what the title of this issue suggests. Then, if we wanted, we could also remove the --use-dataset-reader option from predict, because having both that flag and the one proposed in this issue would leave almost no difference between the two commands.
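
If we went that route, a hedged sketch of what instance-level output in evaluate could look like, assuming the 1.x Model API; the file name and the batching helper are illustrative, not existing code:

import json
from allennlp.common.util import sanitize

# forward_on_instances() is what Predictor uses under the hood: it runs
# the model and returns one output dict per instance (as numpy arrays),
# so writing predictions is that call plus sanitize plus json.dumps.
with open("predictions.jsonl", "w") as out:  # illustrative file name
    for batch in batched(instances):  # hypothetical batching helper
        for prediction in model.forward_on_instances(batch):
            out.write(json.dumps(sanitize(prediction)) + "\n")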