@dirkgr @epwalsh @AkshitaB @matt-gardner in triage we mentioned merging `evaluate` and `predict`, but this seems confusing to me. Do you actually think we should merge these commands, or simply add an option to `evaluate` that outputs predictions alongside the final metrics?
After some discussion, we realized we don't actually fully understand what `predict` does, so we probably need to investigate the current behavior and make a recommendation before jumping into a change.
I use `predict` to test the behavior of different versions of AllenNLP. We have `predict` examples in our demo for each model, for example:
echo '{"passage": "The Matrix is a 1999 science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano.", "question": "Who stars in The Matrix?"}' | \
allennlp predict https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2020.03.19.tar.gz -
Let's settle this issue.
It seems to me that `predict` is for processing one instance at a time, while `evaluate` runs over datasets. That seems like a useful distinction that we should keep. Can we make this clearer in the names? I can't think of better names, but we could easily spruce up the help text for those two commands.
With that in mind, we should resolve this issue by making `evaluate` output predictions by default, with a flag to turn that off. Yay?
The difference between these two commands was originally that one of them used a `DatasetReader` and one of them used a `Predictor` (hence `predict`). Using a `Predictor` means you take text/JSON inputs and produce text/JSON outputs, while `evaluate` doesn't have a `Predictor` and thus doesn't produce text/JSON outputs; it only computes metrics. Nothing in the difference between these commands says anything about the size of the data being input.
I can drive to work in a truck, and I can bring bricks to a building site with my bicycle.
It makes sense that we'd have one command that runs readers and one command that expects JSON. The difference between them is pretty big. Readers need configuration, and they don't even consistently take filenames. Meanwhile, `predict` can read from stdin. That's a pretty wide gap to bridge in a single command.
I had not considered the fact that we need a `Predictor` to write output predictions to a file. I guess that means producing them can't be the default, because we might not have a `Predictor`.
In practice the only thing we need the `Predictor` for is to get inputs into the right format; the `Predictor` (typically) doesn't do much on the output side, and it'd be pretty easy to duplicate that inside of `evaluate` if that's what we want to do (I think it's just one call to `sanitize`).
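For illustration, a minimal sketch of what that duplication could look like inside `evaluate`, assuming an already loaded `model` and `dataset_reader`; the `write_predictions` helper name is hypothetical and this is not the actual implementation:

```python
import json

from allennlp.common.util import sanitize


def write_predictions(model, dataset_reader, data_path, output_path):
    """Run the model over a dataset and dump per-instance outputs as JSON lines."""
    with open(output_path, "w") as output_file:
        for instance in dataset_reader.read(data_path):
            # forward_on_instance indexes the instance with the model's vocab
            # and returns a dict of numpy arrays for this single instance.
            output_dict = model.forward_on_instance(instance)
            # sanitize converts numpy arrays and other non-JSON types into
            # JSON-serializable Python objects.
            output_file.write(json.dumps(sanitize(output_dict)) + "\n")
```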
I think the issue is that historically there were two differences between these two commands: `predict` took JSON inputs and gave instance-level outputs, while `evaluate` took `DatasetReader` inputs and gave metrics as output. People wanted instance-level outputs with `DatasetReader` inputs, so we modified `predict` to be able to do that. In hindsight, it probably would have been better to go the other way, and just add instance-level outputs to `evaluate`, which is what the title of this issue suggests doing. Then, if we wanted, we could also remove the `--use-dataset-reader` option from `predict`, because having both that flag and the one proposed in this issue means there's almost no difference between the two commands.
We sometimes get requests from users who would like to store predictions during evaluation (for an example, see https://github.com/allenai/allennlp/issues/4620). We are interested in adding a flag to `evaluate` that outputs predictions along with the final metrics.
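As a sketch only (the flag name `--predictions-output-file` is hypothetical here, not an existing option), the proposed flag might be exposed on the `evaluate` subcommand roughly like this:

```python
import argparse

# Hypothetical wiring for the proposed flag; not the actual AllenNLP CLI code.
parser = argparse.ArgumentParser(prog="allennlp")
subparsers = parser.add_subparsers(dest="command")

evaluate_parser = subparsers.add_parser("evaluate", help="Evaluate a trained model.")
evaluate_parser.add_argument("archive_file", type=str, help="Path to the model archive.")
evaluate_parser.add_argument("input_file", type=str, help="Path to the evaluation data.")
evaluate_parser.add_argument(
    "--predictions-output-file",  # hypothetical flag name
    type=str,
    default=None,
    help="If given, also write per-instance predictions (JSON lines) to this path "
    "in addition to printing the final metrics.",
)

# Example invocation with placeholder paths.
args = parser.parse_args(
    ["evaluate", "model.tar.gz", "dev.json", "--predictions-output-file", "predictions.jsonl"]
)
```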