clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

define "evaluator" interface #9

Open keighrim opened 1 year ago

keighrim commented 1 year ago

(subtask of #3)

We'd like to define a minimal but concrete behavior for the class of "evaluator" objects. Some features are also discussed in https://github.com/clamsproject/aapb-annotations/issues/2#issuecomment-1542748851. At the very minimum, an "evaluator" should be able to

  1. take a batch of gold files and a batch of prediction files, and return a single HTML file with the evaluation result
  2. take multiple batches of gold and prediction files, and return a single HTML file with all the per-batch evaluation results plus an aggregated result.
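The two behaviors above could be sketched as an abstract base class. This is only a hypothetical sketch to make the discussion concrete; the class, method, and parameter names (`Evaluator`, `evaluate`, `evaluate_all`) are assumptions, not an agreed-upon API:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterable, Tuple


class Evaluator(ABC):
    """Hypothetical minimal interface for an AAPB evaluator."""

    @abstractmethod
    def evaluate(self, golds: Iterable[Path], preds: Iterable[Path]) -> str:
        """Evaluate one batch of gold files against one batch of
        prediction files and return the report as a single HTML string."""

    def evaluate_all(
        self, batches: Iterable[Tuple[Iterable[Path], Iterable[Path]]]
    ) -> str:
        """Evaluate multiple (golds, preds) batches and return one HTML
        report that concatenates the per-batch results; a real
        implementation would also compute an aggregated result here."""
        sections = [self.evaluate(golds, preds) for golds, preds in batches]
        return "<html><body>\n" + "\n".join(sections) + "\n</body></html>"
```

A concrete evaluator (e.g. for NER or slate detection) would then only need to implement `evaluate` for a single batch, and would inherit the multi-batch aggregation behavior.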

Gold files are freely accessible from the https://github.com/clamsproject/aapb-annotations repository, but predictions files almost always need to be generated on demand, and in many cases (vision, audio apps) generating predictions will take hours, if not days, even with a small size batch. But running CLAMS pipelines, waiting for the generation for predictions (MMIF), and finally obtaining those MMIF files should not be responsibility of evaluators, but instead the evaluation "runner" or "invoker" should take charge of obtaining all golds and preds files before an evaluator runs.