Planning Documentation/Templates for Future Automation of Evals

Because

The goal is to improve the automation of future evaluation tasks (as opposed to updating current evaluations to become automatic), so we should brainstorm which components could make running the evaluations more automatic.

This issue focuses on which formats, templates, and common practices should be adhered to in order to allow better automation of this process.

Done when

[ ] A general template for future tasks is created. It uses goldretriever and gives every task the same invocation, so it can be followed for future tasks. Importantly, it includes some error handling, especially basic sanity tests such as checking that the expected number of files is output.
[ ] A template for READMEs is created that describes how to run the apps until full automation is in place.
[ ] A template for the reports generated by the evals is created. How similar each task's report should be is still under discussion, and how much of the report can be generated automatically is still to be determined.
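
To make the "basic sanity tests" in the first item more concrete, here is a minimal sketch of a file-count check that a task template could run before evaluating. It is only an illustration: the function names, the default file extensions, and the assumption that predictions and golds can be paired by file stem are all hypothetical and not taken from the existing repo.

```python
from pathlib import Path


def check_file_counts(pred_dir, gold_dir, pred_ext=".mmif", gold_ext=".csv"):
    """Compare prediction files against gold files by file stem.

    ASSUMPTION: a prediction and its gold share the same stem
    (e.g. `abc.mmif` pairs with `abc.csv`). Real tasks may need a
    different pairing rule.

    Returns (missing, extra): stems with a gold but no prediction,
    and stems with a prediction but no gold.
    """
    pred_stems = {p.stem for p in Path(pred_dir).glob(f"*{pred_ext}")}
    gold_stems = {p.stem for p in Path(gold_dir).glob(f"*{gold_ext}")}
    missing = sorted(gold_stems - pred_stems)
    extra = sorted(pred_stems - gold_stems)
    return missing, extra


def sanity_check(pred_dir, gold_dir, **kwargs):
    """Raise ValueError if predictions and golds do not line up."""
    missing, extra = check_file_counts(pred_dir, gold_dir, **kwargs)
    if missing or extra:
        raise ValueError(
            f"prediction/gold mismatch: missing={missing}, extra={extra}"
        )
```

A template could call `sanity_check(pred_dir, gold_dir)` as its first step, so that a misplaced or incomplete set of predictions fails early with a readable message instead of producing a partial report.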

Additional context

Questions to consider for moving towards automation:

How can the process be made more automatic than downloading the eval repo, placing the MMIF predictions in the correct place, and running the code?
- Should the evaluation code be wrapped in a Docker app that also contains the environment and modules needed to run it?
- Perhaps one could run a single app, select the task to evaluate, and let the app handle the rest?
Where should generated MMIFs or other system outputs be placed? Should any of that be automatic?
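
As one possible shape for the "run one app, select the task" idea above, the sketch below uses a small registry plus `argparse` so that every task is invoked the same way. Everything here is an assumption for discussion: the registry, the `example-task` name, the placeholder evaluator, and the `--mmif-dir` / `--gold-dir` flags are hypothetical and do not reflect the current evaluation scripts.

```python
import argparse

# Hypothetical registry mapping task names to evaluation functions.
EVALUATORS = {}


def register(task_name):
    """Decorator that adds an evaluation function to the registry."""
    def decorator(func):
        EVALUATORS[task_name] = func
        return func
    return decorator


@register("example-task")
def evaluate_example(mmif_dir, gold_dir):
    # Placeholder: a real task would load the files and compute metrics.
    return {"task": "example-task", "mmif_dir": mmif_dir, "gold_dir": gold_dir}


def build_parser():
    parser = argparse.ArgumentParser(description="Run one evaluation task.")
    # Only registered task names are accepted.
    parser.add_argument("task", choices=sorted(EVALUATORS))
    parser.add_argument("-m", "--mmif-dir", required=True,
                        help="directory containing MMIF predictions")
    parser.add_argument("-g", "--gold-dir", required=True,
                        help="directory containing gold files")
    return parser


def main(argv=None):
    """Parse arguments and dispatch to the selected task's evaluator."""
    args = build_parser().parse_args(argv)
    return EVALUATORS[args.task](args.mmif_dir, args.gold_dir)
```

Calling `main(["example-task", "-m", "preds", "-g", "golds"])` dispatches to the registered function. The same entry point could later become the command of a Docker image, which would also answer the environment question, but that choice is still open.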

What other todos and concerns should be addressed to improve this process flow?

clamsproject / aapb-evaluations