Standardize files for the same invocation

Because

As part of moving towards an automated process, it is generally easier if the evaluation process in each task is as similar as possible. Therefore, it should be considered to standardize:

All python evaluation scripts to be evaluate.py
These scripts should use the same argument structure and shortforms, e.g. -g, -m/-p, -r.

Done when

No response

Additional context

No response

clamsproject / aapb-evaluations

Standardize files for the same invocation #41

Because

Done when

Additional context