Description

This PR is motivated by the need to run individual scoring jobs for each model/benchmark pair on an HPC (e.g., Openmind) instead of running a single job that computes scores for all new models and benchmarks in a submission. We handle this by restructuring the scoring endpoint to separate retrieving the model/benchmark names from the actual scoring (for example, resolving `ALL_PUBLIC` into concrete benchmarks without necessarily scoring them). As a result, the domain-specific plugin manager now has the flexibility to decide the best method for identifying and scoring the model/benchmark pairs.
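As a rough illustration of the split, the pair-enumeration step can be exposed on its own so that a scheduler submits one job per pair. This is a minimal sketch, not the actual Brain-Score API; every name here (`DomainPluginManager`, `resolve_benchmarks`, `get_pairs`, the registry contents) is hypothetical.

```python
# Hypothetical sketch: enumerating model/benchmark pairs is decoupled from
# scoring, so an HPC scheduler can submit one job per pair. All names are
# illustrative, not the actual Brain-Score API.
from itertools import product
from typing import Iterable, List, Tuple


class DomainPluginManager:
    """Illustrative domain-specific plugin manager."""

    def resolve_benchmarks(self, spec: Iterable[str]) -> List[str]:
        # A spec like "ALL_PUBLIC" expands to every public benchmark;
        # here the lookup is faked with a fixed registry.
        registry = {"ALL_PUBLIC": ["benchmark-a", "benchmark-b"]}
        resolved: List[str] = []
        for name in spec:
            resolved.extend(registry.get(name, [name]))
        return resolved

    def get_pairs(self, models: Iterable[str],
                  benchmarks: Iterable[str]) -> List[Tuple[str, str]]:
        # Step 1: enumerate pairs only -- no scoring happens here.
        return list(product(models, self.resolve_benchmarks(benchmarks)))

    def score(self, model: str, benchmark: str) -> float:
        # Step 2: score a single pair; on an HPC this is one job.
        return 0.0  # placeholder


if __name__ == "__main__":
    manager = DomainPluginManager()
    for model, benchmark in manager.get_pairs(["model-x"], ["ALL_PUBLIC"]):
        # e.g., submit manager.score(model, benchmark) as one Slurm job
        print(model, benchmark, manager.score(model, benchmark))
```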
Testing Strategy
This PR adds unit tests that separately exercise model/benchmark retrieval and scoring, using dummy models and benchmarks.
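A minimal sketch of that test layout, runnable under pytest and reusing the hypothetical `DomainPluginManager` from the sketch above (the module name `scoring_sketch` is likewise assumed); the actual tests in this PR may be structured differently.

```python
# Hypothetical unit tests mirroring the strategy above: retrieval and
# scoring are exercised independently with dummy models/benchmarks.
from scoring_sketch import DomainPluginManager  # illustrative module name


def test_get_pairs_expands_all_public():
    manager = DomainPluginManager()
    pairs = manager.get_pairs(["dummy-model"], ["ALL_PUBLIC"])
    # Retrieval alone: pairs are enumerated without any scoring.
    assert ("dummy-model", "benchmark-a") in pairs
    assert ("dummy-model", "benchmark-b") in pairs


def test_score_single_pair():
    manager = DomainPluginManager()
    # Scoring alone: one dummy model against one dummy benchmark.
    score = manager.score("dummy-model", "benchmark-a")
    assert isinstance(score, float)
```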