Improve Evaluation (Matrix) Infrastructure

Before #334, we only supported evaluating the model on a single, fixed evaluation dataset.

In #334, we introduce an evaluation matrix, in which we evaluate each trained model on each trigger's training set at the end. However, the current implementation is quite hacky:

We extended the evaluation code path with several optional parameters that indicate whether the evaluator uses the OnlineDataset. Instead of checking for None everywhere, we might want two distinct kinds of evaluation requests. The control flow needs cleaning up.
We can parallelize the evaluations (depending on GPU capacity) and overlap evaluation with training. The supervisor (server) needs to become less sequential anyway; once the supervisor server is implemented, we should add a per-pipeline queue of pending evaluation requests. This avoids running the entire matrix at the end.
In the matrix scenario, we currently take some settings from the pipeline's training configuration and some from its evaluation configuration, which is not very clean. We should instead add the matrix evaluation options to the pipeline explicitly and differentiate between fixed evaluation and matrix evaluation.
We need to add tests for this.
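
As a starting point for discussion, here is a minimal sketch of what two separate request kinds and a per-pipeline queue could look like. All names here (`FixedEvalRequest`, `MatrixEvalRequest`, `enqueue_after_training`, `describe`) are hypothetical and do not exist in the current code base; the sketch only illustrates the shape of the idea, not a proposed final API.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Iterable, Union


@dataclass(frozen=True)
class FixedEvalRequest:
    """Hypothetical request: evaluate a model on a fixed evaluation dataset."""
    model_id: int
    dataset_id: str


@dataclass(frozen=True)
class MatrixEvalRequest:
    """Hypothetical request: evaluate a model on the training set of one trigger."""
    model_id: int
    trigger_id: int


EvalRequest = Union[FixedEvalRequest, MatrixEvalRequest]


def enqueue_after_training(
    queue: "Queue[EvalRequest]", model_id: int, trigger_ids: Iterable[int]
) -> None:
    """Queue one matrix cell per trigger as soon as a model has been trained,
    instead of computing the whole matrix at the end of the pipeline."""
    for trigger_id in trigger_ids:
        queue.put(MatrixEvalRequest(model_id=model_id, trigger_id=trigger_id))


def describe(request: EvalRequest) -> str:
    """Dispatch on the request type rather than on optional fields being None."""
    if isinstance(request, FixedEvalRequest):
        return f"model {request.model_id} on dataset {request.dataset_id!r}"
    return f"model {request.model_id} on the training set of trigger {request.trigger_id}"
```

With this shape, the evaluator would branch once on the request type, and the supervisor could drain the queue with as many workers as GPU capacity allows while training continues.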

eth-easl / modyn

Improve Evaluation (Matrix) Infrastructure #335