Add train and infer time values to evaluate_baselines results.
Infer time counts only the ensemble models with non-zero weight, so it reflects the inference time actually incurred in practice.
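
As a rough illustration of the infer-time rule above, here is a minimal sketch. It is not the code from this change: the function name `ensemble_infer_time`, the dict-based inputs, and the choice to sum per-model times are all assumptions made for the example.

```python
def ensemble_infer_time(weights, infer_times):
    """Return the total infer time of models that have non-zero ensemble weight.

    Hypothetical helper for illustration only.
    weights:     dict mapping model name -> ensemble weight
    infer_times: dict mapping model name -> infer time of that model
    Assumption: the ensemble's infer time is the sum of its members' times.
    """
    # Models with zero weight are never called at prediction time, so skip them.
    return sum(infer_times[model] for model, w in weights.items() if w != 0)


weights = {"model_a": 0.7, "model_b": 0.0, "model_c": 0.3}
infer_times = {"model_a": 1.5, "model_b": 10.0, "model_c": 0.5}
total = ensemble_infer_time(weights, infer_times)  # model_b is excluded
```

In this example the slow `model_b` does not contribute, since its weight is zero.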
Add metadata output to evaluate_ensemble. This exposes more detailed information, such as the ensemble weights for each task for a given portfolio, which lets us compute the mean ensemble weight for a given model across the benchmark.
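
The mean ensemble weight mentioned above could be computed along these lines. This is a sketch under assumptions: the real metadata layout returned by evaluate_ensemble is not shown here, so the nested dict `weights_per_task` (task name -> model name -> weight) and the function name `mean_ensemble_weights` are hypothetical.

```python
from collections import defaultdict


def mean_ensemble_weights(weights_per_task):
    """Average each model's ensemble weight over all tasks.

    Hypothetical helper for illustration only.
    weights_per_task: dict mapping task name -> {model name: ensemble weight}
    Assumption: a model missing from a task's dict has weight 0 on that task.
    """
    totals = defaultdict(float)
    for task_weights in weights_per_task.values():
        for model, weight in task_weights.items():
            totals[model] += weight
    n_tasks = len(weights_per_task)
    # Divide by the total number of tasks, not by the number of tasks the model appears in.
    return {model: total / n_tasks for model, total in totals.items()}


weights_per_task = {
    "task_1": {"model_a": 1.0},
    "task_2": {"model_a": 0.5, "model_b": 0.5},
}
mean_weights = mean_ensemble_weights(weights_per_task)
```

Dividing by the total task count treats an absent model as contributing zero weight on that task, which keeps the per-model means comparable across the benchmark.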