AwePhD opened this issue 1 month ago
Hi @AwePhD,
Thanks for the kind words! We don't have any immediate plans to integrate other loggers, primarily as it introduces some maintenance overhead. I don't know about MLFlow, but for W&B, since you control the `run_pipeline` function, you should be able to just stick it in there without a problem!
If you manage to get it to work, we'd be delighted if you could share a sample script or similar that we could include in the documentation and share with others :)
Best, Eddie
Okay, thanks for the answer. Long story short: my high-level, domain-specific deep learning framework implements the MLFlow settings and boilerplate, but it seems to conflict with the NePS use case. I might investigate and manually set the logger in `run_pipeline`.
If I manually implement the MLFlow boilerplate in `run_pipeline`, I will share it here or in another issue specific to MLFlow, depending on how you want to organize the issues. Can I close the issue?
Feel free to leave it open if you plan to share back any findings here; that would be super useful :) It would be good to know how the two conflict, as I'm not familiar with MLFlow or why they would clash.
mlflow is working correctly with NePS; as I suggested, it is my framework that makes a fuss. It does not stop/start mlflow runs correctly in the NePS use case, so a manual `mlflow.end_run()` is required at the end of `run_pipeline`. I added a bit more detail below; it's mostly irrelevant if you do not use the same library, but if you are interested in a DL use case, it may be of interest.
mlflow works well with basic boilerplate. It logs the hyperparameters and metrics of each run. In a multi-fidelity setting, it's fairly easy to resume a previous (mlflow) run. Sadly, because I use my framework, I cannot offer a detailed template, but here's the idea.
```python
def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
    # Instantiate the model, the optimizer, and everything else
    # Insert the mlflow setup: experiment and run names; log the HPs
    # Train + validation for NePS
    # Close mlflow and do other post-run work
```
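To make the skeleton above concrete, here is a minimal hedged sketch of what the mlflow boilerplate could look like. The experiment name, metric key, run-id file name, and the `train_and_validate` helper are all illustrative placeholders, not from the original post; the `with mlflow.start_run(...)` block is what guarantees the manual `mlflow.end_run()` mentioned above, even if training raises.

```python
from pathlib import Path


def train_and_validate(**hyperparameters):
    # Stand-in for the real training + validation loop.
    return 0.5  # dummy validation loss


def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
    # Imported lazily so this sketch can be loaded without mlflow installed.
    import mlflow

    mlflow.set_experiment("neps-hpo")  # hypothetical experiment name

    # Multi-fidelity: if NePS hands us a lower-fidelity checkpoint directory,
    # reuse the mlflow run id persisted there; otherwise start a fresh run.
    run_id = None
    if previous_pipeline_directory is not None:
        id_file = Path(previous_pipeline_directory) / "mlflow_run_id.txt"
        if id_file.exists():
            run_id = id_file.read_text().strip()

    with mlflow.start_run(run_id=run_id, run_name=Path(pipeline_directory).name):
        mlflow.log_params(config)
        val_loss = train_and_validate(**config)
        mlflow.log_metric("val_loss", val_loss)
        # Persist the run id so a higher-fidelity continuation can resume it.
        (Path(pipeline_directory) / "mlflow_run_id.txt").write_text(
            mlflow.active_run().info.run_id
        )
    # Leaving the `with` block calls mlflow.end_run() for us.
    return val_loss
```

The context manager replaces the explicit `mlflow.end_run()` call, which is handy precisely when a surrounding framework fails to close runs itself.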
I use the mmlab suite of libraries, notably mmengine, mmcv, and mmdet, and it manages one run in a `Runner` object that has a lot of responsibilities. This is a highly modular framework, so everything is more or less plug-and-play. For instance, the runner instantiates a loop (training, test, and/or val), an optimizer (by means of a supplemental wrapper), LR scaling, parameter schedulers, sets of hooks, a message hub / log service, the model, the pipeline (`train_dataloader` plus its processing), and so on. These components can be set up with a configuration file and use registries for instantiation based on that file. The `Runner` is the glue for everything.
Thus, the `Runner` is meant to be the object with the longest lifetime. Plus, it's not meant to be reusable: one `Runner` object is for one model performing one task (train+validation, or test). That's it. The NePS use case is different, and `Runner` is not flexible enough. In other words, NePS manages multiple runs while `Runner` is meant to manage one. So for each NePS run, we have to instantiate everything again from scratch, even if most of the components could be reused.
Obviously, it might be doable to extend mmengine's `Runner`, but it would take some time, and a regular user does not (should not?) have to change the ~2k LoC of `Runner`. The "quickest" way is to instantiate a fresh runner for each `run_pipeline` call.
Hello,
I am starting to use NePS for HPO. I see that there is a class for logging to TensorBoard, which is great. Are you interested in making an interface for other loggers? Other logging tools are common, such as MLFlow (the one I use), WandB, and so on.
Thanks for NePS; the tool seems interesting and the documentation is great. I have little knowledge of HPO.