automl / neps

Neural Pipeline Search (NePS): Helps deep learning experts find the best neural pipeline.
https://automl.github.io/neps/
Apache License 2.0

[Question] Interface for generic logger #114

Open AwePhD opened 1 month ago

AwePhD commented 1 month ago

Hello,

I am starting to use NePS for HPO. I see that there is a class for logging to TensorBoard, which is great. Would you be interested in providing an interface for arbitrary loggers? Other logging tools are common, such as MLFlow (the one I use), WandB, and so on.

Thanks for NePS, the tool seems interesting and the documentation is great. I have little knowledge of HPO.

eddiebergman commented 1 month ago

Hi @AwePhD,

Thanks for the kind words! We don't have any immediate plans to integrate other loggers, primarily because it introduces some maintenance overhead. I don't know about MLFlow, but for W&B, since you control the run_pipeline function, you should be able to just stick the logging in there without a problem!
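Something along these lines should work. This is only an untested sketch; the project name, the metric key, and the train_and_evaluate helper are placeholders for your own code:

    import wandb

    def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
        # Start one W&B run per NePS evaluation (project name is a placeholder)
        wandb.init(project="neps-example", config=config, reinit=True)

        val_loss = train_and_evaluate(**config)  # your own training/validation code

        wandb.log({"val_loss": val_loss})
        wandb.finish()  # close the run so the next evaluation starts fresh
        return val_loss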

If you manage to get it to work, we'd be delighted if you could share a sample script or similar that we could include in the documentation and share with others :)

Best, Eddie

AwePhD commented 2 weeks ago

Okay, thanks for the answer. Long story short, my (high-level) domain-specific deep learning framework implements the MLFlow settings and boilerplate, but it seems to conflict with the NePS use case. I might investigate and set up the logger manually in run_pipeline.

If I manually implement the MLFlow boilerplate in run_pipeline, I will share it here or in another issue specific to MLFlow, depending on how you want to organize the issues. Can I close the issue?

eddiebergman commented 2 weeks ago

Feel free to leave it open if you plan to share back any findings here; that would be super useful :) It would also be good to know how the two conflict, as I'm not familiar with MLFlow or why they would clash.

AwePhD commented 2 weeks ago

mlflow works correctly with NePS; as I suggested, it is my framework that makes a fuss. It does not start/stop mlflow runs correctly in the NePS use case, so a manual mlflow.end_run is required at the end of run_pipeline. I added a bit more detail below; it's mostly irrelevant if you do not use the same library, but it may be of interest as a DL use case.

mlflow works well with basic boilerplate. It logs the HPs and metrics of each run, and in a multi-fidelity setting it is fairly easy to resume a previous (mlflow) run. Sadly, because I use my framework, I cannot offer a detailed template, but here is the idea.

def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
    # Instantiate model, optimizer and everything else
    # Insert mlflow setup: experiment and run names, log the HPs
    # Train + validation, report the objective to NePS
    # Close mlflow and do any other post-run work
    ...
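Without my framework, a more concrete (untested) sketch of that skeleton could look like the following. The experiment name and the train_and_evaluate helper are placeholders, I assume NePS passes the directories as pathlib.Path objects, and I stash the mlflow run id in the pipeline directory so that a higher-fidelity continuation can resume the same mlflow run:

    import mlflow

    def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
        mlflow.set_experiment("neps-hpo")  # placeholder experiment name

        # If NePS continues a config at a higher fidelity, resume its mlflow run
        previous_run_id = None
        if previous_pipeline_directory is not None:
            prev_file = previous_pipeline_directory / "mlflow_run_id.txt"
            if prev_file.exists():
                previous_run_id = prev_file.read_text().strip()

        with mlflow.start_run(run_id=previous_run_id) as run:
            (pipeline_directory / "mlflow_run_id.txt").write_text(run.info.run_id)
            mlflow.log_params(config)  # log the HPs of this evaluation

            val_loss = train_and_evaluate(**config)  # your own training/validation code
            mlflow.log_metric("val_loss", val_loss)

        # leaving the `with` block calls mlflow.end_run(), which is the manual
        # cleanup mentioned above
        return val_loss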

I use the mmlab suite of libraries, notably mmengine, mmcv and mmdet, which manages one run through a Runner object that has a lot of responsibilities. It is a highly modular framework, so everything is more or less plug and play. For instance, the Runner instantiates the loops (training, test and/or val), the optimizer (via a supplemental wrapper), LR scaling, parameter schedulers, sets of hooks, a MessageHub/logging service, the model, the data pipeline (train_dataloader plus its processing), and so on. These components can be set up with a configuration file and use registries for instantiation based on that config. The Runner is the glue for everything.

Thus, the Runner is meant to be the object with the longest lifetime, and it is not meant to be reusable: one Runner object is for one model performing one task (train+validation or test). That's it. The NePS use case is different and the Runner is not flexible enough. In other words, NePS manages multiple runs while a Runner is meant to manage one run. So for each NePS run, we have to instantiate everything again from scratch, even though most of the components could be reused.

Obviously, it might be doable to extend mmengine's Runner, but that would take some time, and a regular user does not (should not?) have to touch the ~2k LoC of Runner. The "quickest" way is to instantiate a Runner for each run_pipeline call.
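In practice that per-call instantiation looks roughly like this. It is an untested sketch: the config file, the "lr" hyperparameter and the keys I override are placeholders that depend on your config layout, and the helper that reads the validation metric back is hypothetical:

    from mmengine.config import Config
    from mmengine.runner import Runner

    def run_pipeline(pipeline_directory, previous_pipeline_directory, **config):
        cfg = Config.fromfile("configs/my_model.py")  # placeholder config file

        # Map the NePS hyperparameters onto the mmengine config
        cfg.optim_wrapper.optimizer.lr = config["lr"]
        cfg.work_dir = str(pipeline_directory)

        # Build one fresh Runner per NePS evaluation, since a Runner manages
        # exactly one run
        runner = Runner.from_cfg(cfg)
        runner.train()

        # Placeholder helper: read the validation metric back (e.g. from the
        # work_dir logs) and hand it to NePS as the objective
        val_loss = read_val_loss_from(pipeline_directory)
        return val_loss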