bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Implement the ML Flow experiment tracker #54

Open slippylolo opened 3 years ago

slippylolo commented 3 years ago

Motivation. As @sashavor suggested, the carbon footprint working group needs an experiment tracker to properly follow all runs being done. An experiment tracker could also be more broadly useful for centralising all our experiments in one place.

Proposed solution. Following a discussion with @thomwolf, the carbon WG has identified MLflow as a promising open-source option: it supports a dedicated tracking server and can interface with the TensorBoard logs we already produce. This blog post shows how it integrates with TensorFlow, and there is also documentation on how to interface with PyTorch models.

Implementation. It's not quite clear how nicely MLflow will play with Megatron/DeepSpeed and the limited networking on Jean Zay. The goal here is to first build a proof of concept showing that MLflow can integrate into our codebase and phone back to the centralised server from Jean Zay. We can then consider the finer details of reporting all the metrics of interest to us.
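
A minimal sketch of what the client side of such a proof of concept could look like, assuming a reachable central tracking server; the URI, experiment name, and metric names below are placeholders, not actual project values:

```python
import mlflow

# Point the client at the centralised server (placeholder URI).
mlflow.set_tracking_uri("http://mlflow.example.org:5000")
mlflow.set_experiment("jean-zay-poc")  # placeholder experiment name

with mlflow.start_run():
    # Illustrative hyperparameter and metric, not our actual ones.
    mlflow.log_param("global_batch_size", 512)
    mlflow.log_metric("lm_loss", 3.21, step=100)
```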

stas00 commented 3 years ago

Here is the current setup:

huu4ontocord commented 3 years ago

Is it that the GPU instances can download stuff from the internet but can't expose a port? We could use ngrok or a tunnel if that is permitted: https://stackoverflow.com/questions/61615818/setting-up-mlflow-on-google-colab. You could run mlflow in the background while your job is running, and after you are done you can transfer the logs (which I believe can be a SQLite file) to another server for further analysis.

One issue is that there's no OAuth authentication (as yet) with mlflow, so you could set up simple password protection via ngrok.

If it's not permitted to directly access the GPU instances via ngrok, you can still log to mlflow locally, save the data away into the SQLite logs, and periodically transfer them to another server for visualization, the way you do with the tensorboard logs. Not as real-time, but probably the best option, as we know this approach works for tensorboard.
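
A minimal sketch of that offline-logging idea, assuming a local SQLite backend; the file path and metric name are placeholders:

```python
import mlflow

# Write runs to a local SQLite file instead of a tracking server; the
# resulting mlflow.db can later be shipped elsewhere for analysis.
mlflow.set_tracking_uri("sqlite:///mlflow.db")

with mlflow.start_run():
    mlflow.log_metric("loss", 0.42, step=1)  # placeholder metric
```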

As I think about this more, for security reasons, I would not want to use the ngrok method, even if it is permitted.

Another alternative is to use the resource-limited partition as a REST API proxy to an external MLflow server, but that still brings security concerns. You would need a whitelist to permit only internal JZ nodes to talk to the proxy. MLflow has rich REST API endpoints: https://www.mlflow.org/docs/latest/rest-api.html

stas00 commented 3 years ago

There is no internet on GPU instances, period. The only way to communicate from those to the outside world is via the shared filesystem; files there can then be picked up by a Slurm process on a special partition that has no GPUs but can broadcast to the world, which is what we do with the tensorboard logs on an hourly basis.

If you have other logs like the tensorboard logs, we can easily send those files to the hub on an hourly basis, and from there you can do anything you want with them.

sashavor commented 3 years ago

MLflow does generate TensorBoard logs, so that's definitely possible.

We had spoken with @thomwolf about setting up a server instance that we can share all the logs to, but I guess if there's zero internet access, that wouldn't be possible.

Maybe we can get them to open up a single port specifically for this purpose?

stas00 commented 3 years ago

> Maybe we can get them to open up a single port specifically for this purpose?

It's very unlikely, but there is no harm in asking. I will ask.

huu4ontocord commented 3 years ago

I think copying the logs over to another external server (maybe every 15 minutes?) might be a good solution.

You can use the file store, which I believe saves to ./mlruns by experiment number (https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded). You could gzip the changed data from these folders and copy it to the external server.
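
A rough sketch of such a periodic copy, assuming MLflow's default ./mlruns file store and a shared-filesystem drop point; both paths and the interval are placeholders:

```python
import shutil
import time

SRC = "./mlruns"               # MLflow file store (default location)
DST = "/shared/outbox/mlruns"  # shared-filesystem drop point (placeholder)

while True:
    # Archive the whole store; the existing hourly sync job can then
    # pick up /shared/outbox/mlruns.tar.gz and push it off-cluster.
    shutil.make_archive(DST, "gztar", SRC)
    time.sleep(15 * 60)  # every 15 minutes, as suggested above
```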

Or you can use the SQLite storage and copy the whole DB over:

https://medium.com/@moyukh_51433/mlflow-storing-artifacts-in-hdfs-and-in-an-sqlite-db-7be26971b6ab

> Use --backend-store-uri to configure the type of backend store. You specify a file store backend as ./path_to_store or file:/path_to_store and a database-backed store as a SQLAlchemy database URI. The database URI typically takes the format <dialect>+<driver>://<username>:<password>@<host>:<port>/<database>. MLflow supports the database dialects mysql, mssql, sqlite, and postgresql. Drivers are optional. If you do not specify a driver, SQLAlchemy uses a dialect's default driver. For example, --backend-store-uri sqlite:///mlflow.db would use a local SQLite database.
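
For the receiving side, a sketch of launching a tracking server over the copied SQLite file; the paths are placeholders:

```python
import subprocess

# Equivalent to running on the external server:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root ./mlartifacts
subprocess.run([
    "mlflow", "server",
    "--backend-store-uri", "sqlite:///mlflow.db",
    "--default-artifact-root", "./mlartifacts",
], check=True)
```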

Maybe if you don't want to copy the whole SQLite file over, you could run sqldiff and import the diff into the remote SQLite-backed server:

https://www.sqlite.org/sqldiff.html

But that will probably take a bit of coding.
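
A sketch of what that coding could look like, assuming the sqldiff and sqlite3 binaries are available; all file names are placeholders:

```python
import subprocess

# Compute the SQL statements that transform the previously synced copy
# into the current local DB...
diff = subprocess.run(
    ["sqldiff", "last_synced.db", "mlflow.db"],
    capture_output=True, text=True, check=True,
).stdout

# ...and apply them to the remote SQLite-backed store.
subprocess.run(["sqlite3", "remote_mlflow.db"],
               input=diff, text=True, check=True)
```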

stas00 commented 3 years ago

> Maybe we can get them to open up a single port specifically for this purpose?
>
> It's very unlikely, but there is no harm in asking. I will ask.

The answer by the JZ admin was no.

sashavor commented 3 years ago

OK, so we can go with an option to save logs locally and copy them over periodically?


stas00 commented 3 years ago

Absolutely.

Just make a PR that implements what's needed:

  1. with a CLI option to activate it (a sketch follows below)
  2. updated dependencies file

and then we will create a target repo on the hub to sync data to.
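
A hedged sketch of what such a PR could add; the flag names and the hook into Megatron's argument parser are hypothetical, not actual project code:

```python
import argparse

import mlflow


def add_mlflow_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical flags; names are illustrative only.
    group = parser.add_argument_group("mlflow")
    group.add_argument("--log-mlflow", action="store_true",
                       help="Enable MLflow experiment tracking.")
    group.add_argument("--mlflow-tracking-uri", default="file:./mlruns",
                       help="Where MLflow writes runs (local file store by default).")
    return parser


def maybe_start_mlflow(args):
    # Activate tracking only when the CLI option is passed.
    if args.log_mlflow:
        mlflow.set_tracking_uri(args.mlflow_tracking_uri)
        mlflow.start_run()
```

With a file-store tracking URI, the runs land on the shared filesystem and can be synced to the hub repo the same way the tensorboard logs are.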