MLSysOps / MLE-agent

🤖 MLE-Agent: Your intelligent companion for seamless AI engineering and research. 🔍 Integrates with arXiv and Papers with Code to provide better code/research plans. 🧰 OpenAI, Anthropic, Ollama, etc. supported. 🎆 Code RAG
https://mle-agent-site.vercel.app/
MIT License

[feature] add `mle analyze` to summarize training logs and give optimal config #222

Open leeeizhang opened 3 weeks ago

leeeizhang commented 3 weeks ago

RT

huangyz0918 commented 6 days ago

Any updates on the issue?

leeeizhang commented 6 days ago

I think `mle analyze` could work as follows:

  1. ML Experiment Summary: summarize recent experiment runs, including the hyper-parameters users tried and the results (e.g., loss or accuracy) for those runs. To support this, we should first integrate some ML tracking tools (e.g., MLflow, W&B).

  2. Training Suggestor (e.g., HPO, NAS, etc.): suggest hyper-parameters and model architectures based on the experiment summary. Furthermore, we could allow mle-agent to automatically explore training configurations and execute them, a direction to pursue in the future.

  3. Weekly Experiment Report: the ML experiment summary could also feed into the weekly report, as ML scientists are likely very interested in their weekly experimental progress. The experiment summary could be an important data source for the report agent.
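A minimal sketch of the first task, assuming run records have already been fetched from a tracking tool; the record shape and the `summarize_runs` helper below are hypothetical, not part of the current codebase:

```python
def summarize_runs(runs):
    """Build a plain-text summary of experiment runs.

    `runs` is a list of dicts with hypothetical keys:
    {"name": str, "config": {hyper-param: value}, "best_accuracy": float}
    """
    lines = ["Recent experiment runs:"]
    # Sort so the best-performing run comes first.
    for run in sorted(runs, key=lambda r: r["best_accuracy"], reverse=True):
        params = ", ".join(f"{k}={v}" for k, v in sorted(run["config"].items()))
        lines.append(f"- {run['name']}: acc={run['best_accuracy']:.3f} ({params})")
    return "\n".join(lines)


runs = [
    {"name": "run-a", "config": {"lr": 1e-3, "batch_size": 32}, "best_accuracy": 0.91},
    {"name": "run-b", "config": {"lr": 1e-4, "batch_size": 64}, "best_accuracy": 0.87},
]
summary = summarize_runs(runs)
```

The same summary string could then be embedded in the suggestor's prompt or in the weekly report.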

The divided tasks are:

References:

huangyz0918 commented 5 days ago

What about integrating W&B first? By fetching the experiments, we can get the metrics, the code, the hyper-parameters, etc. Then we use an agent to give suggestions, also calling some web search functions (e.g., the Papers with Code benchmarks) to produce a summary and suggestions (e.g., how to tune the parameters to improve certain metrics).
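A rough sketch of the fetching side using the W&B public API (`wandb.Api().runs(...)`, `run.config`, `run.summary`). The record shape is an assumption; the `wandb` import is kept local since it needs an installed package and a logged-in API key:

```python
def run_to_record(name, config, summary):
    """Normalize one run into a plain dict the suggestion agent can consume.

    Pure helper; works on data shaped like W&B's `run.config` / `run.summary`.
    """
    return {
        "name": name,
        "config": dict(config),
        # W&B keeps the last logged value of each metric in run.summary;
        # keep only the numeric entries.
        "metrics": {k: v for k, v in summary.items() if isinstance(v, (int, float))},
    }


def fetch_wandb_records(project_path):
    """Fetch all runs of a project via the W&B public API.

    `project_path` is "entity/project". Requires `wandb` to be installed
    and authenticated, so the import stays inside the function.
    """
    import wandb

    api = wandb.Api()
    return [run_to_record(r.name, r.config, dict(r.summary)) for r in api.runs(project_path)]
```

The normalized records could then be passed to the agent prompt or to a web-search step for benchmark comparison.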

huangyz0918 commented 5 days ago

The train/validation accuracy and the loss are time-series data; how to analyze such data with an LLM is an interesting problem in itself. Alternatively, we can fetch the visualizations from W&B and try to analyze the images directly using the multi-modal ability. You can have a try and then we can discuss.
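One simple way to fit a long metric curve into an LLM prompt is to downsample it to a handful of evenly spaced points; the helper below is just a sketch of that idea, not an existing function:

```python
def downsample_curve(values, n_points=8):
    """Reduce a long metric curve to a few evenly spaced samples.

    Keeps the first and last points so the overall trend and the
    final value survive the compression.
    """
    if len(values) <= n_points:
        return list(values)
    step = (len(values) - 1) / (n_points - 1)
    return [values[round(i * step)] for i in range(n_points)]


# Fake decaying loss curve standing in for run.history() data.
loss_curve = [1.0 / (1 + 0.1 * t) for t in range(100)]
sampled = downsample_curve(loss_curve, n_points=5)
```

The sampled points can then be serialized into the prompt as `epoch: value` pairs, while the multi-modal image-analysis route stays a separate experiment.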

leeeizhang commented 5 days ago

> What about integrating W&B first? By fetching the experiments, we can get the metrics, the code, the hyper-parameters, etc. Then we use an agent to give suggestions, also calling some web search functions (e.g., the Papers with Code benchmarks) to produce a summary and suggestions (e.g., how to tune the parameters to improve certain metrics).

That's right, I will integrate W&B first. For the analysis part, we can incorporate the existing AdviseAgent to summarize and suggest.

> The train/validation accuracy and the loss are time-series data; how to analyze such data with an LLM is an interesting problem in itself. Alternatively, we can fetch the visualizations from W&B and try to analyze the images directly using the multi-modal ability. You can have a try and then we can discuss.

Agree with you! How to analyze time-series data is still an open question for exploration, and using multi-modal ability to directly analyze experimental plots or charts is well worth trying. Nevertheless, since the most common NAS and HPO algorithms still use the final/best accuracy or loss for analysis and tuning, we may also keep the option of directly using each run's best/final metrics as prompts to build the agent in our initial PoC.
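For that PoC, the prompt could be as simple as one line per run with its best metric value; everything below (record shape, helper name, prompt wording) is a hypothetical sketch:

```python
def best_metric_prompt(runs, metric="val_accuracy", maximize=True):
    """Format each run's best value of one metric into a compact prompt.

    `runs`: hypothetical list of {"name": str, "history": [float, ...]}
    records, where `history` is the per-epoch curve of `metric`.
    """
    pick = max if maximize else min  # best = max for accuracy, min for loss
    rows = [f"{r['name']}: best {metric}={pick(r['history']):.4f}" for r in runs]
    return "Suggest hyper-parameters given these results:\n" + "\n".join(rows)


prompt = best_metric_prompt(
    [
        {"name": "run-a", "history": [0.70, 0.82, 0.79]},
        {"name": "run-b", "history": [0.65, 0.74, 0.88]},
    ]
)
```

This keeps the first iteration close to what standard HPO loops already consume, while leaving curve-level and image-level analysis for later.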