Different topic regarding ML and our facilities: many of our clusters use SLURM or similar schedulers to allocate resources. However, this does not fit well with ML frameworks for hyperparameter optimization (at least for anything more clever than grid search). For example, using Ray Tune requires quite a bit of fiddling to work well (I have some demos here). Similarly, synchronization to cloud-based dashboards is hampered by compute nodes not having internet access (not sure if this is universal). For example, I had to write this to trigger synchronization to Weights & Biases conveniently/live.
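For context, a minimal sketch of the standard offline workaround (not the live-sync helper referenced above; project name and hyperparameters are placeholders): the run is logged locally on the compute node in offline mode and later pushed with `wandb sync` from a node that does have internet access.

```python
# Hypothetical sketch: log offline on an internet-less compute node,
# then upload the run later from the login node.
import wandb

run = wandb.init(
    project="hpo-demo",      # placeholder project name
    mode="offline",          # no outbound connection needed on the compute node
    config={"lr": 1e-3, "batch_size": 256},
)

for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # stand-in for a real training step
    wandb.log({"epoch": epoch, "loss": loss})

run.finish()

# The run is written under ./wandb/offline-run-*; from a node with internet
# access it can be uploaded with:
#   wandb sync wandb/offline-run-*
```

The downside of this pattern is exactly the issue raised above: the dashboard only updates after the job finishes (or after a manual sync), which is why live synchronization needs extra tooling.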
Bottom line: it seems like our systems are not exactly first-class citizens from the point of view of the developers of these tools, who receive most of their support and applications from industry.
Interested to learn more about ML workflows in facilities and also about https://github.com/HSF/PyHEP.dev-workshops/issues/26
Designing and training ML models in high energy physics is roughly where analysis was about 5-10 years ago. There is a very large disparity in outcomes depending on the resources and expertise available at specific institutions, and most people are relegated to a single GPU with, hopefully, a large amount of memory.
On top of that, the complete workflow, from model design and training to deploying that model (alongside others) efficiently at scale for multiple users, is so far poorly understood (but we are doing our best). We understand the individual parts reasonably well, but at present no one short of an expert can put together a decently functioning complete system.
Moreover, a complete and efficient system will be tightly coupled to the facility that hosts it, both because of the compute resources it needs and because of the data it requires.
Let's discuss what is needed to bring the steps of the ML-model lifecycle together in a cohesive workflow that addresses: