FluxML / FluxTraining.jl

A flexible neural net training library inspired by fast.ai
https://fluxml.ai/FluxTraining.jl
MIT License

Training loop profiler #86

Open lorenzoh opened 3 years ago

lorenzoh commented 3 years ago

Using Events as hooks into the training loop, it's possible to create a profiler for training loops that measures the time spent executing events as well as the time spent between events, i.e. in the training loop itself.
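
As a rough illustration, a timing callback along these lines might look as follows. This is a minimal sketch, not existing FluxTraining.jl functionality: it uses the documented `Callback`/`on` extension points, but the `EventTimer` name and dispatching on the abstract event type are assumptions made for illustration.

```julia
# Sketch of an event-timing callback (not part of FluxTraining.jl).
# Assumes the documented extension points `Callback` and `on`; handling
# all events via the abstract event type is an assumption.
using FluxTraining
import FluxTraining: Callback, on

struct EventTimer <: Callback
    # (event name => wall-clock timestamp) in the order events fired
    timestamps::Vector{Pair{Symbol,Float64}}
end
EventTimer() = EventTimer(Pair{Symbol,Float64}[])

# Record when each event fires; the gaps between consecutive timestamps
# approximate the time spent in the training loop between events.
function on(event::FluxTraining.Events.Event, phase, timer::EventTimer, learner)
    push!(timer.timestamps, Symbol(nameof(typeof(event))) => time())
end
```

Such a callback could then be passed to a `Learner` like any other callback, and the recorded timestamps give a first estimate of where time goes.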

This would make it easier to identify possible performance bottlenecks.

Thoughts on implementation

This could be implemented as a callback, though you would need two callbacks: one that runs before all other callbacks and one that runs after them (to measure callback times), which is unwieldy. This approach may also not play well with the asynchronous callback scheduler proposed in #85. A better solution, in my opinion, is to implement a callback execution context that takes timings before and after it runs the callbacks. It would wrap another callback execution context that it delegates to, and would thus also play nicely with the asynchronous callback scheduler, since it would measure only the time spent on the synchronous part. See the sketch below.
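
A sketch of that wrapping approach, assuming the execution context corresponds to a callback-runner interface with a `handle(runner, event, phase, learner)` method. The `CallbackRunner` and `handle` names follow FluxTraining's internals as I understand them, but should be treated as assumptions rather than confirmed API.

```julia
# Sketch of a profiling execution context (names are assumptions).
# It delegates to an inner runner and records how long the synchronous
# part of callback handling took for each event type.
using FluxTraining
import FluxTraining: CallbackRunner, handle

struct ProfilingRunner{R<:CallbackRunner} <: CallbackRunner
    inner::R
    # total seconds spent handling each event type
    times::Dict{Symbol,Float64}
end
ProfilingRunner(inner) = ProfilingRunner(inner, Dict{Symbol,Float64}())

function handle(runner::ProfilingRunner, event, phase, learner)
    t0 = time()
    result = handle(runner.inner, event, phase, learner)
    key = Symbol(nameof(typeof(event)))
    runner.times[key] = get(runner.times, key, 0.0) + (time() - t0)
    return result
end
```

Because the profiler only wraps whatever runner it is given, an asynchronous scheduler could be wrapped the same way and only its synchronous dispatch time would be measured.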

Interpretation

Events that specify start and stop points, like StepBegin and StepEnd, could be treated as a layer in the profiling stack. An existing package for visualizing flame graphs could possibly be reused to make sense of the profiling data, as sketched below.
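
A sketch of that interpretation step: pairing ...Begin/...End events, recorded as (name => timestamp) pairs like the `EventTimer` above produces, into a tree of nested spans. A flame-graph package such as FlameGraphs.jl could then visualize the result; the conversion to its format is left out, and all names here are illustrative.

```julia
# Sketch: build a span tree from paired Begin/End events.
# Assumes `events` is non-empty and begin/end events are properly nested.
mutable struct Span
    name::Symbol
    start::Float64
    stop::Float64
    children::Vector{Span}
end

function buildspans(events::Vector{Pair{Symbol,Float64}})
    root = Span(:Training, events[1][2], events[end][2], Span[])
    stack = [root]
    for (name, t) in events
        s = String(name)
        if endswith(s, "Begin")
            # open a new layer in the profiling stack, e.g. StepBegin => :Step
            span = Span(Symbol(s[1:end-5]), t, t, Span[])
            push!(stack[end].children, span)
            push!(stack, span)
        elseif endswith(s, "End") && length(stack) > 1
            # close the innermost open span
            stack[end].stop = t
            pop!(stack)
        end
    end
    return root
end
```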