TheFoundryVisionmongers / nuke-ML-server

A Nuke client plug-in which connects to a Python server to allow Machine Learning inference in Nuke.
Apache License 2.0

Help managing GPU resources #22

Open ldmoser opened 3 years ago

ldmoser commented 3 years ago

I'm sure this problem can get very complicated and may require custom implementations, but I was wondering whether you have any intentions or ideas for managing the limited GPU resources across all MLClient nodes instantiated in a Nuke scene.

Nuke MLClient nodes could be talking to the same or different classes on the MLServer side, using any kind of backend (PyTorch, TensorFlow, ...). This is somewhat related to issue #21, but it goes beyond that because it deals with all the classes used in a Nuke session.

Here's a broad idea that may be a good discussion starter (see the sketch after this list):

1. Each instance of MLServer creates an LRU-cache object that holds pre-trained models. It could have options such as how many models the cache can hold, or how much GPU RAM it should keep free at any given time.
2. The MLServer exposes this cache object as an API to the model classes, which use it to register their model-loading method; that method is only called if the model is not already in the cache. The method should return a reference to the pre-trained model along with its location (e.g. "gpu0").
3. The LRU cache calls the registered loading method and catches memory exceptions raised while the model is being constructed. Such an exception triggers the eviction of the least recently used entries in the cache, after which the loading method is retried.
4. The LRU cache holds a reference to the object returned by the loading method (be it PyTorch, TensorFlow, or anything else).
5. After loading a model, the cache checks how much GPU memory remains on the corresponding device and evicts further models to guarantee the free-memory margin configured in step 1.
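To make the idea concrete, here is a rough Python sketch of points 1–4. The names (`ModelCache`, `register_loader`, `get`) and the `oom_exceptions` parameter are my own assumptions, not existing MLServer API, and the free-memory margin from point 5 is left out:

```python
# Minimal sketch of the LRU-cache idea above -- ModelCache, register_loader and
# get are hypothetical names, not part of the existing MLServer API.
from collections import OrderedDict


class ModelCache:
    """Holds at most `max_models` loaded models, evicting the least recently used."""

    def __init__(self, max_models=4, oom_exceptions=(MemoryError,)):
        # oom_exceptions would be the backend's out-of-memory error types,
        # e.g. (torch.cuda.OutOfMemoryError,) for PyTorch.
        self.max_models = max_models
        self.oom_exceptions = oom_exceptions
        self._loaders = {}           # key -> callable returning (model, device)
        self._cache = OrderedDict()  # key -> (model, device), most recent last

    def register_loader(self, key, loader):
        """Register a model-loading callable; it is only invoked on a cache miss."""
        self._loaders[key] = loader

    def get(self, key):
        """Return (model, device), loading and evicting as needed."""
        if key in self._cache:
            self._cache.move_to_end(key)   # mark as most recently used
            return self._cache[key]

        while True:
            try:
                model, device = self._loaders[key]()
                break
            except self.oom_exceptions:
                if not self._cache:
                    raise                  # nothing left to evict, give up
                self._evict_one()          # free space and retry the loader

        if len(self._cache) >= self.max_models:
            self._evict_one()
        self._cache[key] = (model, device)
        return model, device

    def _evict_one(self):
        """Drop the least recently used entry so the backend can free its GPU memory."""
        self._cache.popitem(last=False)
        # A real implementation would also trigger backend-specific cleanup here,
        # e.g. torch.cuda.empty_cache() once the last reference is dropped.
```

A model class would then register a closure that constructs its pre-trained network and returns it together with its device string (e.g. "gpu0"), and call `get()` from its inference method instead of loading the model directly.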

It feels like this more general approach would make issue #21 irrelevant, and it would handle complex scenarios, including multi-GPU setups.

Do you see a benefit in adding something like that to the MLServer API?