dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Cuda v11.8 support for image classification problems? #7180

Closed RossHNPC closed 2 months ago

RossHNPC commented 3 months ago

Current image classification implementations, following Microsoft guidance, restrict us to CUDA SDK v10.1. This in turn limits which NVIDIA cards can be used. For a production install we need to use cards supported and supplied by IT vendors such as Dell. These tend to be newer cards beyond the Turing architecture: Ampere, Ada Lovelace, etc.

We would like CUDA SDK support to be raised to v11.8 so that a wider range of cards can be used.

The alternative at present is a code fix that blocks inference until the models are loaded and available, which can take anywhere from 1 to 20 minutes for a basic image classification model. Older cards (my laptop has a lowly T500) load in tens of seconds. For a real-time implementation, that is a lot of missed classifications.
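For illustration, the "block inference until the model is loaded" workaround could look roughly like the sketch below. It is written in Python purely as a language-neutral outline, not as ML.NET code; `ModelGate`, `slow_loader`, and the fake model it returns are hypothetical names invented for the example.

```python
import threading
import time


class ModelGate:
    """Holds back inference requests until a slow model load has finished."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical callable returning the loaded model
        self._ready = threading.Event()  # set once the model is usable
        self._model = None

    def start(self):
        # Load in the background so application startup is not blocked.
        threading.Thread(target=self._load, daemon=True).start()

    def _load(self):
        self._model = self._loader()
        self._ready.set()

    def predict(self, item, timeout=None):
        # Wait for the load to finish; give up if the timeout expires first.
        if not self._ready.wait(timeout):
            raise TimeoutError("model is still loading")
        return self._model(item)


def slow_loader():
    time.sleep(0.2)  # stands in for the 1 to 20 minute load described above
    return lambda item: f"label-for-{item}"


gate = ModelGate(slow_loader)
gate.start()
result = gate.predict("frame-001", timeout=5)
print(result)
```

Requests that arrive during the load either wait or time out, which is exactly the "missed classifications" cost mentioned above; the gate only makes that cost explicit.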

If there is no intention to update CUDA support, please let us know so that we can look at alternatives.

RossHNPC commented 2 months ago

Ultimately this was a configuration issue on our target server. Local development is carried out on T500 cards, whose Turing architecture is compatible with the native code, hence the very quick load time.

Production servers have Ampere-based cards, which require the PTX code to be JIT-compiled, hence the long load time. The JIT-compiled output should be cached on the server; however, the default cache size is 1GB, which isn't large enough for our purposes.
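To check whether the cache is hitting its limit, the size of the cache directory can be measured. The sketch below is a small Python helper, not part of ML.NET. It assumes the cache locations I believe NVIDIA documents (`%APPDATA%\NVIDIA\ComputeCache` on Windows, `~/.nv/ComputeCache` on Linux, overridable with `CUDA_CACHE_PATH`), so verify those against your driver's documentation. `default_cache_dir` and `dir_size_bytes` are hypothetical helper names.

```python
import os
from pathlib import Path

ASSUMED_DEFAULT_LIMIT = 1024 ** 3  # the 1GB default quoted above, taken as 1 GiB


def default_cache_dir():
    # Assumed locations; CUDA_CACHE_PATH takes priority when it is set.
    override = os.environ.get("CUDA_CACHE_PATH")
    if override:
        return Path(override)
    if os.name == "nt":
        return Path(os.environ.get("APPDATA", "")) / "NVIDIA" / "ComputeCache"
    return Path.home() / ".nv" / "ComputeCache"


def dir_size_bytes(root):
    # Total size of all regular files under root; 0 if the directory is missing.
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


cache_dir = default_cache_dir()
used = dir_size_bytes(cache_dir)
print(f"{cache_dir}: {used / 1024 ** 3:.2f} GiB in use")
if used >= 0.9 * ASSUMED_DEFAULT_LIMIT:
    print("Cache is at or near the assumed default limit; entries may be getting evicted.")
```

A cache that sits just under the limit while load times stay long is consistent with compiled kernels being evicted and recompiled on every start.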

The solution was to add CUDA_CACHE_MAXSIZE = 4294967296 (4 GiB, expressed in bytes) as an environment variable.
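As a rough sketch of where that number comes from and how it reaches a process, the Python snippet below derives the value and passes it to a child process through its environment. The child here is only a stand-in that echoes the variable; on our server the variable was set machine-wide instead. `env_with_cache_limit` is a hypothetical helper name.

```python
import os
import subprocess
import sys

GIB = 1024 ** 3
CACHE_LIMIT_BYTES = 4 * GIB  # 4294967296, the value used above


def env_with_cache_limit(limit_bytes=CACHE_LIMIT_BYTES):
    # Copy the current environment and add the cache limit for the child process.
    env = dict(os.environ)
    env["CUDA_CACHE_MAXSIZE"] = str(limit_bytes)
    return env


# Stand-in for launching the real inference service: the child simply
# prints the variable it inherited.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_CACHE_MAXSIZE'])"],
    env=env_with_cache_limit(),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

The variable has to be present in the environment of the process that loads the model before that process starts using the GPU, which is why setting it at the machine or service level is the simplest option.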

The final cache was 1.2GB, and the model now loads as quickly as on the Turing cards.