dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Webhosting of models and high memory consumption #5432

Closed: ddobric closed this issue 3 years ago

ddobric commented 3 years ago

We have a web application that hosts a trained model so that users can run prediction scenarios. The solution is based on the ImageClassificationModelTraining.Solution in the machinelearning-samples repo. The training was done with the following code (just a snippet, in case it is relevant):

    var pipeline = mlContext.MulticlassClassification.Trainers.ImageClassification(options: hyperParams)
        .Append(mlContext.Transforms.Conversion.MapKeyToValue(
            outputColumnName: "PredictedLabel",
            inputColumnName: "PredictedLabel"));

    // Apply 5-fold cross validation
    var cvResults = mlContext.MulticlassClassification.CrossValidate(
        cvDataView, pipeline, numberOfFolds: numOfFolds, labelColumnName: "LabelAsKey", seed: 8881);

    // Get the best model, which is in first place
    var topModel = cvResults[0].Model;

    // Show the performance metrics for the multi-class classification
    var metrics = mlContext.MulticlassClassification.Evaluate(
        cvResults[0].ScoredHoldOutSet, labelColumnName: "LabelAsKey", predictedLabelColumnName: "PredictedLabel");
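
Presumably, the best model is then saved to disk so that the web application can load it. A minimal sketch of that step, assuming the same modelFullPathName that the pool loader below reads:

    // Sketch (assumption): persist the best cross-validated model so the web application
    // can load it later. modelFullPathName is assumed to be the path used by LoadPool below.
    mlContext.Model.Save(topModel, cvDataView.Schema, modelFullPathName);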

To make this work, we load a pool of PredictionEngine instances, which are assigned to incoming requests. The following code shows how the instances are created at startup.

    private List<PredictionEngine<TSrc, TDest>> LoadPool(string modelFullPathName)
    {
        List<PredictionEngine<TSrc, TDest>> engines = new List<PredictionEngine<TSrc, TDest>>();

        for (int i = 0; i < config.PoolSize; i++)
        {
            // Load the trained model from disk and wrap it in a prediction engine.
            var mlnetModel = mlContext.Model.Load(modelFullPathName, out _); // ModelInfo.ServerFilePath
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TSrc, TDest>(mlnetModel);
            engines.Add(predictionEngine);
        }

        return engines;
    }
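
For context, a sketch of how an engine from this pool might serve a single prediction request; the InMemoryImageData and ImagePrediction types and their properties are assumptions borrowed from the image-classification sample, not from the code above:

    // Hedged usage sketch: "engines" is the list returned by LoadPool above.
    // InMemoryImageData / ImagePrediction are assumed input/output classes from the sample.
    var engines = LoadPool(modelFullPathName);

    var input = new InMemoryImageData
    {
        Image = System.IO.File.ReadAllBytes(imagePath),        // imagePath: placeholder
        ImageFileName = System.IO.Path.GetFileName(imagePath)
    };

    // PredictionEngine is not thread-safe, so each instance should serve only one
    // request at a time (which is the reason for the pool).
    var prediction = engines[0].Predict(input);
    Console.WriteLine($"Predicted label: {prediction.PredictedLabel}");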

This part works fine. Next, we measured how much RAM the application needs when deployed to an App Service (or anywhere else). We found that the memory consumption of the trained model is extremely high. The following diagram shows the behaviour of the prediction engine. First, we load a set of prediction engines (in this example, 6 instances) using the code shown above.

[Figure: memory usage of the application after loading six prediction engine instances and around the first Predict call]

After loading, some RAM is consumed. However, on the first Predict call of a particular prediction engine instance there is a peak of 1.5-2.0 GB. After the peak, the memory consumption stabilizes again.

The issue with the peak is that, when it happens, it sometimes causes the App Service health check to restart the service. This is not ideal, but it can be worked around by choosing a higher App Service tier. However, it would be good to know where the peak comes from, in case it can grow beyond 2 GB.

Another negative observation is the high memory consumption of a single prediction engine instance. The following diagram shows consumption as a function of the number of prediction engine instances.

[Figure: memory consumption vs. number of prediction engine instances: after loading (blue), after the first Predict call on each instance (green), and the formula-based estimate (dotted line)]

The blue line shows the consumption after loading the prediction engines, and the green line shows the consumption after the Predict method has been invoked on each of them.

The dotted line is the memory consumption as calculated by the formula shown in the diagram. The issue with this behaviour is that a single prediction engine instance consumes approximately 600 MB, which is too much. From that it is easy to calculate what an App Service with just 100 concurrent users would cost; it is too much for this scenario.

We understand and agree that training is a heavy scenario and may require a lot of memory and CPU resources. However, trained models must be more lightweight.


harishsk commented 3 years ago

Hi @ddobric Given that .NET code runs in a managed memory environment, memory usage patterns are somewhat non-deterministic and depend on how the garbage collector behaves. In most cases this works out as expected, but occasionally, when a managed object holds references to unmanaged memory, this can be a problem. In this particular case, the image classification trainer relies on TF.NET, which in turn holds on to unmanaged memory in the TensorFlow core.

It is possible to bring in some amount of determinism by explicitly disposing of some of the objects that reference TF.NET objects. You can do that by disposing of the unused models in the list of results returned from mlContext.MulticlassClassification.CrossValidate (that is, call (cvResults[i].Model as IDisposable)?.Dispose() for every cvResults entry with index > 0, i.e. the models that you are not using). Also remember to dispose of the top model after you are done using it.

Also, since your model relies on TF.NET, you can control the memory usage a bit further by disposing of the loaded mlnetModel in the same way.
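
A minimal sketch of the disposal pattern described above, assuming the cvResults, topModel, and mlnetModel variables from the earlier snippets:

    // Dispose of the cross-validation models that are not used (everything except index 0),
    // so the unmanaged TF.NET/TensorFlow memory they reference can be released sooner.
    for (int i = 1; i < cvResults.Count; i++)
    {
        (cvResults[i].Model as IDisposable)?.Dispose();
    }

    // When the top model (cvResults[0].Model) is no longer needed, dispose of it as well.
    (topModel as IDisposable)?.Dispose();

    // In the web application, the same applies to the transformer loaded from disk,
    // once its prediction engine is no longer needed.
    (mlnetModel as IDisposable)?.Dispose();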

You can find more examples of this in TensorflowTests

Hope that answers your questions.

ddobric commented 3 years ago

Hi @harishsk,

thanks for your answer. I agree regarding memory usage in the TF.NET and TensorFlow chain. However, the training code posted above with cvResults is not running in the web application; I showed it only to explain which sample my solution is based on. The code that runs in the web application is the LoadPool method. That code only performs prediction and should not consume gigabytes of memory. The diagrams shown above have nothing to do with training. The web application does the following:

  1. Load the model
  2. Create a prediction engine
  3. Invoke Predict.

I hope this clarifies the issue.

harishsk commented 3 years ago

Hi @ddobric, Can you please share a small but complete repro that illustrates your problem?

QuangBui3101 commented 3 years ago

Hi @harishsk,

following up on @ddobric's issue, the ASP.NET Core application in this repository illustrates the problem. When the application starts, it opens a browser with a predefined URL, which then triggers the following sequence:

  1. Load the model and create the prediction engine
  2. Invoke Predict on the prediction engine
  3. Invoke Predict a second time

The number of model instances loaded into memory is defined in appsettings.json under EnginesConfig.PoolSize and can be changed there. The following is an example snippet of appsettings.json:

{
    ...
    "EnginesConfig": {
        "PoolSize": 5,
        "ModelName": "114_Repcon_KRE_zip"
    }
}
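
For reference, a hedged sketch of how such a section could be bound to a settings class in ASP.NET Core; the EnginesConfig class shown here is an assumption about the repro project, not taken from it:

    // Assumed settings class matching the "EnginesConfig" section above.
    public class EnginesConfig
    {
        public int PoolSize { get; set; }
        public string ModelName { get; set; }
    }

    // In Startup.ConfigureServices (or the host builder), something like:
    //     services.Configure<EnginesConfig>(Configuration.GetSection("EnginesConfig"));
    // The pool loader can then receive IOptions<EnginesConfig> and read config.PoolSize.
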
harishsk commented 3 years ago

Hi @QuangBui3101 and @ddobric,

Thank you for the repro case. I have been debugging it and evaluating the memory usage. The increase you see on the first call is not unexpected and not due to ML.NET.

If you turn on symbol server and source server in your debugging options, you can follow along with the explanation below.

Firstly, ML.NET is designed from the ground up for lazy evaluation. That means tasks such as memory allocation and computation are deferred until they are actually necessary. So when you load the model and create the prediction engine, not all of the necessary memory is allocated right away (e.g. the Classifier object in ImageClassificationTrainer.cs); those objects are created only on the first call to Predict. But those objects do not account for most of the memory usage in this case. The sudden increase of close to 500K that you see on the first call comes almost entirely from the call to Classifier.Score, specifically from within the two calls to ProcessImage and _runner.AddInput.

Each of those calls ends up calling c_api.TF_SessionRun. Almost all of the memory increase you see comes from within TF.NET and TF. You are seeing an increase of about 500KB per prediction engine, and almost all of that comes from TF. With multiple prediction engines instantiated simultaneously, it is reasonable to expect gigabytes of memory usage. It may be possible to optimize the memory usage either in the model or in TF, but that would be outside the scope of this repo.

I have also confirmed your observation that memory consumption remains stable after the peak.
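
Since the spike happens on the first Predict call and memory stays stable afterwards, one hedged mitigation (a sketch, not something proposed in this thread) is to warm up each pooled engine once at startup, so that the peak occurs before the App Service starts serving traffic:

    // Hedged warm-up sketch: run one dummy prediction per pooled engine at startup so the
    // TensorFlow allocation spike happens before real requests (and health probes) arrive.
    // "engines" is the list returned by LoadPool; "warmupInput" is a placeholder TSrc instance.
    foreach (var engine in engines)
    {
        _ = engine.Predict(warmupInput);
    }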

Please let me know if you have any further questions or concerns.

ddobric commented 3 years ago

Hi Harish,

thank you so much for your valuable feedback. I fully agree on the lazy-load behaviour, which is acceptable. You have also confirmed that TF is responsible for the memory consumption and not ML.NET directly, which is what I expected. However, in the end we are talking about 500 MB and not 500 KB. 🤔 That is very strange behaviour and not really acceptable, since this is the RAM required per user request.

Damir

harishsk commented 3 years ago

Sorry, that was a typo on my end. The spikes in memory usage that Visual Studio shows are all in MB (not in KB as I wrote above).

This is what I see:

[Screenshot: Visual Studio memory profiler view of the first Predict call]

As you can see above, the 500+MB memory increase on the first Predict call is almost all coming from the two calls to TF_SessionRun. That seems to be the memory required by Tensorflow to execute inferencing on those models.

ddobric commented 3 years ago

@harishsk thanks for your feedback. Now, I assume we have the same understanding of the behaviour.

  1. That seems to be the memory required by Tensorflow to execute inferencing on those models.
  2. the 500+MB memory increase

I hope we also agree that 500MB per prediction engine is too much?

We are using this approach for web applications and are also considering mobile devices. The latter is pending, with a dependency on ML.NET mobile support.

The high memory consumption is an extremely limiting factor. What can be done to optimize this?

Damir

harishsk commented 3 years ago

When I add memory traces after each of the calls involved, this is what I see:

Memory before Model.Load: 0.0342 GB
Memory after Model.Load: 0.2572 GB
Memory before CreatePredictionEngine: 0.2572 GB
Memory after CreatePredictionEngine: 0.2698 GB
Memory before calling Predict : 0.2698 GB
Memory after calling Predict : 0.6847 GB
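
The exact measurement method is not stated; a hedged sketch of one way to produce similar traces around each step, using the process working set (the TensorFlow allocations are unmanaged, so GC.GetTotalMemory would not show them):

    // Hedged sketch (requires using System.Diagnostics): print the process working set
    // (managed + unmanaged) around each step. TSrc/TDest and "input" are placeholders.
    static double WorkingSetGb() =>
        Process.GetCurrentProcess().WorkingSet64 / (1024.0 * 1024.0 * 1024.0);

    Console.WriteLine($"Memory before Model.Load: {WorkingSetGb():F4} GB");
    var model = mlContext.Model.Load(modelFullPathName, out _);
    Console.WriteLine($"Memory after Model.Load: {WorkingSetGb():F4} GB");

    Console.WriteLine($"Memory before CreatePredictionEngine: {WorkingSetGb():F4} GB");
    var engine = mlContext.Model.CreatePredictionEngine<TSrc, TDest>(model);
    Console.WriteLine($"Memory after CreatePredictionEngine: {WorkingSetGb():F4} GB");

    Console.WriteLine($"Memory before calling Predict: {WorkingSetGb():F4} GB");
    var prediction = engine.Predict(input);
    Console.WriteLine($"Memory after calling Predict: {WorkingSetGb():F4} GB");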

So roughly speaking

Memory for model load : 223MB
Memory for CreatePredictionEngine: 12.6MB
Memory for Predict: 414.9MB

The bulk of the memory is still consumed by the combination of loading the model and running prediction on it. Both of those are functions of the size and functionality of the model.

I am afraid I am unable to advise you on how best to optimize your TensorFlowModel to lower the memory used.

harishsk commented 3 years ago

I hope that answers your questions. Please feel free to reopen the issue if you have more questions.

ddobric commented 3 years ago

To me, it is OK to close the issue if we cannot improve it. But we have to conclude that the high memory consumption of TensorFlow makes hosting these models on the web only theoretically possible. Because ML.NET wraps TensorFlow (in this specific case), the issue cannot be fixed inside ML.NET; it has to be fixed in TensorFlow.

How about tagging it as "Unresolved"?