Long Model Loading times in Multimodel Server

Describe the bug

According to the SageMaker Multimodel Server documentation, the server caches 'frequently' used models in memory (RAM, as I understand it) so that a model does not have to be reloaded on every request, which improves response time. My first question: what does 'frequently' mean?

If I query the same model repeatedly with a delay of 30s between the invoke_endpoint calls, the server appears to load the model into memory again on every call. This leads to response times of about 3s instead of the usual ~0.5s I get when calling the model at intervals shorter than 30s.

To reproduce

Deploy a SageMaker Multimodel Server endpoint using boto3.

Create a SageMaker runtime client (rt_client) using boto3 and execute the following code:

import time

# rt_client, payload and self.endpoint_name are defined elsewhere
for i in range(20):
    start = time.time()
    response = rt_client.invoke_endpoint(
        EndpointName=self.endpoint_name,
        ContentType='application/x-npy',
        TargetModel='model_store/custom_model_1.tar.gz',  # Always the same model
        Body=payload,  # Byte-encoded numpy array
    )
    end = time.time()
    response_time = end - start
    print(f'Request took {response_time}s.')
    time.sleep(30)  # Idle time between calls

Expected behavior

The first call is slow (about 3s), and the following 19 calls fall in the expected ~0.5s range, which is the time it takes to call the endpoint when the model is already loaded.

Once I set the time.sleep() argument below 30s, e.g. to 20s, most calls are as fast as expected.
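To narrow down where between 20s and 30s the slowdown starts, the idle interval could be swept systematically. The following is only a minimal sketch: measure_latencies and its parameters are hypothetical names introduced here, and the invoke callable is assumed to wrap the invoke_endpoint call from the reproduction code.

```python
import time
from typing import Callable, Dict, Iterable, List


def measure_latencies(
    invoke: Callable[[], object],
    idle_seconds: Iterable[float],
    calls_per_interval: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[float, List[float]]:
    """Time `invoke` after waiting `idle` seconds, for each idle interval.

    `invoke` is a hypothetical zero-argument callable, for example a lambda
    wrapping rt_client.invoke_endpoint(...). `sleep` is injectable so the
    helper can be exercised without real waiting.
    """
    results: Dict[float, List[float]] = {}
    for idle in idle_seconds:
        invoke()  # untimed warm-up call so the model is loaded beforehand
        timings: List[float] = []
        for _ in range(calls_per_interval):
            sleep(idle)  # idle period under test
            start = time.perf_counter()
            invoke()
            timings.append(time.perf_counter() - start)
        results[idle] = timings
    return results
```

Calling it with idle intervals such as [20, 22, 24, 26, 28, 30] would show at which interval the ~3s responses begin to dominate.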

Is there any way to influence when a model is unloaded? I would expect the model to stay in memory as long as that memory is not needed for loading other, more frequently used models. However, this does not seem to be the case, as each call with a 30s delay takes the full 3s.

Screenshots or logs

time.sleep(30):

     Call: 0 of 20 with 4 samples took: 2.847299098968506s.
     Call: 1 of 20 with 4 samples took: 3.017570734024048s.
     Call: 2 of 20 with 4 samples took: 2.866020917892456s.
     Call: 3 of 20 with 4 samples took: 2.888610363006592s.
     Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
     Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
     Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
     Call: 7 of 20 with 4 samples took: 2.912917375564575s.
     Call: 8 of 20 with 4 samples took: 2.866114854812622s.
     Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
     Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
     Call: 11 of 20 with 4 samples took: 2.79472017288208s.
     Call: 12 of 20 with 4 samples took: 2.992703437805176s.
     Call: 13 of 20 with 4 samples took: 2.954014301300049s.
     Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
     Call: 15 of 20 with 4 samples took: 2.928661346435547s.
     Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
     Call: 17 of 20 with 4 samples took: 2.922405481338501s.
     Call: 18 of 20 with 4 samples took: 2.982257843017578s.
     Call: 19 of 20 with 4 samples took: 2.8227620124816895s.

time.sleep(20):

     Call: 0 of 20 with 4 samples took: 3.329136848449707s.
     Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
     Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
     Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
     Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
     Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
     Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
     Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
     Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
     Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
     Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
     Call: 11 of 20 with 4 samples took: 0.572312593460083s.
     Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
     Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
     Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
     Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
     Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
     Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
     Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
     Call: 19 of 20 with 4 samples took: 0.579679012298584s.
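
For comparing runs, the logs above can be summarized with a small script. This is a sketch only: summarize is a hypothetical helper, and the 2.0s cutoff for treating a call as a model reload is my assumption, chosen because it lies between the ~0.5s and ~3s ranges seen in the logs.

```python
import re
from statistics import median
from typing import Dict, List, Union

# Matches lines such as "Call: 13 of 20 with 4 samples took: 2.9460437297821045s."
LINE_RE = re.compile(r"Call: (\d+) of \d+ with \d+ samples took: ([\d.]+)s\.")


def summarize(
    log_text: str, cold_threshold_s: float = 2.0
) -> Dict[str, Union[int, float, List[int], None]]:
    """Count calls and list the call numbers at or above the assumed cutoff."""
    entries = [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(log_text)]
    latencies = [latency for _, latency in entries]
    return {
        "calls": len(entries),
        # Call numbers whose latency suggests the model was loaded again
        "cold_calls": [n for n, latency in entries if latency >= cold_threshold_s],
        "median_s": median(latencies) if latencies else None,
    }
```

With this cutoff, the time.sleep(20) log flags only calls 0 and 13, while the time.sleep(30) log flags all 20 calls.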

System information

Custom Docker Image:
- Inference Framework: SkLearn
- Sagemaker Inference Toolkit: 1.6.1
- Multimodel Server: 1.1.8
- Python version: 3.9
- Processing unit type: CPU (ml.t2.medium)

aws / sagemaker-inference-toolkit

Long Model Loading times in Multimodel Server #113