aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Long Model Loading times in Multimodel Server #113

Open AlexRaschl opened 1 year ago

AlexRaschl commented 1 year ago

Describe the bug

According to the SageMaker multi-model server documentation, the server caches 'frequently' used models in memory (to my understanding, in RAM) to reduce response time by avoiding repeated model loads. My first question is: what does 'frequently' mean?

If I query the same model repeatedly with a delay of 30s between the invoke_endpoint calls, the server seems to load the model into memory again on every call, leading to response times of about 3s instead of the usual ~0.5s I get when calling the model at intervals shorter than 30s.

To reproduce

Expected behavior

The first call is slow (about 3s) and the following 19 calls fall in the expected ~0.5s range, which is the time it takes to call the endpoint when the model is already loaded.

Once I set the time.sleep() argument lower than 30s, e.g. to 20s, most calls are as fast as expected.
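The original reproduction script is not included above. The following is only a minimal sketch of a timing loop of the kind described, not the author's code. The names `time_calls` and `make_invoker`, and the endpoint name, model name, and payload in the usage comment, are hypothetical placeholders; `invoke_endpoint` with `TargetModel` is the boto3 `sagemaker-runtime` call for multi-model endpoints.

```python
import time


def time_calls(invoke, n_calls=20, delay_s=30.0, n_samples=4,
               sleep=time.sleep, clock=time.perf_counter):
    """Call invoke() n_calls times, pausing delay_s between calls.

    Returns the list of per-call durations in seconds. `sleep` and `clock`
    are injectable so the loop can be exercised without real waiting.
    """
    durations = []
    for i in range(n_calls):
        start = clock()
        invoke()
        elapsed = clock() - start
        durations.append(elapsed)
        print(f"Call: {i} of {n_calls} with {n_samples} samples took: {elapsed}s.")
        if i < n_calls - 1:
            sleep(delay_s)  # the delay whose value (20s vs. 30s) is under test
    return durations


def make_invoker(endpoint_name, target_model, payload,
                 content_type="application/json"):
    """Build a zero-argument callable that invokes one model on an endpoint."""
    import boto3  # imported lazily so time_calls can be used without AWS access

    client = boto3.client("sagemaker-runtime")

    def invoke():
        response = client.invoke_endpoint(
            EndpointName=endpoint_name,
            TargetModel=target_model,  # selects the model on a multi-model endpoint
            ContentType=content_type,
            Body=payload,
        )
        return response["Body"].read()

    return invoke


# Hypothetical usage (placeholder names, requires AWS credentials):
# invoke = make_invoker("my-endpoint", "my-model.tar.gz", b'{"instances": []}')
# time_calls(invoke, n_calls=20, delay_s=30.0)
```

Keeping the invocation separate from the loop makes it easy to swap `delay_s` between 20 and 30 while leaving everything else unchanged.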

Is there any way to influence when models are unloaded? I would expect the model to stay in memory as long as that memory is not needed for loading other, more frequently used models. However, this does not seem to be the case, as each call with a 30s delay takes the full 3s.

Screenshots or logs

With time.sleep(30):

     Call: 0 of 20 with 4 samples took: 2.847299098968506s.
     Call: 1 of 20 with 4 samples took: 3.017570734024048s.
     Call: 2 of 20 with 4 samples took: 2.866020917892456s.
     Call: 3 of 20 with 4 samples took: 2.888610363006592s.
     Call: 4 of 20 with 4 samples took: 3.0125389099121094s.
     Call: 5 of 20 with 4 samples took: 2.9569602012634277s.
     Call: 6 of 20 with 4 samples took: 2.8126561641693115s.
     Call: 7 of 20 with 4 samples took: 2.912917375564575s.
     Call: 8 of 20 with 4 samples took: 2.866114854812622s.
     Call: 9 of 20 with 4 samples took: 2.9781384468078613s.
     Call: 10 of 20 with 4 samples took: 3.4418649673461914s.
     Call: 11 of 20 with 4 samples took: 2.79472017288208s.
     Call: 12 of 20 with 4 samples took: 2.992703437805176s.
     Call: 13 of 20 with 4 samples took: 2.954014301300049s.
     Call: 14 of 20 with 4 samples took: 2.9481523036956787s.
     Call: 15 of 20 with 4 samples took: 2.928661346435547s.
     Call: 16 of 20 with 4 samples took: 2.8345978260040283s.
     Call: 17 of 20 with 4 samples took: 2.922405481338501s.
     Call: 18 of 20 with 4 samples took: 2.982257843017578s.
     Call: 19 of 20 with 4 samples took: 2.8227620124816895s.

With time.sleep(20):

     Call: 0 of 20 with 4 samples took: 3.329136848449707s.
     Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
     Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
     Call: 3 of 20 with 4 samples took: 0.5578911304473877s.
     Call: 4 of 20 with 4 samples took: 0.5557725429534912s.
     Call: 5 of 20 with 4 samples took: 0.5681345462799072s.
     Call: 6 of 20 with 4 samples took: 0.5488979816436768s.
     Call: 7 of 20 with 4 samples took: 0.5555169582366943s.
     Call: 8 of 20 with 4 samples took: 0.5792186260223389s.
     Call: 9 of 20 with 4 samples took: 0.9297688007354736s.
     Call: 10 of 20 with 4 samples took: 0.6043572425842285s.
     Call: 11 of 20 with 4 samples took: 0.572312593460083s.
     Call: 12 of 20 with 4 samples took: 0.5600907802581787s.
     Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
     Call: 14 of 20 with 4 samples took: 0.5780775547027588s.
     Call: 15 of 20 with 4 samples took: 0.5762953758239746s.
     Call: 16 of 20 with 4 samples took: 0.5773897171020508s.
     Call: 17 of 20 with 4 samples took: 0.5769815444946289s.
     Call: 18 of 20 with 4 samples took: 0.5663411617279053s.
     Call: 19 of 20 with 4 samples took: 0.579679012298584s.
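To put numbers on the two logs, here is a small sketch that parses lines in the format above and separates slow ("cold") calls from fast ("warm") ones. The 1.5s threshold and the function names are assumptions chosen for illustration; they simply sit between the ~0.5s and ~3s clusters seen in the logs.

```python
import re

# Matches lines such as: "Call: 13 of 20 with 4 samples took: 2.9460437297821045s."
LINE_RE = re.compile(r"Call: (\d+) of \d+ with \d+ samples took: ([0-9.]+)s\.")


def parse_calls(log_text):
    """Return a list of (call_number, seconds) tuples found in log_text."""
    return [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(log_text)]


def cold_calls(calls, threshold_s=1.5):
    """Return the call numbers whose duration is at or above threshold_s."""
    return [number for number, seconds in calls if seconds >= threshold_s]


# Four lines copied from the 20s log above.
sample = """
     Call: 0 of 20 with 4 samples took: 3.329136848449707s.
     Call: 1 of 20 with 4 samples took: 0.5629911422729492s.
     Call: 2 of 20 with 4 samples took: 0.5595850944519043s.
     Call: 13 of 20 with 4 samples took: 2.9460437297821045s.
"""

calls = parse_calls(sample)
print(cold_calls(calls))  # prints [0, 13]
```

Applied to the full logs, every call in the 30s run is above this threshold, while in the 20s run only calls 0 and 13 are.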

System information