The script uses the functionality added in #2487 to get the max RSS memory usage from the `pytorch_inference` process. The benchmark sends batches of inferences to be evaluated, each followed by a get memory usage request, and prints a summary of memory usage vs batch size.
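As a rough illustration of that loop (not the script itself), the sketch below shows the shape of the benchmark. The `send_batch` and `get_max_rss` helpers are placeholders for however the script actually talks to `pytorch_inference` over its pipes; they are assumptions, not real APIs.

```python
def send_batch(batch_size: int) -> None:
    """Placeholder: submit `batch_size` inference requests and wait for the results."""
    raise NotImplementedError

def get_max_rss() -> int:
    """Placeholder: send a get memory usage control message (the #2487 functionality)
    and return the reported max RSS."""
    raise NotImplementedError

def run_benchmark(batch_sizes=(1, 2, 4, 8, 16)) -> None:
    results = []
    for batch_size in batch_sizes:
        send_batch(batch_size)
        results.append((batch_size, get_max_rss()))
    # Summary of memory usage vs batch size.
    print(f"{'batch size':>10} {'max RSS':>12}")
    for batch_size, max_rss in results:
        print(f"{batch_size:>10} {max_rss:>12}")
```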
The only complication is that `pytorch_inference` handles control messages on the main thread while model evaluation is off-loaded to a thread pool. To ensure that the inference requests and the get memory usage request are processed sequentially and in order, I added a `--useImmediateExecutor` flag to `pytorch_inference`. When set, the immediate executor is used to process inference requests instead of the thread pool. This option should only be used for benchmarking.
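For illustration, this is roughly how a benchmark driver could start the process with the new flag. Everything other than `--useImmediateExecutor` is a placeholder; the real process takes its own set of options (model restore file, input/output pipes, etc.).

```python
import subprocess

# Launch pytorch_inference so that inference requests run on the main thread,
# in the order they arrive, which keeps them sequential with the get memory
# usage control message.
proc = subprocess.Popen(
    [
        "./pytorch_inference",
        "--useImmediateExecutor",  # flag added in this PR; benchmarking only
        # ... remaining process options go here (placeholders omitted)
    ]
)
```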