aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Update PyTorch Inference toolkit to log telemetry metrics #131

Closed sachanub closed 1 year ago

sachanub commented 1 year ago

This PR updates the PyTorch Inference Toolkit to log telemetry metrics in case of failure.

TorchServe PR: https://github.com/pytorch/serve/pull/1974

Telemetry Design Doc: https://quip-amazon.com/8hW4AJQQu4Na/Inference-Telemetry-Failure-rate-low-level-design

Please refer to this document for more details about the PR: https://quip-amazon.com/VMoqANQATkNH/Code-Changes-in-TorchServe-and-PyTorch-Inference-Toolkit-to-Log-Telemetry-Metrics

Example code block for logging telemetry metrics:

loggerTelemetryMetrics.info("ModelServerError.Count:1|#TorchServe:{},{}:-1", ConfigManager.getInstance().getVersion(), "error");
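The line above is the TorchServe (Java) side. On the toolkit side, which is Python, an equivalent metric line could be emitted with the standard logging module. This is only a minimal sketch, not the actual toolkit code; the logger name and helper function are hypothetical:

```python
import logging

# Hypothetical logger name; the real telemetry logger in the toolkit may differ.
telemetry_logger = logging.getLogger("telemetry_metrics")

def log_model_server_error(version: str, error_type: str = "error") -> str:
    """Emit a StatsD-style failure metric in the format shown above."""
    line = "ModelServerError.Count:1|#TorchServe:{},{}:-1".format(version, error_type)
    telemetry_logger.info(line)
    return line
```

For example, `log_model_server_error("0.6.1")` produces the line `ModelServerError.Count:1|#TorchServe:0.6.1,error:-1`.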

Sample output:

ModelServerError.Count:1|#TorchServe:0.6.1,error:-1
ModelServerError.Count:1|#TorchServe:0.6.1,error:-1
ModelServerError.Count:1|#TorchServe:0.6.1,error:-1
ModelServerError.Count:1|#TorchServe:0.6.1,error:-1
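For downstream processing (e.g., a log filter feeding CloudWatch), each emitted line can be split back into its fields. A hypothetical parsing sketch, assuming only the format shown in the sample output:

```python
def parse_telemetry_line(line: str) -> dict:
    """Parse 'ModelServerError.Count:1|#TorchServe:0.6.1,error:-1' into fields."""
    metric_part, tag_part = line.split("|#", 1)      # metric vs. tag section
    name, count = metric_part.rsplit(":", 1)         # "ModelServerError.Count", "1"
    server, rest = tag_part.split(":", 1)            # "TorchServe", "0.6.1,error:-1"
    fields, marker = rest.rsplit(":", 1)             # "0.6.1,error", "-1"
    version, error_type = fields.split(",", 1)
    return {
        "metric": name,
        "count": int(count),
        "server": server,
        "version": version,
        "error_type": error_type,
        "marker": marker,
    }
```

For example, `parse_telemetry_line("ModelServerError.Count:1|#TorchServe:0.6.1,error:-1")` yields `version` `"0.6.1"` and `error_type` `"error"`.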

Attachments: filtered_telemetry.log, snippet_ts_log.log, telemetry_cloudwatch_log