aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Cannot change the CloudWatch log level #83

Open oonisim opened 4 years ago

oonisim commented 4 years ago

Describe the bug The PyTorch SageMaker endpoint CloudWatch log level is fixed at INFO and cannot be changed without building a BYO container.

Hence every request, including /ping health checks in addition to /invocations, generates log entries that clutter the CloudWatch log stream, making it difficult to go directly to the errors when troubleshooting. In my understanding, this also incurs additional CloudWatch cost.

2020-08-28 00:31:16,598 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:54100 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:31:21,598 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:54100 "GET /ping HTTP/1.1" 200 0
... repetition of the INFO entries for /ping.

AWS support case 7309023801 was opened, and the response indicated that the log level cannot be changed; we would need to build our own container to control it.


The SageMaker PyTorch inference serving container module is built on top of the sagemaker-inference module, so to change the log4j configuration we need to change it in the sagemaker inference toolkit itself. So I have built two wheel files for my custom Docker container and installed both of them in the Dockerfile:

=========
#RUN pip install --no-cache-dir "sagemaker-pytorch-inference<2"
COPY sagemaker_inference-1.5.3.dev0-py2.py3-none-any.whl /tmp
RUN pip install /tmp/sagemaker_inference-1.5.3.dev0-py2.py3-none-any.whl

COPY sagemaker_pytorch_inference-1.5.1-py2.py3-none-any.whl /tmp
RUN pip install /tmp/sagemaker_pytorch_inference-1.5.1-py2.py3-none-any.whl
=========
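As an aside, instead of rebuilding the wheels, overwriting the bundled log4j.properties inside the installed sagemaker_inference package at image build time should achieve the same effect. This is only a rough sketch; the etc/log4j.properties location inside the package is an assumption about the toolkit layout at the time:

----
# patch_log4j.py -- hypothetical build-time helper, run after "pip install sagemaker-inference".
# Assumes the toolkit ships its default config as <package dir>/etc/log4j.properties.
import os
import shutil

import sagemaker_inference

toolkit_etc = os.path.join(os.path.dirname(sagemaker_inference.__file__), "etc")
target = os.path.join(toolkit_etc, "log4j.properties")

# my-log4j.properties is the customized config shown below (ACCESS_LOG = ERROR, etc.)
shutil.copyfile("my-log4j.properties", target)
print("Replaced", target)
----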

Then I built and pushed the Docker image to my ECR repository and used it to deploy the model to an endpoint. After that I only saw logs related to the model server, for example:

=========
EVENTS  1598507953058   2020-08-27 05:59:07,670 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 533 1598507947905
EVENTS  1598508601335   2020-08-27 06:09:55,987 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 62  1598508596097
EVENTS  1598508607396   2020-08-27 06:10:01,440 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 292 1598508602099
=========

The ping health check logs are all gone. Please note that the logs for the invocation requests handled by the model server are logged at the INFO level, so I would suggest keeping the log4j settings as below:

=========
log4j.logger.com.amazonaws.ml.mms = INFO, mms_log
log4j.logger.ACCESS_LOG = ERROR, access_log
log4j.logger.MMS_METRICS = WARN, mms_metrics
log4j.logger.MODEL_METRICS = WARN, model_metrics
log4j.logger.MODEL_LOG = WARN, model_log
=========
Now let's come back to the issue of changing the logging level. The environment variable you used, 'SAGEMAKER_CONTAINER_LOG_LEVEL', is the correct one. From the SageMaker SDK source code [2][3], you can see that the service uses the same value when it is set in the Python SDK. I used our example notebook and passed the parameter to the PyTorchModel constructor as below:

----
sagemaker_model = PyTorchModel(
    model_data='s3://<mybucket>/pytorch-training-2020-08-26-00-44-33-303/output/model.tar.gz',
    role=role,
    container_log_level=40,
    py_version='py3',
    entry_point='mnist.py')
----
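For reference, container_log_level takes the standard Python logging level integers, so the 40 above corresponds to logging.ERROR; a quick sanity check:

----
import logging

# container_log_level uses the standard logging module constants:
# DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50
assert logging.ERROR == 40

# So the example above could equivalently pass container_log_level=logging.ERROR.
----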

I can confirm the environment variable set for the model is:
Key                             Value
SAGEMAKER_CONTAINER_LOG_LEVEL   40

However, this value only applies to logs generated by Python logging (i.e. the entry-point script). I have tested this with a training job and can see that the logs only show messages at the configured level (DEBUG or ERROR). I modified the entry-point script to print the environment variables and to add test log lines with logger.info and logger.error in model_fn as below:

----
import logging
import os

import torch

logger = logging.getLogger(__name__)

def model_fn(model_dir):
    # Test log lines: only logger.error should appear when the log level is 40 (ERROR).
    logger.info("Run model function")
    print(os.environ)
    logger.error("display error message")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.DataParallel(Net())  # Net is defined elsewhere in mnist.py
    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
        model.load_state_dict(torch.load(f))
    return model.to(device)
----

I can confirm that the logger.info message does not show up in CloudWatch, while the 'display error message' logged by logger.error does. This means the container log level is applied correctly to the entry-point script.
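If an entry-point script configures its own logger, it can honour the same setting; a minimal sketch, assuming the value is exposed to the script as the SAGEMAKER_CONTAINER_LOG_LEVEL environment variable shown above:

----
import logging
import os

# SAGEMAKER_CONTAINER_LOG_LEVEL carries the integer passed as container_log_level
# (e.g. "40" for ERROR); fall back to INFO when it is not set.
log_level = int(os.environ.get("SAGEMAKER_CONTAINER_LOG_LEVEL", logging.INFO))

logging.basicConfig(level=log_level)
logger = logging.getLogger(__name__)
----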

In the PyTorch serving containers we are using 'mxnet-model-server', which handles the /ping and /invocations HTTP calls for you [8]. The model server is not controlled by the container logging level. The endpoint log entries shown as [INFO] are defined in the log4j.properties here [5]. For example, in the endpoint logs you may see the entry below:

2020-08-26 03:27:10,540 [INFO ] pool-1-thread-2 ACCESS_LOG - /10.32.0.2:52222 "GET /ping HTTP/1.1" 200 22

This is defined in the "log4j.logger.ACCESS_LOG = INFO, access_log" section in [5]. 

Therefore, to change the log levels for the serving module you need to build your own PyTorch image based on our GitHub repo [6]. This will require you to make the necessary changes to our Dockerfile as well as install additional modules.

To reproduce Deploy a PyTorch model with the Python log level set to logging.ERROR via the SageMaker SDK, then check the CloudWatch log group /aws/sagemaker/Endpoints/.

import logging

logging.basicConfig()
logging.getLogger().setLevel(logging.ERROR)
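For completeness, a minimal deployment sketch of the reproduction (the role, bucket path, and framework version below are placeholders, not values from the original report; the framework version only mirrors the PyTorch 1.5.1 seen in the worker log):

----
import logging

from sagemaker.pytorch import PyTorchModel

role = '<execution-role-arn>'  # placeholder

model = PyTorchModel(
    model_data='s3://<mybucket>/<training-job>/output/model.tar.gz',  # placeholder
    role=role,
    framework_version='1.5.1',
    py_version='py3',
    container_log_level=logging.ERROR,  # only affects the Python entry-point logger
    entry_point='mnist.py')

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')
# The ACCESS_LOG /ping entries still appear at INFO in CloudWatch.
----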

Expected behavior The log level configuration is respected and only ERROR-level messages are logged to CloudWatch.

System information SageMaker endpoint in us-east-1.

Additional context

Endpoint startup messages in CloudWatch.

Warning: Calling MMS with mxnet-model-server. Please move to multi-model-server.
2020-08-28 00:29:08,131 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /opt/conda/lib/python3.6/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of CPUs: 1
Max heap size: 3806 M
Python executable: /opt/conda/bin/python
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2020-08-28 00:29:08,222 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-model
2020-08-28 00:29:08,320 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_pytorch_serving_container.handler_service --model-path /.sagemaker/mms/models/model --model-name model --preload-model false --tmp-dir /home/model-server/tmp
2020-08-28 00:29:08,321 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2020-08-28 00:29:08,321 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 43
2020-08-28 00:29:08,321 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2020-08-28 00:29:08,321 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.10
2020-08-28 00:29:08,322 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model model loaded.
2020-08-28 00:29:08,328 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-08-28 00:29:08,341 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2020-08-28 00:29:08,406 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2020-08-28 00:29:08,408 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2020-08-28 00:29:08,409 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2020-08-28 00:29:08,972 [INFO ] W-9000-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - PyTorch version 1.5.1 available.
2020-08-28 00:29:42,837 [INFO ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 34388
2020-08-28 00:29:42,839 [WARN ] W-9000-model com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-model-1
2020-08-28 00:29:55,821 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 19
2020-08-28 00:30:00,606 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 1
2020-08-28 00:30:05,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:30:10,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:30:15,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:30:20,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:30:25,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
2020-08-28 00:30:30,605 [INFO ] pool-1-thread-3 ACCESS_LOG - /127.0.0.1:34704 "GET /ping HTTP/1.1" 200 0
...
chuyang-deng commented 3 years ago

related: https://github.com/aws/sagemaker-inference-toolkit/issues/70

icywang86rui commented 3 years ago

I think the easiest way of implementing this would be to allow the customer to provide their own log4j config file through the dependencies arg here. The file should follow a naming convention, something like ./override/etc/log4j.properties. On the container side we would just use the custom override config file if it exists here.
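A rough sketch of what that container-side check could look like, assuming the toolkit's start-up code can hand the model server an alternative log4j file (all paths and the helper name below are hypothetical):

----
import os

# Hypothetical paths for illustration only.
DEFAULT_LOG4J = "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/etc/log4j.properties"
OVERRIDE_LOG4J = "/opt/ml/model/code/override/etc/log4j.properties"

def resolve_log_config():
    """Return the customer-supplied log4j.properties shipped via the
    `dependencies` arg if present, otherwise the toolkit default."""
    return OVERRIDE_LOG4J if os.path.exists(OVERRIDE_LOG4J) else DEFAULT_LOG4J

# The server start-up command could then point the model server at it, e.g.:
#   mxnet-model-server --start ... --log-config <resolve_log_config()>
----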

j-adamczyk commented 1 year ago

+1, this would be very useful

j-adamczyk commented 1 year ago

Any news on this? This is very problematic for anything using PySpark (both training and inference), which outputs a lot of logs, 99% of which are totally useless.

is-abhi commented 6 months ago

+1, any update?