allegroai / clearml-serving

ClearML - Model-Serving Orchestration and Repository Solution
https://clear.ml
Apache License 2.0

Triton inference server fails to load checkpointed PyTorch Ignite model #5

Closed ecm200 closed 1 year ago

ecm200 commented 3 years ago

The Triton server is now able to find the local copy of the model weight pt file and attempts to serve it, following fixes in #3.

The following error occurs when the model is served by the Triton Inference server:

Starting Task Execution:

clearml-serving - Nvidia Triton Engine Helper
ClearML results page: https://clearml-server.westeurope.cloudapp.azure.com/projects/779be4f4d83541d786eb839bb062fa93/experiments/364c73e36a454842a314169d78514034/output/log
String Triton Helper service
{'serving_id': 'b978817fa0544b94b2015b420a96f14c', 'project': 'serving', 'name': 'nvidia-triton', 'update_frequency': 10, 'metric_frequency': 1, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None}

Updating local model folder: /models
[INFO]:: URL: cub200_resnet34 Endpoint: ServingService.EndPoint(serving_url='cub200_resnet34', model_ids=['57ed24c1011346d292ecc9e797ccb47e'], model_project=None, model_name=None, model_tags=None, model_config_blob='\n            platform: "pytorch_libtorch"\n            input [\n                {\n                    name: "input_layer"\n                    data_type: TYPE_FP32\n                    dims: [ 3, 224, 224 ]\n                }\n            ]\n            output [\n                {\n                    name: "fc"\n                    data_type: TYPE_FP32\n                    dims: [ 200 ]\n                }\n            ]\n        ', max_num_revisions=None, versions=OrderedDict())
[INFO]:: Model ID: 57ed24c1011346d292ecc9e797ccb47e Version: 1
[INFO]:: Model ID: 57ed24c1011346d292ecc9e797ccb47e Model URL: azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,447 - clearml.storage - INFO - Downloading: 5.00MB / 81.72MB @ 18.80MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,730 - clearml.storage - INFO - Downloading: 13.00MB / 81.72MB @ 28.29MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,741 - clearml.storage - INFO - Downloading: 21.00MB / 81.72MB @ 684.91MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,760 - clearml.storage - INFO - Downloading: 29.00MB / 81.72MB @ 426.19MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,791 - clearml.storage - INFO - Downloading: 37.00MB / 81.72MB @ 258.86MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,806 - clearml.storage - INFO - Downloading: 45.00MB / 81.72MB @ 535.17MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,907 - clearml.storage - INFO - Downloading: 53.00MB / 81.72MB @ 79.03MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,963 - clearml.storage - INFO - Downloading: 61.72MB / 81.72MB @ 155.64MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,968 - clearml.storage - INFO - Downloading: 69.72MB / 81.72MB @ 1502.19MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,979 - clearml.storage - INFO - Downloading: 77.72MB / 81.72MB @ 790.76MBs from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt
2021-06-10 15:20:54,985 - clearml.storage - INFO - Downloaded 81.72 MB successfully from azure://clearmllibrary/artefacts/Caltech Birds%2FTraining/TRAIN [Network%3A resnet34, Library%3A torchvision] Ignite Train PyTorch CNN on CUB200.8611ada5be6f4bb6ba09cf730ecd2253/models/cub200_resnet34_ignite_best_model_0.pt , saved to /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] Local path to the model: /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
Update model v1 in /models/cub200_resnet34/1
[INFO] Target Path:: /models/cub200_resnet34/1/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] Local Path:: /clearml_agent_cache/storage_manager/global/e38f6052e6b887337635fc2821a6b5d4.cub200_resnet34_ignite_best_model_0.pt
[INFO] New Target Path:: /models/cub200_resnet34/1/model.pt
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=600.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
I0610 15:20:55.182775 671 metrics.cc:221] Collecting metrics for GPU 0: Tesla P40
I0610 15:20:55.498654 671 libtorch.cc:940] TRITONBACKEND_Initialize: pytorch
I0610 15:20:55.498688 671 libtorch.cc:950] Triton TRITONBACKEND API version: 1.0
I0610 15:20:55.498699 671 libtorch.cc:956] 'pytorch' TRITONBACKEND API version: 1.0
2021-06-10 15:20:55.688775: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0610 15:20:55.729429 671 tensorflow.cc:1880] TRITONBACKEND_Initialize: tensorflow
I0610 15:20:55.729458 671 tensorflow.cc:1890] Triton TRITONBACKEND API version: 1.0
I0610 15:20:55.729464 671 tensorflow.cc:1896] 'tensorflow' TRITONBACKEND API version: 1.0
I0610 15:20:55.729473 671 tensorflow.cc:1920] backend configuration:
{}
I0610 15:20:55.731061 671 onnxruntime.cc:1728] TRITONBACKEND_Initialize: onnxruntime
I0610 15:20:55.731085 671 onnxruntime.cc:1738] Triton TRITONBACKEND API version: 1.0
I0610 15:20:55.731095 671 onnxruntime.cc:1744] 'onnxruntime' TRITONBACKEND API version: 1.0
I0610 15:20:55.756821 671 openvino.cc:1166] TRITONBACKEND_Initialize: openvino
I0610 15:20:55.756848 671 openvino.cc:1176] Triton TRITONBACKEND API version: 1.0
I0610 15:20:55.756854 671 openvino.cc:1182] 'openvino' TRITONBACKEND API version: 1.0
I0610 15:20:56.081773 671 pinned_memory_manager.cc:205] Pinned memory pool is created at '0x7f229c000000' with size 268435456
I0610 15:20:56.082099 671 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I0610 15:20:56.083854 671 model_repository_manager.cc:1065] loading: cub200_resnet34:1
I0610 15:20:56.184287 671 libtorch.cc:989] TRITONBACKEND_ModelInitialize: cub200_resnet34 (version 1)
I0610 15:20:56.185272 671 libtorch.cc:1030] TRITONBACKEND_ModelInstanceInitialize: cub200_resnet34 (device 0)

1623338462128 ecm-clearml-compute-gpu-002:gpuall DEBUG I0610 15:20:59.633139 671 libtorch.cc:1063] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0610 15:20:59.633184 671 libtorch.cc:1012] TRITONBACKEND_ModelFinalize: delete model state
E0610 15:20:59.633206 671 model_repository_manager.cc:1242] failed to load 'cub200_resnet34' version 1: Internal: failed to load model 'cub200_resnet34': [enforce fail at inline_container.cc:227] . file not found: archive/constants.pkl
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x68 (0x7f23c6279498 in /opt/tritonserver/backends/pytorch/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::getRecordID(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xda (0x7f23a1a23d4a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #2: caffe2::serialize::PyTorchStreamReader::getRecord(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x38 (0x7f23a1a23da8 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #3: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0xab (0x7f23a323508b in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #4: <unknown function> + 0x3c035e5 (0x7f23a32355e5 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #5: <unknown function> + 0x3c05fd0 (0x7f23a3237fd0 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #6: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1ab (0x7f23a32391eb in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #7: torch::jit::load(std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc2 (0x7f23a323b332 in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #8: torch::jit::load(std::istream&, c10::optional<c10::Device>) + 0x6a (0x7f23a323b41a in /opt/tritonserver/backends/pytorch/libtorch_cpu.so)
frame #9: <unknown function> + 0x104a6 (0x7f23c67d44a6 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #10: <unknown function> + 0x12ac4 (0x7f23c67d6ac4 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #11: <unknown function> + 0x13772 (0x7f23c67d7772 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #12: TRITONBACKEND_ModelInstanceInitialize + 0x374 (0x7f23c67d7b34 in /opt/tritonserver/backends/pytorch/libtriton_pytorch.so)
frame #13: <unknown function> + 0x2f8a99 (0x7f24104a8a99 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #14: <unknown function> + 0x2f927c (0x7f24104a927c in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #15: <unknown function> + 0x2f77ec (0x7f24104a77ec in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #16: <unknown function> + 0x183c00 (0x7f2410333c00 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #17: <unknown function> + 0x191581 (0x7f2410341581 in /opt/tritonserver/bin/../lib/libtritonserver.so)
frame #18: <unknown function> + 0xd6d84 (0x7f240fcead84 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #19: <unknown function> + 0x9609 (0x7f2410185609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #20: clone + 0x43 (0x7f240f9d8293 in /lib/x86_64-linux-gnu/libc.so.6)

I0610 15:20:59.633540 671 server.cc:500] 
+-----------------...

Originally posted by @ecm200 in https://github.com/allegroai/clearml-serving/issues/3#issuecomment-858868722

ecm200 commented 3 years ago

@bmartinn in #3 refers to an issue raised with PyTorch, which looks like it might be the source of this problem.

pytorch/pytorch#47917

Investigating now and will report back.

ecm200 commented 3 years ago

Solution found to Triton Inference Server PyTorch problem

The issue here was the way the model was being saved.

The input for the Triton Inference Server is a PyTorch model that has been saved using the TorchScript export utility. I had been using a checkpoint file saved by the PyTorch Ignite checkpointer, which was just a set of weights (a state_dict) that could be loaded into a model.

Creating a Traced Torchscript version of the PyTorch model from checkpoint weights

For the Triton Inference Server to load the model, it first needs to be converted into TorchScript. This is done by building your model and then loading the weights into it, just as if you were going to perform inference, as follows:

  1. Build your PyTorch model object as before.

  2. Load weights using:

model.load_state_dict(torch.load(f=checkpoint_file))

  3. Trace the model to convert it to TorchScript and save it to disk.

First, generate an example input tensor for the model; this can be an array of random numbers of the correct input size, or a batch from a dataloader, it does not matter. Then use the torch.jit.trace method to create a traced module of the model. This traced module can still take inputs and produce model outputs, but it now executes as TorchScript through the libtorch C++ runtime rather than the Python torch package. The save() method of the traced module can then be used to write the model to disk. It is this file that you provide to the Triton Inference Server for deployment.

# Get a validation batch
X, y = next(iter(val_loader))
# Set the model into eval mode
model.eval()
# Push input images to gpu
X = X.to(device)
# Trace the model
traced_module = torch.jit.trace(model, (X))
# Save the traced module to disk, ready for deployment
traced_module.save('model.pt')
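
As a quick sanity check (optional, and assuming the model and X variables from the snippet above are still in scope), the saved file can be reloaded with torch.jit.load and its outputs compared against the eager-mode model:

import torch

# Reload the TorchScript module that was just saved
loaded_module = torch.jit.load('model.pt')
loaded_module.eval()

with torch.no_grad():
    y_eager = model(X)           # original PyTorch model on the same batch
    y_traced = loaded_module(X)  # reloaded TorchScript module

# The two outputs should agree to within floating point tolerance
assert torch.allclose(y_eager, y_traced, atol=1e-5)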

Triton Configuration

Using this model.pt file with the Triton Inference Server has allowed me to get inference working. The config.pbtxt file was as follows:

name: "cub200_resnet34"
            platform: "pytorch_libtorch"
            input [
                {
                    name: "INPUT__0"
                    data_type: TYPE_FP32
                    dims: -1
                    dims: 3
                    dims: 224
                    dims: 224
                }
            ]
            output [
                {
                    name: "OUTPUT__0"
                    data_type: TYPE_FP32
                    dims: -1
                    dims: 200
                }
            ]

The following Docker command (run manually for testing) was used to start the Triton server:

docker run --gpus=1 --rm --ipc=host -p8000:8000 -p8001:8001 -p8002:8002 -v/home/edmorris/models_repo:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models
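
Once the container is up, a quick readiness check against the HTTP endpoint confirms that the server and the model have loaded. This is only a sketch using the tritonclient package; the hostname and port are assumed from the docker command above:

from tritonclient import http

# Connect to the HTTP port mapped by the docker command
triton_client = http.InferenceServerClient(url='localhost:8000')

print('Server ready:', triton_client.is_server_ready())
print('Model ready :', triton_client.is_model_ready('cub200_resnet34'))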

The log output from successful model serving looks like this:

+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0611 14:14:37.897200 1 server.cc:527]
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0611 14:14:37.897278 1 server.cc:570]
+-----------------+---------+--------+
| Model           | Version | Status |
+-----------------+---------+--------+
| cub200_resnet34 | 1       | READY  |
+-----------------+---------+--------+

I0611 14:14:37.897359 1 tritonserver.cc:1658]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                              |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                             |
| server_version                   | 2.8.0                                                                                                                                              |
| server_extensions                | classification sequence model_repository schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics |
| model_repository_path[0]         | /models                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                          |
| strict_model_config              | 1                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                           |
| min_supported_compute_capability | 6.0                                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                                 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

I0611 14:14:37.898756 1 grpc_server.cc:3983] Started GRPCInferenceService at 0.0.0.0:8001
I0611 14:14:37.898976 1 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000
I0611 14:14:37.940972 1 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002

Testing inference of the Triton deployed model.

To check that inference was running correctly, I created a Python script that builds a dataloader, constructs the model from the checkpoint file in PyTorch, and runs both local Python inference and inference against the Triton server. The resulting class predictions were then compared.

This snippet is not complete on its own; you still need to create the model and a dataloader that serves image batches.

import argparse
import numpy as np
import sys
from functools import partial
import os
from tritonclient import grpc
import tritonclient.grpc.model_config_pb2 as mc
from tritonclient import http
from tritonclient.utils import triton_to_np_dtype
from tritonclient.utils import InferenceServerException
import torch
from clearml import InputModel, Task
import shutil
import pathlib

def run_inference(X, X_shape=(3, 224,  224), X_dtype='FP32', model_name='cub200_resnet34', input_name=['INPUT__0'], output_name='OUTPUT__0',
                  url='ecm-clearml-compute-gpu-002.westeurope.cloudapp.azure.com', model_version='1', port=8000, VERBOSE=False):
    url = url+':'+str(port)
    triton_client = http.InferenceServerClient(url=url, verbose=VERBOSE)
    model_metadata = triton_client.get_model_metadata(model_name=model_name, model_version=model_version)
    model_config = triton_client.get_model_config(model_name=model_name, model_version=model_version)

    input0 = http.InferInput(input_name[0], X_shape, X_dtype)
    input0.set_data_from_numpy(X, binary_data=False)
    output = http.InferRequestedOutput(output_name,  binary_data=False)
    response = triton_client.infer(model_name, model_version=model_version, inputs=[input0], outputs=[output])
    y_pred_proba = response.as_numpy(output_name)
    y_pred = y_pred_proba.argmax(1)

    return y_pred_proba, y_pred

# Get a validation batch
X, y = next(iter(val_loader))
# Set the model into eval mode
model.eval()
# Push input images to gpu
X_gpu = X.to(device)
# Run inference on the validation batch images
y_prob_pred = model(X_gpu)
# Get predicted classes
_, y_pred = torch.max(y_prob_pred, 1)

# Get Triton served predicted classes
y_pred_proba_remote, y_pred_remote = run_inference(X.numpy(), X.shape)

print('Result:: \ty\t\t:: {} \n\t \ty_pred[local]\t:: {} \n\t \ty_pred[triton]\t:: {} '.format(y.numpy(),y_pred.cpu().numpy(),y_pred_remote))
print('')

ecm200 commented 3 years ago

What is the best way to add the Torchscript model to the experiment?

Would it be through the use of the OutputModel class? What combination of calls would be best to use to upload the model? Is it ok to have more than 1 output model associated with an experiment? [I think it is, but I just wanted to be sure].

At the moment, the Ignite ClearMLSaver() handler takes care of saving the model checkpoint and uploading it to the clearml-server, with the file pushed to remote Azure storage. How do I ensure that the storage location is the same? Is it autogenerated?

Basically, what I would like to achieve is pushing the TorchScript-exported model to the same folder location as the PyTorch model weights, so that both files are organized together.

bmartinn commented 3 years ago

What is the best way to add the Torchscript model to the experiment? Would it be through the use of the OutputModel class?

Hmm, I think a straightforward solution would be to convert the model.pt at the end of the training process, then use OutputModel to store it.

# conversion code here
final_model = OutputModel()
final_model.update_weights('final_model_here.pt', auto_delete_file=True)

Is it ok to have more than 1 output model associated with an experiment? [I think it is, but I just wanted to be sure].

It is fully supported. Notice that with clearml-server v1.0+ these are also visible in the Task's Artifacts and Models tabs, as well as inside the model repository.

How do I ensure that the location of the storage is the same? Is it autogenerated? Basically, what I would like achieve is pushing the Torchscript exported model to same folder location as the PyTorch model weights, and thus having both those files organized together.

If Task.init was called with output_uri (or default_output_uri is configured in clearml.conf), then the OutputModel will automatically upload the weights file to the Azure storage, into the Task's unique folder, right next to the other weights files. Do note that the file name should be unique, to avoid overwriting previous checkpoints :)
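
A minimal sketch of that setup (the project and task names below are illustrative; the output_uri matches the Azure container used for the checkpoints in this thread):

from clearml import Task, OutputModel

# Point the task at the same Azure storage used for the Ignite checkpoints
task = Task.init(
    project_name='Caltech Birds/Training',          # illustrative
    task_name='train_cub200_resnet34',              # illustrative
    output_uri='azure://clearmllibrary/artefacts',  # or set default_output_uri in clearml.conf
)

# ... training and TorchScript conversion happen here ...

# Upload the converted model next to this Task's other weights files
final_model = OutputModel(task=task, name='cub200_resnet34 torchscript')
final_model.update_weights('model.pt', auto_delete_file=True)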

ecm200 commented 3 years ago

The function below creates a TorchScript version of my image classification model, adds a model artefact to the experiment, creates a new model object on the clearml-server, and uploads the file to the experiment directory on the remote storage service.

This TorchScript model can then be used, following the Triton serving example for clearml-serving, to deploy and serve the model as a remote endpoint for inference over HTTP.

Note: this snippet requires you to build a PyTorch model object and load the checkpoint weights of the best model from training, as well as a dataloader object that serves images for the model tracing [or just create randomly initialised tensors of the expected input size].

# Note: this method sits on the author's Trainer class and assumes os, datetime, tempfile,
# pathlib, furl, torch and clearml.OutputModel are imported at module level.
def trace_model_for_torchscript(self, dirname=None, fname=None, model_name_preamble=None):
        '''
        Function for tracing models to Torchscript.
        '''
        assert self.trainer_status['model'], '[ERROR] You must create the model to load the weights. Use Trainer.create_model() method to first create your model, then load weights.'
        assert self.trainer_status['val_loader'], '[ERROR] You must create the validation loader in order to load images. Use Trainer.create_dataloaders() method to create access to image batches.'

        if model_name_preamble is None:
            model_name_preamble = 'Torchscript Best Model'

        if dirname is None:
            dirname = tempfile.mkdtemp(prefix=f"ignite_torchscripts_{datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S_')}")
        temp_file_path = os.path.join(dirname,'model.pt')

        # Get the best model weights file for this experiment
        for chkpnt_model in self.task.get_models()['output']:
            print('[INFO] Model Found. Model Name:: {0}'.format(chkpnt_model.name))
            print('[INFO] Model Found. Mode URI:: {0}'.format(chkpnt_model.url))
            if "best_model" in chkpnt_model.name:
                print('[INFO] Using this model weights for creating Torchscript model.')
                break

        # Get the model weights file locally and update the model
        local_cache_path = chkpnt_model.get_local_copy()
        self.update_model_from_checkpoint(checkpoint_file=local_cache_path)

        # Create an image batch
        X, _ = next(iter(self.val_loader))
        # Push the input images to the device
        X = X.to(self.device)
        # Trace the model
        traced_module = torch.jit.trace(self.model, (X))
        # Write the trace module of the model to disk
        print('[INFO] Torchscript file being saved to temporary location:: {}'.format(temp_file_path))
        traced_module.save(temp_file_path) ### TODO: Need to work out where this is saved, and how to push to an artefact.

        # Build the remote location of the torchscript file, based on the best model weights
        # Create furl object of existing model weights
        model_furl = furl.furl(chkpnt_model.url)
        # Strip off the model path
        model_path = pathlib.Path(model_furl.pathstr)
        # Get the existing model weights name, and split the name from the file extension.
        file_split = os.path.splitext(model_path.name)
        # Create the torchscript filename
        if fname is None:
            fname = file_split[0]+"_torchscript"+file_split[1]
        # Construct the new full uri with the new filename
        new_model_furl = furl.furl(origin=model_furl.origin, path=os.path.join(model_path.parent,fname))

        # Upload the torchscript model file to the clearml-server
        print('[INFO] Pushing Torchscript model as artefact to ClearML Task:: {}'.format(self.task.id))
        new_output_model = OutputModel(
            task=self.task, 
            name=model_name_preamble+' '+self.task.name, 
            tags=['Torchscript','Deployable','Best Model', 'CUB200', self.config.MODEL.MODEL_NAME, self.config.MODEL.MODEL_LIBRARY, 'PyTorch', 'Ignite', 'Azure Blob Storage']
            )
        print('[INFO] New Torchscript model artefact added to experiment with name:: {}'.format(new_output_model.name))
        print('[INFO] Torchscript model local temporary file location:: {}'.format(temp_file_path))
        print('[INFO] Torchscript model file remote location:: {}'.format(new_model_furl.url))
        new_output_model.update_weights(
            weights_filename=temp_file_path,
            target_filename=fname
            )
        print('[INFO] Torchscript model file remote upload complete. Model saved to ID:: {}'.format(new_output_model.id))
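
For completeness, a hypothetical call sequence for the method above (the Trainer constructor argument is illustrative; create_model() and create_dataloaders() are the methods referenced in the asserts):

trainer = Trainer(config_file='cub200_resnet34.yaml')  # illustrative constructor
trainer.create_model()          # build the PyTorch model (sets trainer_status['model'])
trainer.create_dataloaders()    # build val_loader (sets trainer_status['val_loader'])

# Trace the best checkpoint to TorchScript and upload it to the ClearML server
trainer.trace_model_for_torchscript()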

qwaxys commented 1 year ago

Closing this issue as inactive, feel free to open a new issue and link to this one.