allegroai / clearml-serving

ClearML - Model-Serving Orchestration and Repository Solution
https://clear.ml
Apache License 2.0

Deploying Models from Azure Blob #43

Open ockaro opened 1 year ago

ockaro commented 1 year ago

Models stored on the ClearML server (created by Task.init(..., output_uri=True)) are served without problems, while models stored on Azure Blob Storage fail in different ways depending on the scenario:

  1. Start the docker container, add a model from the ClearML server, and afterwards add a model located on Azure (on the same endpoint) -> no error, HTTP requests are answered properly (but probably the model that was added first is the one being served).
  2. Start the docker container with no model added and first add a model from Azure -> error: 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
  3. Start the docker container when a model from Azure had already been added before -> error:
    clearml-serving-triton        | Error retrieving model ID ca186e8440b84049971a0b623df36783 []
    clearml-serving-triton        | Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
    clearml-serving-triton        | Traceback (most recent call last):
    clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 540, in <module>
    clearml-serving-triton        |     main()
    clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 532, in main
    clearml-serving-triton        |     helper.maintenance_daemon(
    clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
    clearml-serving-triton        |     raise ValueError("triton-server process ended with error code {}".format(error_code))
    clearml-serving-triton        | ValueError: triton-server process ended with error code 1

Side note: The same problem occurs when hosting the containers on Windows and on Linux. All Azure credentials are successfully set up as environment variables in the 'clearml-serving-inference', 'clearml-serving-triton' and 'clearml-serving-statistics' containers.

thepycoder commented 1 year ago

Hi There!

Thanks again for the detailed write-up. Would you mind testing if the following fix works? It seems like the clearml config file is not mounted inside the necessary containers. Make sure your Azure credentials are added in this config file :)

So you'd add:

    volumes:
      - $HOME/clearml.conf:/root/clearml.conf

to here: https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L77 and here: https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L107

If you can confirm this is working, we can make a PR and get this issue sorted out. Thanks a lot for your patience and cooperation!!

ockaro commented 1 year ago

Hi @thepycoder, thanks for your answer and sorry for my late reply. I finally managed to try your recommendations today and had the following findings on my local Windows machine (btw, I am using docker-compose-triton.yml, not the GPU version):

  1. When I just added the volume like you suggested I got the error msg="The \"HOME\" variable is not set. Defaulting to a blank string." right after calling docker-compose. Setting the HOME environment variable did not work so I added it to the .env file which is passed in the docker-compose and got rid of the error.
  2. I then needed to manually confirm on a popup that the docker container is allowed to access the clearml.conf file. This was not really an issue for now but could be when running solely via terminal?
  3. Fortunately, I got a promising additional error message; everything else remained as before.
    clearml-serving-triton        | E0217 10:21:25.908301 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
    clearml-serving-triton        | Info: syncing models from main serving service
    clearml-serving-triton        | Updating local model folder: /models
    clearml-serving-triton        | 2023-02-17 10:21:26,079 - clearml.storage - ERROR - Azure blob storage driver not found. Please install driver using: 'pip install clearml[azure]' or pip install '"azure.storage.blob>=12.0.0"'
    clearml-serving-triton        | Error retrieving model ID 9075dbebef6d4467801da808a6e39570 []
    clearml-serving-triton        | Info: Models updated from main serving service
    clearml-serving-triton        | reporting metrics: relative time 123 sec
    clearml-serving-inference     | Instance [3cf8c573a03e4341aa6f422465d5521b, pid=8]: New configuration updated
    clearml-serving-inference     | ClearML results page: https://app.clear.ml/projects/c8794acd9c594f4e9f9a9a55b9b76632/experiments/3cf8c573a03e4341aa6f422465d5521b/output/log
    clearml-serving-inference     | ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
    clearml-serving-inference     | ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start

    So it seems like the azure blob storage driver is not set up properly in the docker container? In the environment where I call docker-compose the requirement is already satisfied.
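
For reference, the workaround from point 1 can be sketched roughly as below. This is only an illustration under assumptions: the Windows path is a placeholder, only the triton service is shown, and the rest of the service definition is omitted.

```yaml
# Assumed content of the .env file passed to docker-compose (not YAML, so shown as a comment):
#   HOME=C:/Users/<your-user>
services:
  clearml-serving-triton:
    volumes:
      # With HOME defined in the .env file, compose can resolve the path
      # instead of defaulting it to a blank string.
      - ${HOME}/clearml.conf:/root/clearml.conf
```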

thepycoder commented 1 year ago

Hey @ockaro!

Thanks for checking back in!

  1. Interesting, we'll take a look at this!
  2. Could this be docker on windows behaviour? Or was it a popup of ClearML? Because the whole serving stack doesn't have a UI, I would think the popup is from docker itself, which we can do little about (we should fix it though, by not needing you to mount it manually in the first place)
  3. Could you try adding the following to your docker-compose config under the triton container: CLEARML_EXTRA_PYTHON_PACKAGES="azure-storage-blob". This should install the Azure blob storage driver for you. If this works, we'll add it to the default requirements :)
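
As a rough sketch of point 3, the variable could be set like this in the compose file. It assumes the service is named clearml-serving-triton (as in the log prefixes in this thread) and that an environment mapping is used; the rest of the service definition is omitted, and the shipped file may wire this variable differently.

```yaml
services:
  clearml-serving-triton:
    environment:
      # Extra Python packages for the container to install; here the Azure Blob
      # Storage driver that the "driver not found" log line asks for.
      CLEARML_EXTRA_PYTHON_PACKAGES: "azure-storage-blob"
```

The containers need to be recreated after the change so the new environment is picked up.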

ockaro commented 1 year ago

Hi @thepycoder, thanks again for your reply.

  1. Yes it was a popup from docker so proper mounting will probably fix this.
  2. I tried your tip and it worked! Thanks for the kind and smooth handling of my issue. :)

Do you need any further information?

thepycoder commented 1 year ago

@ockaro Awesome, thanks a lot for your patience here! We don't need anything else and are working to make the process more painless in the future. Thank you so much for your contributions!