Project-MONAI / monai-deploy

MONAI Deploy aims to become the de-facto standard for developing, packaging, testing, deploying and running medical AI applications in clinical production.
Apache License 2.0

Workflow won't trigger/complete on Linux server (with no GPU) #137

Closed justinhorton2003 closed 11 months ago

justinhorton2003 commented 1 year ago

I'm trying to remotely deploy monai-deploy-express on a Linux development server that currently has no NVIDIA GPU. Since the sample workflow (Lung Seg) from the Deploy Express documentation appeared to complete on my own machine using only the CPU (no GPU usage spike), I thought deploying to a Linux server without a dedicated GPU would be fine.

Creating and running the containers on the remote server went fine: I defined the workflows, got back workflow IDs, and uploaded the CT data to Orthanc. But when I sent the data to the DICOM modality, the segmentation task never finished or returned the segmented images. I checked docker container list -a but couldn't find the MONAI Lung Seg container that would be responsible for the task.

I then tried adding the NVIDIA runtime to the Docker daemon and installing CUDA and the NVIDIA Toolkit, to see if the workflow would at least trigger (even though there is no NVIDIA GPU). It did trigger, and I could see the MONAI Lung Seg container running, but nothing ever completed. Also, when I run docker logs with the container ID of the Lung Seg container, it doesn't display anything.

How can I make monai-deploy-express execute and complete at least the sample workflows on my GPU-less remote Linux server? (Maybe @mocsharp can swoop in and save the day again.) Cheers

SameerShanbhogue commented 1 year ago

Similar Issue

No final output.


SameerShanbhogue commented 1 year ago

Is GPU mandated for Monai Deploy Express setup?

Shambhuraj11 commented 1 year ago

Hi, I'm also facing the same issue. I am currently using a GPU, so I don't think this issue is GPU-related. I also receive a workflow ID, but no container is created for the monai-liver-seg app. The MAP is not executed, no output is produced, and monai-deploy-express does not report any error.

mocsharp commented 1 year ago

Hi @justinhorton2003 @SameerShanbhogue, @Shambhuraj11

Could you please share the version of MDE that you are running?

According to @MMelQin, the Lung Seg workflow should run on CPU.

I also tried the latest version (0.4.0). After I submitted the workflow and sent the dataset from Orthanc, I could find the Docker containers that the Task Manager started:

docker container ls -a
CONTAINER ID   IMAGE                                             COMMAND                  CREATED          STATUS                          PORTS                NAMES
39309f19f86    ghcr.io/mmelqin/monai_ai_lung_seg_app:1.0         "/bin/bash -c 'pytho…"   21 seconds ago   Up 19 seconds                   6006/tcp, 8888/tcp   0ae77d9a-e14c-4c75-963a-b5f24a2ad9c0
b516a31be9d3   ghcr.io/mmelqin/monai_ai_livertumor_seg_app:1.0   "/bin/bash -c 'pytho…"   2 minutes ago    Exited (0) About a minute ago                        7c65f7ec-2eac-4829-ba43-6210a77e03e0

I was able to see the logs from the containers and the segmentation output in Orthanc after a few minutes.
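The same check can be scripted, which helps when confirming whether the Task Manager launched a MAP container at all. The following is a minimal sketch, assuming the Docker CLI is on the PATH and using its JSON output format; the helper names are hypothetical, and the default image substring is simply taken from the listing above.

```python
import json
import subprocess


def parse_ps_output(text):
    """Parse the output of `docker container ls -a --format '{{json .}}'`.

    Docker prints one JSON object per line, one line per container.
    """
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_containers(rows, image_substring):
    """Return (name, status) pairs for containers whose image matches."""
    return [
        (row.get("Names", ""), row.get("Status", ""))
        for row in rows
        if image_substring in row.get("Image", "")
    ]


def list_map_containers(image_substring="monai_ai_lung_seg_app"):
    """Query the local Docker daemon and report matching containers."""
    result = subprocess.run(
        ["docker", "container", "ls", "-a", "--format", "{{json .}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_containers(parse_ps_output(result.stdout), image_substring)
```

An empty result from list_map_containers() after a study has been sent suggests the container was never launched, which matches the symptom reported in this thread; a status starting with "Exited" means the container ran and then stopped.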

If possible, please attach the logs for us to investigate:

docker logs mdl-ig > mdl-ig.txt
docker logs mdl-wm > mdl-wm.txt
docker logs mdl-tm > mdl-tm.txt
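For anyone gathering these logs repeatedly, the three commands above can be wrapped in a small script. This is only a sketch under stated assumptions: the container names are the ones used in the commands above, the function names are hypothetical, and stderr is merged into each file because, for containers started without a TTY, `docker logs` writes the container's stderr stream to its own stderr, which a plain `>` redirect does not capture.

```python
import subprocess
from pathlib import Path

# Container names taken from the commands above.
MDE_CONTAINERS = ("mdl-ig", "mdl-wm", "mdl-tm")


def logs_command(container):
    """Build the `docker logs` command line for one container."""
    return ["docker", "logs", container]


def collect_logs(containers=MDE_CONTAINERS, out_dir="."):
    """Write each container's log to <out_dir>/<container>.txt.

    stdout and stderr are merged into the same file so that error
    messages are not lost.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name in containers:
        target = out_path / f"{name}.txt"
        with target.open("wb") as handle:
            subprocess.run(
                logs_command(name),
                stdout=handle,
                stderr=subprocess.STDOUT,
                check=False,  # keep going if one container is missing
            )
        written.append(target)
    return written
```

Calling collect_logs(out_dir="mde-logs") would produce mdl-ig.txt, mdl-wm.txt, and mdl-tm.txt in that directory, ready to attach to the issue.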

Thank you!

/cc @JHancox

JHancox commented 1 year ago

Sorry to hear of this issue. Unfortunately I'll be on vacation for a couple of weeks in about 5 mins time! If still on-going, will look into this when I return (>=21st August).

MMelQin commented 1 year ago

Thank you @justinhorton2003 for your question and for the steps you took to isolate the issue. As you verified, the example MONAI Application Package (MAP) works without a GPU, though, as you must have noticed, inference then takes a very long time (~30 minutes). With a GPU (6+ GB of memory), the whole application, DICOM in and DICOM results out, finishes in less than 60 seconds.

The issue observed with MONAI Deploy Express is likely related to how the Task Manager plugin, a component of MDE, launches the MAP container. More precisely, it concerns the lifecycle management of the container: in the workflow definition for the lung seg MAP, the container timeout, task_timeout_minutes, is set to 5 minutes.

Please increase this timeout value based on the time you observed when running the MAP standalone with CPU only, in the MDE container. Setting it to 0 also works but not

@JHancox @mocsharp
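The timeout change described above could be applied with a short script rather than by hand. This is a hedged sketch, not the project's own tooling: it assumes the workflow definition is a JSON file, it makes no assumption about where task_timeout_minutes sits inside that file (it searches the whole document), and the 1.5x safety margin and all function names are hypothetical choices for illustration.

```python
import json
import math


def suggested_timeout_minutes(observed_minutes, margin=1.5):
    """Round an observed CPU-only run time up, with a safety margin."""
    return math.ceil(observed_minutes * margin)


def set_task_timeout(node, minutes, key="task_timeout_minutes"):
    """Recursively set every `key` found in a parsed JSON document.

    Returns the number of values changed. The original value type is
    kept, so a timeout stored as the string "5" becomes the string "45".
    """
    changed = 0
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                node[name] = str(minutes) if isinstance(value, str) else minutes
                changed += 1
            else:
                changed += set_task_timeout(value, minutes, key)
    elif isinstance(node, list):
        for item in node:
            changed += set_task_timeout(item, minutes, key)
    return changed


def patch_workflow_file(path, minutes):
    """Load a workflow definition, raise its timeouts, and save it back."""
    with open(path, encoding="utf-8") as handle:
        workflow = json.load(handle)
    changed = set_task_timeout(workflow, minutes)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(workflow, handle, indent=2)
    return changed
```

With the ~30 minute CPU run time mentioned above, suggested_timeout_minutes(30) gives 45. The edited definition would presumably need to be submitted to MDE again for the new value to take effect; that step is an assumption and is not shown here.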

MMelQin commented 1 year ago

Hi @SameerShanbhogue, thanks for the question. I'm not sure whether the error you saw still persists. The Task Manager failed to launch the MAP container because it could not load libnvidia-ml.so. Please also see my reply above regarding the long execution time of the MAP container on CPU, and the solution. @mocsharp @JHancox
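A quick way to see whether a host exposes the NVIDIA management library at all is to ask the system linker for it. The sketch below uses only the Python standard library; the function name is hypothetical, and whether this mirrors the exact lookup the Task Manager performs is not known from this thread, so treat it as a rough host-side diagnostic only.

```python
import ctypes.util
import shutil


def nvidia_host_report():
    """Report whether the NVML library and nvidia-smi are visible on this host."""
    return {
        # Name of the NVML shared library (e.g. "libnvidia-ml.so.1"), or None.
        "nvml_library": ctypes.util.find_library("nvidia-ml"),
        # Full path to the nvidia-smi tool, or None if it is not on the PATH.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }
```

On a server without an NVIDIA driver installed, both values would normally be None, which is consistent with the load failure described above.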

SameerShanbhogue commented 1 year ago

mdl-ig.txt mdl-tm.txt mdl-wm.txt @mocsharp

SameerShanbhogue commented 1 year ago

@mocsharp image

justinhorton2003 commented 1 year ago

Hey! Thanks everyone for all the interaction. I believe I am running the latest version (0.4?); I pulled the repo a couple of weeks ago, but I can double-check in a bit. I'll also try the suggested solution. Thanks a lot @MMelQin, @mocsharp and everyone else for responding.

Shambhuraj11 commented 1 year ago

Hi @mocsharp. Thanks for replying. My issue is now resolved for the sample workflows.

SameerShanbhogue commented 1 year ago

Hi

I found that the issue is related to the prerequisite environment stated for the implementation: the NVIDIA CUDA Toolkit and NVIDIA Container Toolkit, which I believe require a GPU. After moving to an AWS EC2 instance with a GPU, the issue was resolved.

MMelQin commented 11 months ago

@SameerShanbhogue Great to hear that the issue was resolved, so I will mark this as closed.