Kipok / NeMo-Skills

A pipeline to improve skills of large language models
https://kipok.github.io/NeMo-Skills/
Apache License 2.0

How to clean up an interrupted run? #188

Closed by yld3 3 weeks ago

yld3 commented 3 weeks ago

One task was interrupted while I was testing the omni-math dataset. When I tried to rerun the command

ns eval \
    --cluster=local \
    --model=/workspace/openmath2-llama3.1-8b-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath2-llama3.1-8b-eval \
    --benchmarks=omni-math:0 \
    --server_gpus=1 \
    --num_jobs=1 \
    ++prompt_template=llama3-instruct \
    ++batch_size=512 \
    ++inference.tokens_to_generate=4096

I got an error:

           ("Conflict. The container name "/nemo-run-0" is already in use by container
           "47cd595e14772f25e6d6f6b20e6a093efaad619b27176fbbae72cde6d4517feb". You have
           to remove (or rename) that container to be able to reuse that name.")

However, I am not sure how to clean up this nemo-run-0 container to start afresh. Could you please assist? Thanks!

EDIT: Found the solution. I just had to kill the leftover Docker containers that docker ps showed as still running.

Kipok commented 3 weeks ago

That's right. We will try to improve this in the future (see https://github.com/Kipok/NeMo-Skills/issues/155), but for now your approach is indeed the recommended way. Either run docker ps -a and selectively kill the hanging containers, or, if you don't have any containers that you need to keep, stop everything with a single command: docker stop $(docker ps -a -q)
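
For anyone who wants something between picking containers by hand and stopping everything, here is a minimal sketch of a helper that only touches containers whose name contains a given string. The function name cleanup_containers is hypothetical, and the default prefix nemo-run is an assumption taken from the error message in this issue; other runs may use different names. It relies only on the standard docker ps filters and docker rm -f, and it removes the containers rather than just stopping them, so the name becomes free again.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NeMo-Skills): force-remove containers
# whose name contains the given string. The default "nemo-run" is an
# assumption based on the "/nemo-run-0" name in the error message.
cleanup_containers() {
    local prefix="${1:-nemo-run}"
    local ids
    # -a also lists stopped containers, -q prints only container IDs,
    # --filter name=... keeps containers whose name contains the string.
    ids="$(docker ps -a -q --filter "name=${prefix}")"
    if [ -z "$ids" ]; then
        echo "no containers matching '${prefix}'"
        return 0
    fi
    # rm -f stops a running container and deletes it in one step.
    # $ids is intentionally unquoted so each ID becomes its own argument.
    docker rm -f $ids
}
```

After sourcing the file, running cleanup_containers with no argument would target the nemo-run containers, and cleanup_containers some-other-name would target a different set. Because the name filter is a substring match, it is worth checking docker ps -a first to make sure nothing unrelated shares the string.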