Nukesor / pueue

:stars: Manage your shell commands.
MIT License

A process queue from which pueue allocates tasks to several groups as soon as a group's running process count is lower than its parallel limit. #476

Closed · crazyn2 closed this issue 7 months ago

crazyn2 commented 8 months ago

A detailed description of the feature you would like to see added.

A process queue from which pueue allocates tasks to several groups as soon as a group's running process count is lower than its parallel limit.

Explain your use case for the requested feature

I have three GPUs, but the processes running on them don't finish at the same time. I want pueue to allocate a process to the free or least-used GPU, and I want to be able to add processes to the queue at any time.

Alternatives

No response

Additional context

No response

Nukesor commented 8 months ago

Hey @crazyn2

Something probably got lost in translation and I'm not 100% sure what your requested feature would look like.

From what I've gathered, this sounds a lot like the feature requested in https://github.com/Nukesor/pueue/issues/218. This has been implemented, even though it really isn't documented anywhere. Somebody should probably write a Wiki page for this :sweat_smile:.

This feature allows tasks to be called like this: `command --gpu $PUEUE_WORKER_ID some_other_parameters`. If multiple jobs should run per GPU, the command itself would need to handle that.
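
For reference, a minimal sketch of that pattern, assuming a dedicated group named `gpu`, a parallel limit equal to the number of GPUs, and a placeholder training command:

```bash
# Rough sketch of the $PUEUE_WORKER_ID pattern (group name, parallel limit and
# command are just examples).

# Dedicated group with one running task per GPU.
pueue group add gpu
pueue parallel 3 --group gpu

# Every running task in the group gets a worker id (0..parallel-1) in
# $PUEUE_WORKER_ID; single quotes defer expansion until the task actually runs.
pueue add --group gpu -- 'CUDA_VISIBLE_DEVICES=$PUEUE_WORKER_ID python train.py'
```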

Pueue is not a high-performance scheduler with complex logic, and it's not planned to make it one. There are other tools out there that specifically tackle complex cluster job management.

In case I completely misunderstood your feature request: Could you write a more detailed explanation, for instance with a proper example and the exact way it would work, including a step-by-step description of the behavior?

crazyn2 commented 8 months ago

I apologize for not expressing myself clearly. I want a queue to which I can add processes freely, plus a limit on the number of running processes, for example 2. Whenever the number of running processes in any group is less than 2, one process is pulled from the queue and assigned to that underutilized group. I've read #218 and the wiki, but they don't meet my expectations. This is my bash script, which has the behavior I expect:

# Token semaphore on fd 4: one token per allowed concurrent process.
mkfifo mylist
exec 4<>mylist
rm -rf mylist

# Lock (fd 5) protecting the per-GPU usage counter array
mkfifo mylist1
exec 5<> mylist1
# exec 5<> mylist1
rm -rf mylist1
# echo "${cuda_arry[@]}">&$arr_lock
# echo "0 0 0 0">&
# cuda2gpu=(1 2 0)
# gpu2cuda=(2 0 1)
# Per-GPU process capacity; falls back to values derived from $low_prc_num (set elsewhere).
if [ -z "$gpu_pool" ]; then
    gpu_pool=($((low_prc_num*2-2)) "$((low_prc_num+1))" "$((low_prc_num+1))")
fi
pool_sum=0
for(( i=0;i<${#gpu_pool[@]};i++)); do
    pool_sum=$((pool_sum+${gpu_pool[$i]}))
done;

for ((i=0; i < pool_sum; i++)); do
    echo >&4
done
echo "0 0 0">&5

# Pick the first GPU whose running-process count is below its capacity,
# increment its counter, and export it via CUDA_VISIBLE_DEVICES.
acquire_cuda(){
    # set -x
    local cuda_arry
    read -r -u5 -a cuda_arry
    # echo "${cuda_arry[@]}"

    for i in "${!cuda_arry[@]}";
    do   
        if [ "${cuda_arry[$i]}" -lt "${gpu_pool[$i]}" ]; then

            cuda_arry[i]=$((${cuda_arry[$i]}+1))

            export CUDA_VISIBLE_DEVICES=$i
            break
        fi
    done
    echo "${cuda_arry[@]}" >&5
    # set +x
}
# Decrement the running-process counter of the GPU this job ran on.
release_cuda(){
    local cuda_arry
    read -ru5 -a cuda_arry
    # echo $CUDA_VISIBLE_DEVICES
    index=$CUDA_VISIBLE_DEVICES
    cuda_arry[index]=$((${cuda_arry[$index]}-1))
    echo "${cuda_arry[@]}" >&5

}
# Launch ten jobs; each one waits for a free slot (a token on fd 4), grabs a
# GPU, runs $cmd via eval, then releases the GPU and returns the token.
ten_classes(){
    for num in {0..9}; do
        read -ru4
        echo "$num"
        {
            acquire_cuda "$num"
            eval "$cmd"
            release_cuda
            echo >&4
        } &
        sleep 1
    done
}

I have three GPU cards and I've set the maximum number of processes per card to 2. When a process finishes, it sends a signal (a token) to fd 4. A new process then checks the number of running processes on each card and is assigned to a card with fewer than 2 running processes.
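
For context, a hypothetical way of driving the snippet above; the file name, the pool sizes, and the command are placeholders I'm assuming here:

```bash
# Hypothetical driver for the snippet above (all names/values are placeholders),
# assuming the snippet is saved as gpu_queue.sh and sourced from the same shell.

gpu_pool=(2 2 2)                     # at most 2 concurrent processes per GPU card
cmd='python train.py --run "$num"'   # command each queued job runs via eval
source ./gpu_queue.sh                # sets up the fds, counters and functions
ten_classes                          # push ten jobs through the GPU queue
wait                                 # block until all background jobs finish
```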

Nukesor commented 7 months ago

I thought a lot about this, and it's not something that'll be added to Pueue. Pueue is not designed to be a complex task scheduler, but rather a small scheduler for server maintainers and hobbyists.

I still think it's possible to write a wrapper script on top of the current functionality that maps worker IDs to external worker pools of varying size, but this is not logic that'll be added to pueue.
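
As a rough illustration only, such a wrapper could look something like the sketch below; the file name, the slot layout, and the `pueue add` invocation are made-up examples, and it assumes the group's parallel limit equals the total number of slots:

```bash
#!/usr/bin/env bash
# gpu_wrapper.sh (hypothetical): map a pueue worker id onto a pool of GPU slots.
# Assumes the task runs in a group whose parallel limit equals ${#slots[@]},
# so $PUEUE_WORKER_ID is always a valid index.

# Two slots per GPU card, i.e. at most two concurrent jobs per card.
slots=(0 0 1 1 2 2)

export CUDA_VISIBLE_DEVICES="${slots[$PUEUE_WORKER_ID]}"
exec "$@"
```

A task would then be queued as something like `pueue add -g gpu -- ./gpu_wrapper.sh python train.py`, with the group's parallel limit set to 6.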

You might want to look at professional cluster management systems. I think our university used Slurm for such tasks.

Anyhow, thanks for the detailed feature request :) Have a nice day!