Nukesor / pueue

:stars: Manage your shell commands.
Apache License 2.0

User Feedback - Please read if you have some time #351

Closed Nukesor closed 2 years ago

Nukesor commented 2 years ago

Hey :)

Pueue has grown into a fairly popular project, at least judging by the GitHub star count.

However, I don't really get a lot of feedback from Pueue's users. I don't even know how many of you are out there. Hundreds, thousands, tens of thousands, I really don't know!

The idea of this thread is to give you an easy way to provide me with some feedback. Here are a few things I'm quite curious about:

You don't have to answer all of the questions, or any at all; I'm just looking forward to getting some constructive feedback or, in the best case, some messages from users telling me that they're simply satisfied with the current state of the project :)

mjpieters commented 2 years ago

I am extremely happy with pueue. It fits my needs very, very well. It's why I have contributed to the project, and probably will again in the future.

I currently use it to handle hundreds of downloads, including setting limits on concurrency in a multi-step pipeline. I have a series of small bash scripts that help out, but most of the hard work is done by pueue. Some items need to have their URL resolved, but that can only run one process at a time, so that's the first queue; it puts items into the download queue once the URL is resolved. The download queue has a higher parallelism setting (I tend to tweak that), and completed downloads are then further processed (unpacking, resizing, detecting issues), etc. The downloads can take anywhere between 20 minutes and 3 hours to complete, and I run this on a puny old HP microserver that's over a decade old, so post-processing runs a single job at a time, and those jobs can last up to 30 minutes as well. The jobs flow through these three queues effortlessly. And if downloads stall, restarting the failed jobs is just a pueue restart -ia away.
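For anyone curious, a staged pipeline like this maps onto pueue groups roughly as follows. This is only a sketch against a running pueued daemon; the group names and the `resolve-url.sh` script are placeholders, not my actual setup:

```shell
# One group per pipeline stage.
pueue group add resolve
pueue group add download
pueue group add postprocess

# Per-stage concurrency: URL resolution is serial, downloads run
# in parallel (tweak to taste), post-processing is serial again.
pueue parallel 1 --group resolve
pueue parallel 4 --group download
pueue parallel 1 --group postprocess

# Enqueue work into a stage (placeholder commands).
pueue add --group resolve -- ./resolve-url.sh "$url"
pueue add --group download -- wget -c "$resolved_url"

# The "pueue restart -ia" mentioned above, spelled out:
# restart all failed tasks, editing them in place.
pueue restart --in-place --all-failed
```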

I've created a few small jq scripts to help monitor it all; I've shared my pueue_status.jq script (it gives a nice, compact summary, complete with ANSI colors, of all queued jobs), as well as my wget job summariser script.

I'm not yet aware of others using it, but I'll happily gush about it the moment there is an opportunity ;-)

Thank you, again, for creating this project!

Nukesor commented 2 years ago

Thanks a lot for your detailed response and your kind words!

I'm amazed to see Pueue being used to drive such an elaborate setup :D. I didn't expect that, to be honest, but it's awesome to see that its feature set is suited to such a task!

SammyRamone commented 2 years ago

I am very happy with pueue (just one pain point, see below). I use it to run experiments for my PhD thesis, which mostly means robotic simulators, used either to optimize parameters or for reinforcement learning. The experiments for this paper were run with pueue: https://www.researchgate.net/publication/362076721_Bipedal_Walking_on_Humanoid_Robots_through_Parameter_Optimization The tasks typically run between a few hours and a few days. Often I want to run the same task multiple times (science :tada: ); this is easy, as I just repeat the terminal command to add the task a few times.

I use multiple machines, each with its own instance of pueue. This was a bit confusing at first because some pueue files were in the home dir, which was shared across these machines, but I could quickly resolve this by putting those files locally on the machines. Naturally, pueue does no management across machines, but this is not necessary for my use case. I just roughly balance the load on the machines when adding the tasks, since I typically know beforehand how long they will take. I also run multiple tasks in parallel and sometimes use the group feature, e.g. to balance the usage of the two GPUs in the same machine. I also really like pueue's logging, because I often need to look into the logs.
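The group-per-GPU balancing described above can be sketched like this; it assumes a running pueued daemon, and the group names, the `train.py` script, and the CUDA_VISIBLE_DEVICES trick are illustrative, not taken from the actual setup:

```shell
# One group per GPU, each running one experiment at a time.
pueue group add gpu0
pueue group add gpu1
pueue parallel 1 --group gpu0
pueue parallel 1 --group gpu1

# Pin each task to its GPU via CUDA_VISIBLE_DEVICES.
pueue add --group gpu0 -- env CUDA_VISIBLE_DEVICES=0 python train.py --seed 1
pueue add --group gpu1 -- env CUDA_VISIBLE_DEVICES=1 python train.py --seed 2

# Inspect the stored output of task 0 later on.
pueue log 0
```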

The only real issue that I had is killing running tasks. My tasks often consist of multiple layers of child processes (due to multiple simulators and inverse kinematics solvers being run in one reinforcement learning process). If I use the normal kill function of pueue, it will leave zombie processes of my simulators, which use up my memory. I also tried sending signals via pueue to terminate the processes correctly, but this did not work out either. In the end I did not invest further work into it, as it was not super important for me, and I have no experience programming in Rust, so forking pueue was not an option for me.

I think that other members of the Hamburg Bit-Bots use it for experiments, too. For example @Flova. You can ping me for feedback.

Nukesor commented 2 years ago

@SammyRamone Nice, thanks for your feedback!

It's cool to know that Pueue is (presumably?) running on the RRZ cluster :D

Naturally, pueue does no management across machines, but this is not necessary for my use case.

I've already gotten 2-3 requests to add cluster scheduling functionality to pueue. There's definitely a need for a modern cluster scheduling system! It would be such a cool project; sadly, I really don't have a use for it myself :D

If I use the normal kill function of pueue it will leave zombie processes of my simulators which will use up my memory.

This is actually a known problem and should be covered by the --children flag, which does exactly what you described: it sends the signal not only to the direct children, but also to the children's children. If this still doesn't work, your processes are nested too deeply and some component doesn't properly forward signals. The only known way to resolve this is to catch the signal in your scripts and propagate it accordingly :/
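When some layer swallows signals, one workaround is a small wrapper that traps and forwards them. A minimal sketch (the function name is made up; the workload is whatever command you pass in):

```shell
# run_forwarding CMD... : run CMD in the background and forward
# TERM/INT to it, so nested children get a chance to clean up
# instead of being left behind as zombies.
run_forwarding() {
  "$@" &                  # start the real workload
  local child=$!
  # On TERM/INT, pass the signal along and wait for the child.
  trap 'kill -TERM "$child" 2>/dev/null; wait "$child"' TERM INT
  wait "$child"
}
```

A task could then be queued as, say, `pueue add -- ./wrapper.sh python train.py`, where `wrapper.sh` defines this function and calls `run_forwarding "$@"`.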

SammyRamone commented 2 years ago

It's cool to know that Pueue is (presumably?) running on the RRZ cluster :D

It is actually just a few machines in the lab, no real cluster.

I already got 2-3 requests to add cluster scheduling functionality to pueue. There's definitely a need for a modern cluster scheduling system! This would be such a cool project, sadly I really don't have a use for this :D

Yeah, it seems like all cluster scheduling systems are old and unintuitive. But I think it is good to keep pueue as it is; otherwise it would get more complex.

This is actually a known problem and should be covered by the --children flag, which does exactly what you described. It sends the signal not only to the direct children, but rather to the children's children as well. In case this still doesn't work, your processes are nested too deeply and some component doesn't properly forward signals. The only known way to resolve this is to catch the signal in your scripts and propagate them accordingly :/

I tried that flag, and it did not work either. But it could be an issue on my end, in the RL framework that I use, or in the simulator; I have a lot of layers of subprocesses. As I said, it is just not really worth the effort, because in the case that I need to kill a task, I can use htop to kill the zombie processes.

Flova commented 2 years ago

I think that other members of the Hamburg Bit-Bots use it for experiments, too. For example @Flova.

Yeah, I used it to schedule many long-running experiments/ablations in these three projects.

I really prefer it over a screen session, and it is also nice that the logs are stored/organized neatly, so you can go back and look at what command was started and what the output and exit code were.

It's cool to know that Pueue is (presumably?) running on the RRZ cluster :D

As @SammyRamone said, these experiments are running on our own workstations (5 nodes with Ryzen 9 or Threadripper CPUs + GPU & a 10 Gig network), but we are in contact with the RRZ regarding additional resources.

I had some issues in the past where the pueue daemon was terminated, but some configuration for running systemd user services for users with no active session solved it for me.
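For reference, the usual systemd recipe for this is enabling lingering, so user services keep running even without a login session. A sketch of what this can look like, assuming pueued is installed as a systemd user service named `pueued` (the exact unit name depends on your setup):

```shell
# Allow this user's services to keep running without an active
# login session (may require root or polkit authorization).
loginctl enable-linger "$USER"

# Enable and start the daemon as a user service.
systemctl --user enable --now pueued
```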

chipbuster commented 2 years ago

I'm a happy user of several months.

I currently use it to schedule long-running Julia simulations for research purposes. I really like being able to see what flags I ran with and how long things took without having to explicitly instrument my code, something which my previous approach (the old tmux-and-detach) made very difficult. It makes planning subsequent runs much easier when I can see what I did last time and how long it took.

The only issue I've had is that if my job uses too much memory, pueued gets OOM-killed. I don't think there's really a way to avoid that, and also I should stop running jobs that use that much RAM.

I had a colleague ask me how to set up a SLURM queue on their personal machine, and I pointed them to pueue instead---I don't know if they ended up using it, but I definitely consider this to be the best program out there right now for managing work queues on a personal scale.

Nukesor commented 2 years ago

Thanks for your feedback, everybody :)

It gave me a good understanding of how this project is used and of some of the heavy-duty use cases you manage to handle with it :)

If any of you plans to build a commercially backed, open-source solution for a modern pueue-like cluster task scheduling system, feel free to text me :D

Nukesor commented 1 year ago

Hey :)

The v3.0.0 release candidate has just been published, and it would be awesome if you could try it out. There's new process handling logic via process groups, and a bit of heavy-duty testing would probably be a good idea :)