Open · karstm opened this issue 1 year ago
Hi Matthias, you can also try running the experiments on your own PC. People have reported a considerable speed-up when running them locally (as expected). See issue #10.
Some other people have commented that launching the jobs during the morning seems to help prevent errors.
Hi Miguel, thank you for your answer! It does run locally. It seems that task 1 will take about 4 hours on a 3.3 GHz 6-core i5...
I do, however, have to question what exactly the point is of having tens of students execute the same task with the same parameters.
Using tens of computing hours for a graded exercise seems counterproductive to me.
Best Matthias
Running the experiments with the provided settings is meant as a baseline so you can experience what a successful run of the experiments looks like. However, even with the default parameters, different runs will lead to slightly different results due to the randomness involved in the training process.
Asking students to perform hyperparameter tuning from scratch would have been too time- and compute-intensive. At least with this experience, we hope you can get an idea of the challenges involved in training RL policies in high-dimensional settings.
If you want to dive deeper, you can also try different hyperparameters for the bonus section of the exercise.
Of course, feedback and ideas on how to improve this type of assignment for future iterations are more than welcome. Feel free to send me an email.
I also have this happen constantly, and it even happens locally sometimes. As a workaround, I've been running the training without the xvfb-run part and at the same time removing the -vr flag. Training works fine and completes without any of the annoying xvfb-run related errors. @MiguelZamoraM is that OK?
@billyzs Locally, you don't need xvfb-run. That tool is meant to generate videos in a server environment. Please put the links to your experiments in the README so that I can take a look.
Thank you again for your answer. My projection was wrong; a run takes 1.5 to 2 hours locally.
I do agree that it is a helpful experience to do the training and visually see the improvements.
I also understand that, when the exercise was designed, the training was not expected to randomly crash on the cluster. It's just a bad feeling when things beyond your control fail in a graded setting.
I will think about ways that could alleviate this problem and write an email if I come up with something.
Best Matthias
@MiguelZamoraM I just pushed the links. Actually, for the runs where I didn't use xvfb-run, no videos were generated. If the videos are part of the grading, I can record them locally using the saved model weights. Would that be OK?
+1, I have tried 25 times since the weekend; it worked 3 times by pure luck, and 2 jobs are still missing...
@billyzs The video part is really more about your educational experience. Without the visualization of how the policies behave while you are training, it can be difficult to judge how well or how badly things are working. From the plots, you can get an idea that the training works well, but when we record the videos we are also running the policies for a longer time horizon than the one that was set during training. So, this is also a bit of a test of generalization.
I would encourage you to run the experiments with the -vr argument to generate the videos. You could also create the videos from the saved model weights, but then you would have to figure out how to add those files to the corresponding experiments in Weights & Biases.
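If you do go the local route, something along these lines might work; this is only a minimal sketch, assuming a Gymnasium-style environment and the wandb Python API. The environment id, the W&B project and run id, and the random-action stand-in are illustrative placeholders, not the exercise's actual code.

```python
# Rough sketch: record a rollout video locally, then attach it to an existing
# Weights & Biases run. Environment id, project/run id, and the policy call are
# placeholders -- substitute the exercise's actual environment and saved weights.
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
import wandb

# Wrap the environment so the episode is written to ./videos as an .mp4 file.
env = RecordVideo(
    gym.make("CartPole-v1", render_mode="rgb_array"),  # placeholder env id
    video_folder="videos",
    episode_trigger=lambda episode_id: True,
)

obs, _ = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # placeholder: act with your loaded policy here
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()

# Resume the existing W&B run by its id so the video lands in that experiment
# instead of creating a new run.
run = wandb.init(project="rl-exercise", id="your-run-id", resume="must")  # placeholders
# File name below follows RecordVideo's default naming scheme.
run.log({"rollout_video": wandb.Video("videos/rl-video-episode-0.mp4", fps=30, format="mp4")})
run.finish()
```

Note that resume="must" makes wandb fail loudly if the run id is wrong, rather than silently starting a new run.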
If time constraints don't allow you to run all the experiments with videos, keep the links of the experiments that you already have. Those will get you most of the points.
@johannesg98 As suggested, try to run things locally. Also, since you have tried running so many experiments, and each experiment generates lots of files depending on when it crashes, remember to check the amount of disk space you have available using the command lquota. A full disk might also cause jobs to crash.
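To see which leftover outputs are actually eating your quota, a small sketch like the following can help; the "runs" directory name is a guess, so point it at wherever the exercise writes its outputs.

```python
# Quick sketch: list experiment output directories by size so leftovers from
# crashed runs are easy to spot and delete. "runs" is a placeholder path.
from pathlib import Path

def dir_size_mb(path: Path) -> float:
    """Total size of all regular files under `path`, in megabytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

outputs = Path("runs")  # placeholder: point this at the experiment output folder
for directory, size in sorted(
    ((d, dir_size_mb(d)) for d in outputs.iterdir() if d.is_dir()),
    key=lambda item: item[1],
    reverse=True,
):
    print(f"{size:10.1f} MB  {directory}")
```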
Hi, I cannot get any job to finish on the cluster. I tried both conda and venv, and I tried not running any tasks in parallel.
They crash quite early for me (within the first hour of training) with the error message
/usr/bin/xvfb-run: line 186: kill: (xxxxxx) - No such process
in the xxx_reward.err file. This is mentioned in the closed issue #8; however, there isn't really a solution other than "try again". This can't be the goal of a graded exercise where one subtask takes hours.
All my rewards pass on GitHub.
It's pretty frustrating
Best Matthias