Open · karstm opened this issue 1 year ago
Hi Matthias, you can also try running the experiments on your own PC. People have reported a considerable speed-up when running them locally (as expected). See issue #10.
Some other people have commented that launching the jobs during the morning seems to help prevent errors.
Hi Miguel, thank you for your answer! It does run locally. It seems that task 1 will take about 4 hours on a 3.3 GHz 6-core i5...
I do, however, have to question what exactly the point is of having tens of students execute the same task with the same parameters.
Using tens of computing hours for a graded exercise seems counterproductive to me.
Best Matthias
Running the experiments with the provided settings is meant as a baseline so you can experience what a successful run of the experiments looks like. However, even with the default parameters, different runs will lead to slightly different results due to the randomness involved in the training process.
Asking students to perform hyperparameter tuning from scratch would have been too time- and compute-intensive. At least with this experience, we hope you can get an idea of the challenges involved in training RL policies in high-dimensional settings.
If you want to dive deeper, you can also try different hyperparameters for the bonus section of the exercise.
Of course, feedback and ideas on how to improve this type of assignment for future iterations are more than welcome. Feel free to send me an email.
I also have this happen constantly, and it even happens locally sometimes. As a workaround, I've been running the training without the xvfb-run part and at the same time removing the -vr flag. Training works fine and completes without any of the annoying xvfb-run related errors. @MiguelZamoraM is that OK?
@billyzs Locally, you don't need xvfb-run. That tool is meant to generate videos in a server environment. Please put the links to your experiments in the README so that I can take a look.
Thank you again for your answer. My projection was wrong; a run takes 1.5 to 2 hours locally.
I do agree that it is a helpful experience to do the training and visually see the improvements.
I also understand that, when the exercise was designed, the training was not expected to randomly crash on the cluster. It's just a bad feeling when things beyond your control fail in a graded setting.
I will think about ways that could alleviate this problem and write an email if I come up with something.
Best Matthias
@MiguelZamoraM I just pushed the links. Actually, for the runs where I didn't use xvfb-run, no videos were generated. If the videos are part of the grading, I can record them locally using the saved model weights. Would that be OK?
+1, I have tried 25 times since the weekend; it worked 3 times by pure luck, and 2 jobs are still missing...
@billyzs The video part is really more about your educational experience. Without the visualization of how the policies behave while you are training, it can be difficult to judge how well or how badly things are working. From the plots, you can get an idea that the training works well, but when we record the videos we are also running the policies for a longer time horizon than the one that was set during training. So, this is also a bit of a test of generalization.
I would encourage you to run the experiments with the -vr argument to generate the videos. You could also create the videos from the saved model weights, but then you would have to figure out how to add those files to the corresponding experiments in Weights & Biases.
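If you do go the local route, something along these lines might work; this is only a minimal sketch, assuming a Gymnasium-style environment and the wandb Python API. The environment id, the W&B project and run id, and the random-action stand-in are illustrative placeholders, not the exercise's actual code.

```python
# Rough sketch: record a rollout video locally, then attach it to an existing
# Weights & Biases run. Environment id, project/run id, and the policy call are
# placeholders -- substitute the exercise's actual environment and saved weights.
import gymnasium as gym
from gymnasium.wrappers import RecordVideo
import wandb

# Wrap the environment so the episode is written to ./videos as an .mp4 file.
env = RecordVideo(
    gym.make("CartPole-v1", render_mode="rgb_array"),  # placeholder env id
    video_folder="videos",
    episode_trigger=lambda episode_id: True,
)

obs, _ = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # placeholder: act with your loaded policy here
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()

# Resume the existing W&B run by its id so the video lands in that experiment
# instead of creating a new run.
run = wandb.init(project="rl-exercise", id="your-run-id", resume="must")  # placeholders
# File name below follows RecordVideo's default naming scheme.
run.log({"rollout_video": wandb.Video("videos/rl-video-episode-0.mp4", fps=30, format="mp4")})
run.finish()
```

Note that resume="must" makes wandb fail loudly if the run id is wrong, rather than silently starting a new run.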
If time constraints don't allow you to run all the experiments with videos, keep the links of the experiments that you already have. Those will get you most of the points.
@johannesg98 As suggested, try to run things locally. Also, since you have tried running so many experiments, and each experiment generates lots of files depending on when it crashes, remember to check the amount of disk space you have available using the command lquota. A full disk might also cause jobs to crash.
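To see which leftover outputs are actually eating your quota, a small sketch like the following can help; the "runs" directory name is a guess, so point it at wherever the exercise writes its outputs.

```python
# Quick sketch: list experiment output directories by size so leftovers from
# crashed runs are easy to spot and delete. "runs" is a placeholder path.
from pathlib import Path

def dir_size_mb(path: Path) -> float:
    """Total size of all regular files under `path`, in megabytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

outputs = Path("runs")  # placeholder: point this at the experiment output folder
for directory, size in sorted(
    ((d, dir_size_mb(d)) for d in outputs.iterdir() if d.is_dir()),
    key=lambda item: item[1],
    reverse=True,
):
    print(f"{size:10.1f} MB  {directory}")
```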
Hi, I cannot get any job to finish on the cluster. I tried both conda and venv, and I tried not running any tasks in parallel.
They crash quite early for me (within the first hour of training) with the error message
/usr/bin/xvfb-run: line 186: kill: (xxxxxx) - No such process
in the xxx_reward.err file. This is mentioned in the closed issue #8; however, there isn't really a solution other than "try again". This can't be the goal of a graded exercise where one subtask takes hours.
All my rewards pass on GitHub.
It's pretty frustrating
Best Matthias