Open Pablo-Paniagua opened 1 year ago
The longest wait I have experienced is around 90 minutes, but jobs usually start after about 3 minutes. This also depends on your priority, which in turn depends on your past usage.
Please, keep me posted on the waiting times that you are experiencing.
45 minutes so far. Job not started yet. Will keep you posted.
Same issue. My jobs have been waiting for around 4 hours until now.
Took also around 4 hours for my job to start running. Any way to aid in getting it to run sooner than that?
That is a new record. Could you try submitting more than one job? Some people have been able to submit up to 3 jobs. The waiting time is still long but at least you could run more experiments in parallel.
When submitting my jobs in the morning, there doesn't seem to be much waiting time. I submitted jobs at night and they took more than an hour to start although that might be a one-off thing.
Do you have any suggested strategy to run multiple jobs in parallel in an efficient manner?
I can confirm this. I just started a job (this morning) and the wait time was only 3 minutes.
Would I need separate virtual environments (conda pylocoEnv2 and pylocoEnv3) to run the jobs in parallel?
No, just submit them via sbatch.
But for running ex2_1, ex2_2 and ex2_3 in parallel I need separate repo checkouts due to the different conf files right?
You can always create new config files and new job files for those experiments.
OC. My bad.
I can confirm that 3 jobs are feasible to run in parallel. Works very well for tasks 2.1, 2.2 and 2.3
How long are we to wait, under normal circumstances, for our job to start running in the cluster (given the provided default configuration is used)?
EDIT [To summarize and close]:
Expected time for job start
The time a job waits before starting varies widely, apparently depending on the time of day and the cluster load. I have experienced anything between 3 minutes and 4-5 hours (submitting at 8 am vs. 5 pm). Submitting jobs in the morning seems to help them start sooner.
Running multiple jobs in parallel
To reduce the time needed to run tests, you can run multiple jobs in parallel on the cluster. This is very useful in exercise 2. To do this, create a copy of the relevant conf .json file and change the values as requested in the README.md. In the jobs folder on the cluster (or locally and through git), also create a new job file and change it to take the new configuration file as input. Start that job as you would any other. The cluster will queue the jobs, and you can check them as individual runs there and in W&B. I have personally gotten 3 jobs to run at the same time, but I am sure more is possible.
Note: Do not forget to also change the error and output file names when running multiple jobs in parallel, so that the jobs do not overwrite each other's files during execution.
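The steps in this summary (copy the config, change the values, create a matching job file with its own output and error file names) could be scripted roughly as follows. This is only a sketch: the job template, the `python main.py --config ...` launch command, and the helper name `make_experiment` are assumptions for illustration and are not taken from the course repo, so adapt them to the job file you already use. The `%j` in the file names is the Slurm pattern that expands to the job ID.

```python
import json
from pathlib import Path

# Hypothetical Slurm job template (assumption): the real job file in the
# course repo may use different directives and a different launch command.
JOB_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --output={name}_%j.out
#SBATCH --error={name}_%j.err

# Placeholder launch command (assumption): replace it with the command
# from the job file you already use.
python main.py --config {config_path}
"""


def make_experiment(base_config, name, overrides, conf_dir, jobs_dir):
    """Write a modified copy of base_config and a matching job file."""
    conf_dir, jobs_dir = Path(conf_dir), Path(jobs_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    jobs_dir.mkdir(parents=True, exist_ok=True)

    # Load the base configuration and apply the per-experiment values.
    config = json.loads(Path(base_config).read_text())
    config.update(overrides)  # shallow update: top-level keys only

    config_path = conf_dir / f"{name}.json"
    config_path.write_text(json.dumps(config, indent=2))

    # Each job gets its own name, so output and error files never collide.
    job_path = jobs_dir / f"{name}.sh"
    job_path.write_text(
        JOB_TEMPLATE.format(name=name, config_path=config_path.as_posix())
    )
    return config_path, job_path
```

Each generated job file can then be submitted separately with `sbatch`, for example `sbatch jobs/ex2_1.sh`, and the cluster queues them as individual jobs.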