Digital-Humans-23 / a2


Job starting time #5

Open Pablo-Paniagua opened 1 year ago

Pablo-Paniagua commented 1 year ago

How long are we to wait, under normal circumstances, for our job to start running in the cluster (given the provided default configuration is used)?

EDIT [To summarize and close]:

Expected time for job start

Job start times on the cluster appear to vary widely, probably depending on the time of day and the load. I have experienced anything between 3 minutes and 4-5 hours (submitting at 8 am vs. 5 pm). Submitting jobs in the morning seems to help them start sooner.

Running multiple jobs in parallel

To reduce the time needed to run tests, you can run multiple jobs in parallel on the cluster. This is very useful in exercise 2. To do this, create a copy of the conf .json file of interest and change the values as requested in the README.md. In the jobs folder on the cluster (or locally and through git), also create a new job file and change it to take the new configuration file as one of its inputs. Start that job as you would any other. The cluster will queue the jobs, and you can check them as individual runs there and in W&B.

I have personally gotten 3 jobs to run at the same time, but I am sure more is possible.

Note: Do not forget to also change the error and output file names when running multiple jobs in parallel, to avoid the jobs overwriting each other's files during execution.
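For anyone who prefers to script the steps above, here is a minimal sketch of how the config copy and job file could be generated together. Everything specific in it is an assumption and not taken from the course repo: the job template, the `train.py --config` entry point, the file names, and the override key are hypothetical placeholders, so adapt them to the actual job files. Only the `#SBATCH --job-name`, `--output` and `--error` directives are standard sbatch options.

```python
import json
from pathlib import Path

# Hypothetical job template: the real job files in the repo will look different.
# Each job gets its own output and error file so parallel runs do not overwrite each other.
JOB_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --output={name}.out
#SBATCH --error={name}.err

python train.py --config {conf_path}
"""


def make_variant(base_conf, name, overrides, conf_dir, jobs_dir):
    """Copy base_conf with overrides applied and write a matching job file."""
    conf = json.loads(Path(base_conf).read_text())
    conf.update(overrides)  # replaces top-level keys only; nested values are not merged

    conf_path = Path(conf_dir) / f"{name}.json"
    conf_path.write_text(json.dumps(conf, indent=2))

    job_path = Path(jobs_dir) / f"{name}.sh"
    job_path.write_text(
        JOB_TEMPLATE.format(name=name, conf_path=conf_path.as_posix())
    )
    return conf_path, job_path
```

Calling `make_variant("conf/base.json", "ex2_1", {"some_key": 1}, "conf", "jobs")` (all names hypothetical) would leave the base config untouched and produce `conf/ex2_1.json` and `jobs/ex2_1.sh`, which can then be submitted like any other job.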

MiguelZamoraM commented 1 year ago

The maximum waiting time that I have experienced is around 90 minutes, but jobs usually start after approximately 3 minutes. This also depends on your priority (which in turn depends on how much you have used the cluster).

Please keep me posted on the waiting times that you are experiencing.

Pablo-Paniagua commented 1 year ago

45 minutes so far. Job not started yet. Will keep you posted.

JunTu-XD commented 1 year ago

Same issue. My jobs have been waiting for around 4 hours so far.

Pablo-Paniagua commented 1 year ago

It also took around 4 hours for my job to start running. Is there any way to get it to run sooner than that?

MiguelZamoraM commented 1 year ago

That is a new record. Could you try submitting more than one job? Some people have been able to submit up to 3 jobs. The waiting time is still long but at least you could run more experiments in parallel.

xuexianlim commented 1 year ago

When submitting my jobs in the morning, there doesn't seem to be much waiting time. I submitted jobs at night and they took more than an hour to start although that might be a one-off thing.

Pablo-Paniagua commented 1 year ago

> That is a new record. Could you try submitting more than one job? Some people have been able to submit up to 3 jobs. The waiting time is still long but at least you could run more experiments in parallel.

Do you have any suggested strategy to run multiple jobs in parallel in an efficient manner?

Pablo-Paniagua commented 1 year ago

> When submitting my jobs in the morning, there doesn't seem to be much waiting time. I submitted jobs at night and they took more than an hour to start although that might be a one-off thing.

I can confirm this. I just started a job (this morning) and the wait time was only 3 minutes.

Pablo-Paniagua commented 1 year ago

Would I need separate virtual environments (conda pylocoEnv2 and pylocoEnv3) to run the jobs in parallel?

JunTu-XD commented 1 year ago

No, just submit them via sbatch.
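If you want to submit several job files in one go, a small wrapper around `sbatch` is enough. This is only a sketch: the job file paths in the usage note are hypothetical, and `dry_run=True` just prints the commands so you can check them before anything is sent to the cluster.

```python
import subprocess


def build_sbatch_commands(job_files):
    """Return one 'sbatch <job file>' command per job file."""
    return [["sbatch", str(job)] for job in job_files]


def submit_all(job_files, dry_run=True):
    """Print the sbatch commands, or run them when dry_run is False."""
    commands = build_sbatch_commands(job_files)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if sbatch rejects the job
            subprocess.run(cmd, check=True)
    return commands
```

For example, `submit_all(["jobs/ex2_1.sh", "jobs/ex2_2.sh", "jobs/ex2_3.sh"], dry_run=False)` would queue three jobs from the same environment and the same checkout.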

Pablo-Paniagua commented 1 year ago

> No, just submit them via sbatch.

But to run ex2_1, ex2_2 and ex2_3 in parallel, I need separate repo checkouts because of the different conf files, right?

MiguelZamoraM commented 1 year ago

You can always create new config files and new job files for those experiments.

Pablo-Paniagua commented 1 year ago

> You can always create new config files and new job files for those experiments.

Of course. My bad.

Pablo-Paniagua commented 1 year ago

I can confirm that running 3 jobs in parallel is feasible. It works very well for tasks 2.1, 2.2 and 2.3.