geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

process_rsmas.py fails on XSEDE Comet #323

Closed Ovec8hkin closed 4 years ago

Ovec8hkin commented 4 years ago

When attempting to run the test code for process_rsmas.py on the SDSC Comet HPC cluster, the initial process_rsmas job doesn't get submitted properly as far as I can tell (no jobs are shown when running squeue -u $USER).

Below is what is printed to console:

*************** Template Options ****************
Custom Template File:  /home/jaz101/environments/dev/rsmas_insar/samples/GalapagosSenDT128.template
Project Name:  GalapagosSenDT128
Work Dir:  /oasis/scratch/comet/jaz101/temp_project/GalapagosSenDT128
1-->1
-1 -0.6 -91.4 -90.86-->'-1 -0.6 -91.4 -90.86'
auto-->auto
None-->None
1-->1
-1 -0.6 -91.4 -90.86-->'-1 -0.6 -91.4 -90.86'
template file exists: /oasis/scratch/comet/jaz101/temp_project/GalapagosSenDT128/GalapagosSenDT128.template, no updates
Run routine processing with process_rsmas.py on steps: ['download', 'dem', 'ifgrams', 'timeseries', 'insarmaps', 'imageProducts']
--------------------------------------------------
20200325:175016 * ##### NEW RUN #####
20200325:175016 * process_rsmas.py /home/jaz101/environments/dev/rsmas_insar/samples/GalapagosSenDT128.template --submit
error code 1 b''
process_rsmas.job submitted as SLURM job #rsmas.job99999
rsmas.job99999  #This is not the right job number

Total time: 00 mins 0.6 secs

Obviously rsmas.job99999 is not the correct job number for the job.

I believe this is tied to some code in minsar.job_submission, specifically the submit_single_job function. The following block of code looks suspicious to me:

elif scheduler == 'SLURM':
    hostname = subprocess.Popen("hostname", shell=True, stdout=subprocess.PIPE).stdout.read().decode("utf-8")
    if hostname.startswith('login'):
        command = "sbatch {}".format(os.path.join(work_dir, job_file_name))
    else:
        job_num = '{}99999'.format(job_file_name.split('_')[1])
        command = "srun {} > {} 2>{} ".format(os.path.join(work_dir, job_file_name),
                                              os.path.join(work_dir, job_file_name.split('.')[0] +
                                                           '_{}.o'.format(job_num)),
                                              os.path.join(work_dir, job_file_name.split('.')[0] +
                                                           '_{}.e'.format(job_num)))

mirzaees commented 4 years ago

Thank you, @Ovec8hkin. Yes, that probably won't work on Comet. I am separating the job submission schemes by platform; give me some time to fix it.
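
One possible way to separate the sbatch and srun cases without relying on the hostname prefix is sketched below. This is only an illustration, not the code in minsar.job_submission: the helper name build_submit_command is hypothetical, and the sketch assumes that the SLURM_JOB_ID environment variable is set only inside a SLURM allocation.

```python
import os


def build_submit_command(job_path, environ=None):
    """Return the shell command used to launch job_path under SLURM.

    Hypothetical helper: it checks SLURM_JOB_ID (assumed to be set only
    inside a SLURM allocation) instead of testing the hostname prefix.
    """
    environ = os.environ if environ is None else environ
    if "SLURM_JOB_ID" in environ:
        # Already inside an allocation (compute node): run the script directly.
        return "srun {}".format(job_path)
    # Not inside an allocation (for example a login node): queue the job.
    return "sbatch {}".format(job_path)
```

Passing environ explicitly makes the choice easy to test on any machine, including ones whose login nodes are not named login*.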

Ovec8hkin commented 4 years ago

Ok. I figured it was just a different submission scheme that hadn't been tested yet. This is a low-priority issue.

Ovec8hkin commented 4 years ago

I think we can fix this problem for all possible schedulers with a simple regex that pulls out the job number: \d+. Used with re.search, it matches the first run of digits in the output string.

I propose replacing this code in job_submission.submit_single_job():

    if scheduler == "LSF":
        # works for 'Job <19490923> is submitted to queue <general>.\n'
        job_number = output.decode("utf-8").split("\n")[1].split("<")[1].split(">")[0]
    elif scheduler == "PBS":
        # extracts number from '7319.eos\n'
        # job_number = output.decode("utf-8").split("\n")[0].split(".")[0]
        # uses '7319.eos\n'
        job_number = output.decode("utf-8").split("\n")[0]
    elif scheduler == 'SLURM':
        try:
            job_number = str(output).split("\\n")[-2].split(' ')[-1]
        except:
            job_number = job_num

with this:

import re
# output is bytes (see the .decode() calls above), so decode before matching
job_number = re.search(r'\d+', output.decode("utf-8")).group(0)

I can probably test this soon.
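
As a rough sketch of how the regex approach behaves on each scheduler's output, the snippet below wraps it in a small function. The name extract_job_number and the fallback parameter are hypothetical, not part of the repository. The LSF and PBS strings come from the comments in the code above, and the SLURM string assumes the usual "Submitted batch job <id>" message printed by sbatch.

```python
import re


def extract_job_number(output, fallback=None):
    """Return the first run of digits in scheduler output, or fallback."""
    # subprocess output is bytes; decode it so the str pattern can match.
    text = output.decode("utf-8") if isinstance(output, bytes) else output
    match = re.search(r"\d+", text)
    # re.search returns None when there are no digits (e.g. empty output).
    return match.group(0) if match else fallback


print(extract_job_number(b"Job <19490923> is submitted to queue <general>.\n"))  # 19490923
print(extract_job_number(b"7319.eos\n"))                                         # 7319
print(extract_job_number(b"Submitted batch job 123456\n"))                       # 123456
print(extract_job_number(b"", fallback="99999"))                                 # 99999
```

Note that for PBS this yields 7319 rather than the full 7319.eos that the current code keeps, and the fallback covers the case where the command prints no job number at all.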

mirzaees commented 4 years ago

I used a regex as @Ovec8hkin suggested. It works for almost all cases, except when you are on a compute node and call srun instead of sbatch. That command is very rarely used, and when it is, job_number = 99999 is used only for checking outputs.