NREL / reV

Renewable Energy Potential (reV) Model
https://nrel.github.io/reV/
BSD 3-Clause "New" or "Revised" License

QA-QC for exclusion layer cannot be submitted, but no stdout log created #191

Closed nickwg03 closed 4 years ago

nickwg03 commented 4 years ago

Bug Description

I'm trying to run the exclusion layer QA-QC module, but the submission hangs for a while and then logs a message (in the .log file) that the job could not be kicked off. The message refers the user to the stdout file, but no stdout file is created for the job.

I've had trouble getting the exclusion layer QA-QC module to run in the past; it usually takes a few attempts (especially when running several jobs at the same time via the batch approach). I notice that the "name" is always 'reV' when running the exclusion layer QA-QC, so I wonder whether the duplicate name is causing it to hang.

Full Traceback

INFO - 2020-08-06 09:30:08,931 [cli_qa_qc.py:372] : Running reV supply curve from config file: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/config_qa-qc.json"
INFO - 2020-08-06 09:30:08,931 [cli_qa_qc.py:372] : Running reV supply curve from config file: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/config_qa-qc.json"
INFO - 2020-08-06 09:30:08,932 [cli_qa_qc.py:373] : Target output directory: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/"
INFO - 2020-08-06 09:30:08,932 [cli_qa_qc.py:373] : Target output directory: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/"
INFO - 2020-08-06 09:30:08,932 [cli_qa_qc.py:374] : Target logging directory: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/logs/"
INFO - 2020-08-06 09:30:08,932 [cli_qa_qc.py:374] : Target logging directory: "/lustre/eaglefs/shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0/logs/"
INFO - 2020-08-06 09:30:08,933 [qa_qc_config.py:240] : QA/QC using the following pipeline input for excl_fpath: /shared-projects/rev/projects/heco/data/exclusions/HI_Exclusions.h5
INFO - 2020-08-06 09:30:08,934 [qa_qc_config.py:263] : QA/QC using the following pipeline input for excl_dict: {'dod': {'exclude_values': [1]}, 'fedland': {'exclude_values': [1, 2, 3, 4, 5]}, 'floodzones': {'exclude_values': [1]}, 'hi_gadm_adm1': {'include_values': [57]}, 'hi_gadm_adm2': {'exclude_values': [2693]}, 'hi_gadm_adm2_island_id': {'include_values': [1, 2, 3, 4, 5]}, 'impag': {'exclude_values': [1]}, 'inclusion_1_3_custom_urban_exclusions': {'exclude_values': [1, 2, 3]}, 'lavaflow_kilauea': {'exclude_values': [1]}, 'lavazones': {'exclude_values': [1, 2]}, 'slope': {'inclusion_range': [0, 5]}, 'stateland': {'exclude_values': [2, 4, 5, 6, 7, 8, 9, 10]}, 'urban': {'exclude_values': [1]}, 'wetlands': {'exclude_values': [1, 2, 3, 4, 5, 6, 7]}}
INFO - 2020-08-06 09:30:08,935 [qa_qc_config.py:287] : QA/QC using the following pipeline input for area_filter_kernel: queen
INFO - 2020-08-06 09:30:08,936 [qa_qc_config.py:310] : QA/QC using the following pipeline input for min_area: 0.1
INFO - 2020-08-06 09:30:08,937 [cli_qa_qc.py:651] : Running reV QA-QC on SLURM with node name "run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0_QA-QC"
INFO - 2020-08-06 09:30:08,937 [cli_qa_qc.py:651] : Running reV QA-QC on SLURM with node name "run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0_QA-QC"
INFO - 2020-08-06 09:30:08,964 [cli_qa_qc.py:673] : Was unable to kick off reV QA-QC job "run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0_QA-QC". Please see the stdout error messages
INFO - 2020-08-06 09:30:08,964 [cli_qa_qc.py:673] : Was unable to kick off reV QA-QC job "run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0_QA-QC". Please see the stdout error messages

But then as noted above, no stdout file is created.
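When a submission fails before the scheduler ever starts the job, there is no job stdout file to look at; whatever explanation exists is in the output of the submit command itself. The sketch below shows one way to surface that output for debugging. It is not how reV submits jobs: `build_sbatch_cmd` and `submit` are hypothetical helper names, and the only assumption about SLURM is the standard `sbatch` options `--job-name` and `--wrap`.

```python
import subprocess


def build_sbatch_cmd(job_name, command):
    """Hypothetical helper: assemble an sbatch call that wraps a shell command."""
    return ["sbatch", f"--job-name={job_name}", f"--wrap={command}"]


def submit(cmd):
    """Run a submit command and return (ok, message).

    On failure the message is the submitter's own stderr (falling back to
    stdout), which is where a rejected submission explains itself.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True, result.stdout.strip()
    return False, (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    # Illustrative values only; requires sbatch on the PATH.
    ok, message = submit(build_sbatch_cmd("example_QA-QC", "echo hello"))
    print("submitted" if ok else "submission failed", "-", message)
```

Logging `message` next to the "unable to kick off" line would put the scheduler's reason directly in the .log file.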

To Reproduce

Check outputs at /shared-projects/rev/projects/heco/rev/agg_batched/run_pv_fixed_afk0_ma01_gf0_pd36_rcd0_rf0_td0_fcr0052_ay0_ed0 on Eagle.

Charge code WTSA.11658.01.01.01

MRossol commented 4 years ago

I think this might be an Eagle bug. I can't get jobs to submit either. I have a ticket in. Also, what is up with that directory name!??!

nickwg03 commented 4 years ago

@MRossol it's a batch run, which is why the dirname is so long.

I was doing fine up until the QA-QC stage. I had run aggregation, supply curve, and rep profiles; QA-QC is where it fails, but not always. When I run a single one of the batch runs, it sometimes works and sometimes doesn't.

MRossol commented 4 years ago

@nickwg03 it doesn't look like you included qa-qc in the batch config:

"files": [
    "./config_aggregation.json",
    "./config_supply-curve.json",
    "./config_rep-profiles.json"
],

nickwg03 commented 4 years ago

The issue was that multiple jobs were using the same job name.

@MRossol fixed the QAQC module to use the appropriate name for each job.
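
For readers hitting a similar collision, here is a minimal sketch of deriving a distinct job name per batch run from its run directory. It is an illustration only, not the actual reV fix: `make_job_name` and its `max_len` cap are hypothetical, and the directory names in the usage lines are made up.

```python
import hashlib
from pathlib import Path


def make_job_name(run_dir, module, max_len=64):
    """Hypothetical helper: build a per-run job name from the run directory.

    The name is "<directory name>_<module>". If that exceeds max_len, it is
    truncated and a short hash of the full name is appended so that two long
    names sharing a prefix still differ.
    """
    base = f"{Path(run_dir).name}_{module}"
    if len(base) <= max_len:
        return base
    digest = hashlib.sha1(base.encode()).hexdigest()[:8]
    return f"{base[:max_len - 9]}_{digest}"


if __name__ == "__main__":
    # Made-up run directories standing in for batch sub-directories.
    for run_dir in ["/tmp/batch/run_a/", "/tmp/batch/run_b/"]:
        print(make_job_name(run_dir, "QA-QC"))
```

Because each batch run lives in its own directory, keying the name on the directory keeps simultaneous submissions from sharing a name.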