E3SM-Project / pace

Performance Analytics for Computational Experiments
https://pace.ornl.gov
BSD 3-Clause "New" or "Revised" License

Debug PACE upload issue with WAV experiments #118

Closed sarats closed 2 years ago

sarats commented 2 years ago

* Parsing e3sm_timing file: e3sm_timing.20220614.WCYCL1850-WW3.ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2.anvil.571140.220616-165247.gz
     -Complete
* Parsing README.docs file : README.case.571140.220616-165247.gz
[ERROR]: list index out of range in file README.case.571140.220616-165247.gz
* Parsing model timing file :timing.571140.220616-165247.tar.gz
[ERROR]: 'NoneType' object has no attribute 'expid'
[ERROR]: Other error during upload

Raw data at https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.sarat/sbrus.tgz

cc @sbrus89

sarats commented 2 years ago

README

2022-06-14 16:46:11: ./create_newcase --case /home/sbrus/run/E3SM_cases/20220614.WCYCL1850-WW3.ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2.anvil --compset WCYCL1850-WW3 --res ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2 --compiler intel --mach anvil --verbose

Check whether any of the string vars overflow the column sizes in the DB schema?

For ref:

+----------------------------+------------------------+------+-----+---------+----------------+
| Field                      | Type                   | Null | Key | Default | Extra          |
+----------------------------+------------------------+------+-----+---------+----------------+
| expid                      | int(10) unsigned       | NO   | PRI | NULL    | auto_increment |
| case                       | varchar(200)           | NO   | MUL | NULL    |                |
| lid                        | varchar(50)            | NO   | MUL | NULL    |                |
| machine                    | varchar(25)            | NO   | MUL | NULL    |                |
| caseroot                   | varchar(250)           | NO   |     | NULL    |                |
| timeroot                   | varchar(250)           | NO   |     | NULL    |                |
| user                       | varchar(25)            | NO   | MUL | NULL    |                |
| exp_date                   | datetime               | NO   | MUL | NULL    |                |
| upload_date                | datetime               | YES  | MUL | NULL    |                |
| long_res                   | varchar(200)           | NO   |     | NULL    |                |
| res                        | varchar(100)           | NO   | MUL | NULL    |                |
| compset                    | varchar(100)           | NO   | MUL | NULL    |                |
| long_compset               | varchar(200)           | NO   |     | NULL    |                |
| stop_option                | varchar(25)            | NO   |     | NULL    |                |
| stop_n                     | int(10) unsigned       | NO   | MUL | NULL    |                |
| run_length                 | int(10) unsigned       | NO   | MUL | NULL    |                |
| total_pes_active           | int(10) unsigned       | NO   | MUL | NULL    |                |
| mpi_tasks_per_node         | int(10) unsigned       | NO   |     | NULL    |                |
| pe_count_for_cost_estimate | int(10) unsigned       | NO   |     | NULL    |                |
| model_cost                 | decimal(20,2) unsigned | NO   | MUL | NULL    |                |
| model_throughput           | decimal(20,2) unsigned | NO   | MUL | NULL    |                |
| actual_ocn_init_wait_time  | decimal(10,3) unsigned | NO   |     | NULL    |                |
| init_time                  | decimal(10,3) unsigned | NO   | MUL | NULL    |                |
| run_time                   | decimal(20,3) unsigned | NO   | MUL | NULL    |                |
| final_time                 | decimal(10,3) unsigned | NO   | MUL | NULL    |                |
| version                    | varchar(100)           | NO   | MUL | NULL    |                |
| upload_by                  | varchar(25)            | NO   |     | sarat   |                |
| case_group                 | varchar(200)           | YES  |     | NULL    |                |
| compiler                   | varchar(20)            | YES  |     | NULL    |                |
| mpilib                     | varchar(20)            | YES  |     | NULL    |                |
+----------------------------+------------------------+------+-----+---------+----------------+
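
One way to run that check is sketched below. It is only a sketch: the limits are copied from the varchar columns in the table above, while the function name find_overflows and the dict-shaped record are assumptions for illustration, not part of PACE.

```python
# Column limits copied from the varchar fields in the schema listing above.
VARCHAR_LIMITS = {
    "case": 200, "lid": 50, "machine": 25, "caseroot": 250, "timeroot": 250,
    "user": 25, "long_res": 200, "res": 100, "compset": 100,
    "long_compset": 200, "stop_option": 25, "version": 100,
    "upload_by": 25, "case_group": 200, "compiler": 20, "mpilib": 20,
}


def find_overflows(record):
    """Return {field: (length, limit)} for string values that exceed their column.

    Hypothetical helper: `record` is assumed to be a dict keyed by column name.
    """
    overflows = {}
    for field, limit in VARCHAR_LIMITS.items():
        value = record.get(field)
        if isinstance(value, str) and len(value) > limit:
            overflows[field] = (len(value), limit)
    return overflows


# Values taken from the create_newcase line in the README above.
record = {
    "case": "20220614.WCYCL1850-WW3.ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2.anvil",
    "res": "ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2",
    "compset": "WCYCL1850-WW3",
    "machine": "anvil",
    "compiler": "intel",
}
print(find_overflows(record))
```

For the values visible in this README line the result is an empty dict, so none of those particular fields exceeds its column.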

gaurabkcutk commented 2 years ago

The index is out of range because the command ends with the --verbose flag, which has no value after it. The script tried to read a value for that flag and ran past the end of the argument list.

2022-06-14 16:46:11: ./create_newcase --case /home/sbrus/run/E3SM_cases/20220614.WCYCL1850-WW3.ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2.anvil --compset WCYCL1850-WW3 --res ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2 --compiler intel --mach anvil --verbose

I have made a quick fix that checks the index is in bounds before reading the value; see the gaurab/issue118 PR.
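
For illustration, a bounds-checked parse of that line could look roughly like the sketch below. This is not the code in the PR; parse_create_newcase is a hypothetical helper, and treating a flag with no following value as a boolean switch is an assumption.

```python
def parse_create_newcase(line):
    """Collect --flag/value pairs from a create_newcase command line.

    Hypothetical helper, not PACE's actual README parser.
    """
    tokens = line.split()
    options = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            name = token[2:]
            nxt = i + 1
            # Read a value only if a next token exists and is not another flag.
            if nxt < len(tokens) and not tokens[nxt].startswith("--"):
                options[name] = tokens[nxt]
                i += 2
                continue
            # Value-less switch such as --verbose: record it and move on.
            options[name] = True
        i += 1
    return options


line = ("2022-06-14 16:46:11: ./create_newcase --case /home/sbrus/run/E3SM_cases/"
        "20220614.WCYCL1850-WW3.ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2.anvil "
        "--compset WCYCL1850-WW3 --res ne30pg2_EC30to60E2r2_wQU225EC30to60E2r2 "
        "--compiler intel --mach anvil --verbose")
opts = parse_create_newcase(line)
print(opts["mach"], opts["verbose"])
```

The point of the guard is that a trailing switch is stored as True instead of triggering an IndexError when the parser looks one token ahead.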

sbrus89 commented 2 years ago

@gaurabkcutk, this makes sense. I don't think I started using the --verbose flag until recently, which is when I noticed my runs weren't being picked up by PACE. Will PACE go back and upload these past runs once this fix is in place, or is there something I need to do to initiate the upload? I appreciate the help on this!

gaurabkcutk commented 2 years ago

@sbrus89 They should be picked up automatically on the next upload run. @sarats, correct me if I misspoke here.

sbrus89 commented 2 years ago

Perfect, thanks!

Out of curiosity, how often are the upload runs?

gaurabkcutk commented 2 years ago

I believe it's once per day.

sarats commented 2 years ago

@sbrus89 We need to manually upload the older experiments if you need them. The nightly upload will pick up all the new stuff automatically once the fix is in.

This month so far, I see

In /lcrc/group/e3sm/OLD_PERF
$ find 2022-06 -iname "sbrus"
2022-06/performance_archive_anvil_e3sm_2022_06_19_00_15_15/sbrus
2022-06/performance_archive_anvil_e3sm_2022_06_03_00_15_20/sbrus
2022-06/performance_archive_anvil_e3sm_2022_06_18_00_15_18/sbrus

Manual upload

cd /lcrc/group/e3sm/performance_archive
./pace-upload3 -ed /lcrc/group/e3sm/OLD_PERF/2022-06/performance_archive_anvil_e3sm_2022_06_19_00_15_15/sbrus
Repeat for other directories.
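
The repeat step can be scripted. The sketch below only builds the commands; it assumes the directory layout shown in the find output above, and upload_commands is a hypothetical helper name. Unlike find -iname, the match here is case-sensitive.

```python
from pathlib import Path


def upload_commands(archive_root, month, user, uploader="./pace-upload3"):
    """Build one manual-upload command per archived directory named `user`.

    Hypothetical helper: it returns the commands and does not run them.
    """
    month_dir = Path(archive_root) / month
    matches = sorted(p for p in month_dir.rglob(user) if p.is_dir())
    return [[uploader, "-ed", str(path)] for path in matches]


if __name__ == "__main__":
    for cmd in upload_commands("/lcrc/group/e3sm/OLD_PERF", "2022-06", "sbrus"):
        print(" ".join(cmd))
```

Each printed command is meant to be run from /lcrc/group/e3sm/performance_archive, as in the manual example above.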

sarats commented 2 years ago

@sbrus89 I deployed the interim fix and uploaded your old exps from this month.

They are experiment IDs 109851 to 109856.

You can view each one at its exp-details page, e.g. https://pace.ornl.gov/exp-details/109851.