I don't get it. How is this related to OUU?
The issue occurs when running OUU. The problem is that the current implementation tosses out all completed runs in an iteration if even one of them results in a failure, wasting a considerable amount of CPU cycles/time. The goal of this issue is to come up with a way to use the successful runs and, on a case-by-case basis, generate more runs when there are failures. I had an email exchange with @candcook and she mentioned that this is possible, but changes would need to happen under the hood to support it. I plan to discuss the details at the next dev meeting.
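As a sketch of the direction this could take (the function names, the retry policy, and the `resample` hook are hypothetical, not existing FOQUS API), the iteration loop would keep successful results and only replace the failed samples:

```python
import numpy as np

PENALTY = 1000.0  # value the current implementation assigns when any run fails


def evaluate_iteration(samples, run_job, resample, max_retries=2):
    """Keep successful runs; retry or resample only the failed ones.

    `run_job(x) -> (objective, ok)` and `resample(x) -> new_x` are
    placeholders for whatever the simulation and sampler layers provide.
    """
    results = []
    for x in samples:
        obj, ok = run_job(x)
        retries = 0
        while not ok and retries < max_retries:
            x = resample(x)      # draw a replacement for the failed point
            obj, ok = run_job(x)
            retries += 1
        # Penalize only the sample that never converged, not the whole batch.
        results.append(obj if ok else PENALTY)
    return float(np.mean(results))
```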
Got it. Thanks for clarifying, @CSRussell2319
This is an example of one of the errors I saw that causes an entire iteration to be thrown out: if there are 99 runs and one of them fails with this error, the objective value reported to the optimization algorithm is set to 1000. The error message is at the bottom of the job result.
{ "graphError" : { "N" : "1" }, "input" : { "M" : { "BFB" : { "M" : { "BFBadsB.Cr" : { "N" : "1" }, "BFBadsB.Dt" : { "N" : "11.897" }, "BFBadsB.dx" : { "N" : "0.0127" }, "BFBadsB.Lb" : { "N" : "2.085" }, "BFBadsM.Cr" : { "N" : "1" }, "BFBadsM.Dt" : { "N" : "15" }, "BFBadsM.dx" : { "N" : "0.06695" }, "BFBadsM.Lb" : { "N" : "1.972" }, "BFBadsT.Cr" : { "N" : "1" }, "BFBadsT.Dt" : { "N" : "15" }, "BFBadsT.dx" : { "N" : "0.06239700000000001" }, "BFBadsT.Lb" : { "N" : "2.2029999999999994" }, "BFBRGN.Cr" : { "N" : "1" }, "BFBRGN.Dt" : { "N" : "9.041" }, "BFBRGN.Lb" : { "N" : "8.886" }, "BFBRGNTop.Cr" : { "N" : "1" }, "BFBRGNTop.Dt" : { "N" : "9.195" }, "BFBRGNTop.Lb" : { "N" : "7.1926" }, "dp" : { "N" : "0.00018" }, "fg_flow" : { "N" : "100377" }, "GHXfg.A_exch" : { "N" : "16358" }, "GHXfg.GasIn.P" : { "N" : "1.01325" }, "GHXfg.GasIn.T" : { "N" : "54" }, "Kd" : { "N" : "100" } } }, "graph" : { "M" : { } } } }, "nodeError" : { "M" : { "BFB" : { "N" : "21" } } }, "nodeSettings" : { "M" : { "BFB" : { "M" : { "Allow Simulation Warnings" : { "BOOL" : true }, "homotopy" : { "N" : "0" }, "Initialize Model" : { "BOOL" : false }, "Max consumer reuse" : { "N" : "90" }, "Max Status Check Interval" : { "N" : "5" }, "Maximum Run Time (s)" : { "N" : "840" }, "Maximum Wait Time (s)" : { "N" : "1440" }, "Min Status Check Interval" : { "N" : "4" }, "MinStepSize" : { "N" : "0.001" }, "Override Turbine Configuration" : { "S" : "NULL" }, "printlevel" : { "N" : "0" }, "Reset" : { "BOOL" : false }, "Reset on Fail" : { "BOOL" : true }, "Retry" : { "BOOL" : false }, "RunMode" : { "S" : "Steady State" }, "Script" : { "S" : "NULL" }, "Snapshot" : { "S" : "NULL" }, "TimeSeries" : { "L" : [ { "N" : "0" } ] }, "TimeUnits" : { "S" : "Hours" }, "Visible" : { "BOOL" : false } } } } }, "output" : { "M" : { "BFB" : { "M" : { "BFBRGN.GasIn.F" : { "N" : "1357.0814214980398" }, "Cost_ads" : { "N" : "34830765.96283022" }, "Cost_aux_power" : { "N" : "23378.969592384507" }, "Cost_coe" : { "N" : "138.85964658190665" }, "Cost_coe_obj" : { "N" : "138.85964658190665" }, "Cost_op_cooling_water" : { "N" : "12577963.846287617" }, "Cost_op_cooling_water_flow" : { "N" : "2927442.0442882474" }, "Cost_op_fixed" : { "N" : "54526718.57072218" }, "Cost_op_var" : { "N" : "172437413.51951027" }, "Cost_rgn" : { "N" : "16321781.413333291" }, "Cost_shx" : { "N" : "39020395.75661277" }, "Cost_steam_power" : { "N" : "179203.44430650023" }, "Cost_steam_tot" : { "N" : "696035.1169336186" }, "Cost_toc" : { "N" : "2111440309.1504824" }, "Cost_toc_sorb" : { "N" : "83874472.31479213" }, "F_solids" : { "N" : "12681737.021801885" }, "GHXfg.HXIn.F" : { "N" : "135788.8395015208" }, "removalCO2" : { "N" : "0.8999964373466817" }, "removalCO2_slack" : { "N" : "0" }, "SHX.CWFR" : { "N" : "26890.328479471482" }, "SHX.LeanIn.T" : { "N" : "146.3820219819483" }, "SHX.LeanOut.T" : { "N" : "70.57864346590641" }, "SHX.RichIn.T" : { "N" : "83.99312572585319" }, "SHX.RichOut.T" : { "N" : "170" }, "SHX.SteamFR" : { "N" : "37683.43278161038" }, "slugab_slack" : { "N" : "0" }, "slugam_slack" : { "N" : "0" }, "slugat_slack" : { "N" : "0" }, "slugrb_slack" : { "N" : "0" }, "slugrt_slack" : { "N" : "0" }, "status" : { "N" : "0" } } }, "graph" : { "M" : { "error" : { "N" : "1" } } } } }, "solTime" : { "N" : "402.2090001106262" }, "turbineMessages" : { "M" : { "BFB" : { "S" : "[\"event=setup,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab\", \"working directory setup finished\", \"sinter read setup finished\", \"event=running,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab\", \"sinter 
inputs sent, running simulation\", \"Real Run failed, runStatus=si_SIMULATION_ERROR\", \"event=error,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab,msg=\\\"Error: si_SIMULATION_ERROR: \\r,Starting run at 01:30:40\\r,A sub-group in the decomposition failed to solve\\r,Steady state solution failure\\r,Run terminated at 01:36:18\\r,,\\\"\"]" } } }}
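Note that the failure shows up in three places in this record: the top-level `graphError`, the per-node `nodeError` code, and the `si_SIMULATION_ERROR` text in `turbineMessages`. A small sketch of screening such a result (DynamoDB-style typed JSON) before it poisons the iteration:

```python
def job_failed(result: dict) -> bool:
    """Return True if a job result like the one above reports an error."""
    # Top-level flag: "graphError": {"N": "1"}
    if result.get("graphError", {}).get("N", "0") != "0":
        return True
    # Per-node error codes: "nodeError": {"M": {"BFB": {"N": "21"}}}
    nodes = result.get("nodeError", {}).get("M", {})
    return any(node.get("N", "0") != "0" for node in nodes.values())


# With the record above, job_failed(...) returns True (graphError is 1,
# and node "BFB" reports error code 21).
```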
The main issue here (one failed run causing an entire iteration to fail) can be addressed with statistical techniques that generate new samples based on the results of the previous run. This does not guarantee success, but it reduces computation time while trying to maintain a space-filling design. The behavior itself isn't a bug, but rather a self-check to ensure that the samples are actually space-filling.
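One common recipe for this (a sketch, not something currently in FOQUS) is greedy maximin augmentation: draw random candidates inside the variable bounds and, one at a time, keep the candidate farthest from every point already in the design, so the surviving successful runs plus the replacements stay roughly space-filling.

```python
import numpy as np


def augment_maximin(existing, n_new, bounds, n_candidates=5000, seed=None):
    """Greedy maximin augmentation: add `n_new` points to an existing design,
    each chosen to maximize its minimum distance to all points kept so far."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T  # bounds: (dim, 2) pairs of (low, high)
    design = [np.asarray(p, dtype=float) for p in existing]
    new_points = []
    for _ in range(n_new):
        cands = rng.uniform(lo, hi, size=(n_candidates, lo.size))
        # distance from each candidate to its nearest point in the current design
        dists = np.linalg.norm(
            cands[:, None, :] - np.asarray(design)[None, :, :], axis=2
        )
        best = cands[np.argmax(dists.min(axis=1))]
        design.append(best)
        new_points.append(best)
    return np.asarray(new_points)


# e.g. replace 5 failed samples in a 2-D design on [0, 1] x [0, 1]
survivors = np.random.default_rng(0).random((94, 2))
print(augment_maximin(survivors, 5, [(0, 1), (0, 1)]))
```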
Consistent convergence errors could indicate problems with the bounds set on the initial parameters, and they could still occur even with the new statistical method in place; resolving them requires user input. Following issue #232, there should be, at a minimum, an error notification that tells the user:

- why FOQUS cancelled the optimization (too many failed iterations),
- where the results are stored (a CSV file in the working directory),
- what needs to be done to fix the issue (adjust the bounds on the selected parameters), and
- how the stored data can inform those bound adjustments.
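For concreteness, a sketch of what such a notification could look like (the message wording, file name, and parameter names are illustrative only):

```python
class OptimizationCancelledError(Exception):
    """Sketch of the proposed user-facing notification (see #232)."""

    def __init__(self, n_failed, n_total, results_csv, suspect_params):
        super().__init__(
            f"Optimization cancelled: {n_failed} of {n_total} runs in the last "
            f"iteration failed to converge.\n"
            f"Completed results were saved to: {results_csv}\n"
            f"Suggested fix: adjust the bounds on {', '.join(suspect_params)}, "
            f"using the failed sample values recorded in the CSV as a guide."
        )


# hypothetical usage
print(OptimizationCancelledError(1, 99, "working_dir/ouu_results.csv",
                                 ["BFBadsB.Dt", "BFBRGN.Lb"]))
```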
As a long-term goal we should consider incorporating a new sampling method that can take historical runs and intelligently add new samples to maintain a space-filling design. I don't think anyone currently has time for this, so it will probably not happen.
Given that the long-term solution described above by @CSRussell2319 isn't going to be implemented anytime soon, we're lowering the priority on this and removing it from the Feb release board.
This is a subset of #676
no progress, outdated
Discussion (motivated by #321) on which types of errors are re-submittable and which are not. Can the sample size be automatically reduced by throwing out the few jobs that failed?
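To make the question concrete, a triage sketch (the "fatal" strings appear in the turbineMessages log above; the "transient" strings are assumptions about what other failures might look like):

```python
# Strings indicating the model cannot solve at this sample point;
# resubmitting the identical job would just fail again.
FATAL = ("si_SIMULATION_ERROR", "Steady state solution failure")

# Assumed transient conditions (queueing/timeout), worth resubmitting as-is.
TRANSIENT = ("Maximum Wait Time", "timeout")


def triage(turbine_messages: list[str]) -> str:
    text = " ".join(turbine_messages)
    if any(s in text for s in FATAL):
        return "drop"      # discard (or resample elsewhere) and shrink the batch
    if any(s in text for s in TRANSIENT):
        return "resubmit"  # likely an infrastructure hiccup; rerun the same sample
    return "resubmit"


print(triage(["Real Run failed, runStatus=si_SIMULATION_ERROR"]))  # -> drop
```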
@CSRussell2319 will follow up with colleagues on the viability of throwing failed jobs out, or on methods for confirming we still have an accurate distribution.