I don't get it. How is this related to OUU?
The issue occurs when running OUU. The problem is that the current implementation tosses out all completed runs in an iteration if even one of them results in a failure, wasting a considerable amount of CPU cycles/time. The goal of this issue is to come up with a way to use the successful runs and, on a case-by-case basis, generate more runs when there are failures. I had an email exchange with @candcook and she mentioned that this is possible, but changes would need to happen under the hood to support it. I plan to discuss the details at the next dev meeting.
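As a sketch of the direction this could take (the function names, the retry policy, and the `resample` hook are hypothetical, not existing FOQUS API), the iteration loop would keep successful results and only replace the failed samples:

```python
import numpy as np

PENALTY = 1000.0  # value the current implementation assigns when any run fails


def evaluate_iteration(samples, run_job, resample, max_retries=2):
    """Keep successful runs; retry or resample only the failed ones.

    `run_job(x) -> (objective, ok)` and `resample(x) -> new_x` are
    placeholders for whatever the simulation and sampler layers provide.
    """
    results = []
    for x in samples:
        obj, ok = run_job(x)
        retries = 0
        while not ok and retries < max_retries:
            x = resample(x)      # draw a replacement for the failed point
            obj, ok = run_job(x)
            retries += 1
        # Penalize only the sample that never converged, not the whole batch.
        results.append(obj if ok else PENALTY)
    return float(np.mean(results))
```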
Got it. Thanks for clarifying, @CSRussell2319
This is an example of one of the errors I saw that causes an entire iteration to be thrown out: if there are 99 runs and one of them fails with this error, the objective value reported to the optimization algorithm is set to 1000. The error message is at the bottom of the job result.
{ "graphError" : { "N" : "1" }, "input" : { "M" : { "BFB" : { "M" : { "BFBadsB.Cr" : { "N" : "1" }, "BFBadsB.Dt" : { "N" : "11.897" }, "BFBadsB.dx" : { "N" : "0.0127" }, "BFBadsB.Lb" : { "N" : "2.085" }, "BFBadsM.Cr" : { "N" : "1" }, "BFBadsM.Dt" : { "N" : "15" }, "BFBadsM.dx" : { "N" : "0.06695" }, "BFBadsM.Lb" : { "N" : "1.972" }, "BFBadsT.Cr" : { "N" : "1" }, "BFBadsT.Dt" : { "N" : "15" }, "BFBadsT.dx" : { "N" : "0.06239700000000001" }, "BFBadsT.Lb" : { "N" : "2.2029999999999994" }, "BFBRGN.Cr" : { "N" : "1" }, "BFBRGN.Dt" : { "N" : "9.041" }, "BFBRGN.Lb" : { "N" : "8.886" }, "BFBRGNTop.Cr" : { "N" : "1" }, "BFBRGNTop.Dt" : { "N" : "9.195" }, "BFBRGNTop.Lb" : { "N" : "7.1926" }, "dp" : { "N" : "0.00018" }, "fg_flow" : { "N" : "100377" }, "GHXfg.A_exch" : { "N" : "16358" }, "GHXfg.GasIn.P" : { "N" : "1.01325" }, "GHXfg.GasIn.T" : { "N" : "54" }, "Kd" : { "N" : "100" } } }, "graph" : { "M" : { } } } }, "nodeError" : { "M" : { "BFB" : { "N" : "21" } } }, "nodeSettings" : { "M" : { "BFB" : { "M" : { "Allow Simulation Warnings" : { "BOOL" : true }, "homotopy" : { "N" : "0" }, "Initialize Model" : { "BOOL" : false }, "Max consumer reuse" : { "N" : "90" }, "Max Status Check Interval" : { "N" : "5" }, "Maximum Run Time (s)" : { "N" : "840" }, "Maximum Wait Time (s)" : { "N" : "1440" }, "Min Status Check Interval" : { "N" : "4" }, "MinStepSize" : { "N" : "0.001" }, "Override Turbine Configuration" : { "S" : "NULL" }, "printlevel" : { "N" : "0" }, "Reset" : { "BOOL" : false }, "Reset on Fail" : { "BOOL" : true }, "Retry" : { "BOOL" : false }, "RunMode" : { "S" : "Steady State" }, "Script" : { "S" : "NULL" }, "Snapshot" : { "S" : "NULL" }, "TimeSeries" : { "L" : [ { "N" : "0" } ] }, "TimeUnits" : { "S" : "Hours" }, "Visible" : { "BOOL" : false } } } } }, "output" : { "M" : { "BFB" : { "M" : { "BFBRGN.GasIn.F" : { "N" : "1357.0814214980398" }, "Cost_ads" : { "N" : "34830765.96283022" }, "Cost_aux_power" : { "N" : "23378.969592384507" }, "Cost_coe" : { "N" : "138.85964658190665" }, "Cost_coe_obj" : { "N" : "138.85964658190665" }, "Cost_op_cooling_water" : { "N" : "12577963.846287617" }, "Cost_op_cooling_water_flow" : { "N" : "2927442.0442882474" }, "Cost_op_fixed" : { "N" : "54526718.57072218" }, "Cost_op_var" : { "N" : "172437413.51951027" }, "Cost_rgn" : { "N" : "16321781.413333291" }, "Cost_shx" : { "N" : "39020395.75661277" }, "Cost_steam_power" : { "N" : "179203.44430650023" }, "Cost_steam_tot" : { "N" : "696035.1169336186" }, "Cost_toc" : { "N" : "2111440309.1504824" }, "Cost_toc_sorb" : { "N" : "83874472.31479213" }, "F_solids" : { "N" : "12681737.021801885" }, "GHXfg.HXIn.F" : { "N" : "135788.8395015208" }, "removalCO2" : { "N" : "0.8999964373466817" }, "removalCO2_slack" : { "N" : "0" }, "SHX.CWFR" : { "N" : "26890.328479471482" }, "SHX.LeanIn.T" : { "N" : "146.3820219819483" }, "SHX.LeanOut.T" : { "N" : "70.57864346590641" }, "SHX.RichIn.T" : { "N" : "83.99312572585319" }, "SHX.RichOut.T" : { "N" : "170" }, "SHX.SteamFR" : { "N" : "37683.43278161038" }, "slugab_slack" : { "N" : "0" }, "slugam_slack" : { "N" : "0" }, "slugat_slack" : { "N" : "0" }, "slugrb_slack" : { "N" : "0" }, "slugrt_slack" : { "N" : "0" }, "status" : { "N" : "0" } } }, "graph" : { "M" : { "error" : { "N" : "1" } } } } }, "solTime" : { "N" : "402.2090001106262" }, "turbineMessages" : { "M" : { "BFB" : { "S" : "[\"event=setup,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab\", \"working directory setup finished\", \"sinter read setup finished\", \"event=running,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab\", \"sinter 
inputs sent, running simulation\", \"Real Run failed, runStatus=si_SIMULATION_ERROR\", \"event=error,consumer=1af3ba18-faf8-48e4-845d-adf527c82bab,msg=\\\"Error: si_SIMULATION_ERROR: \\r,Starting run at 01:30:40\\r,A sub-group in the decomposition failed to solve\\r,Steady state solution failure\\r,Run terminated at 01:36:18\\r,,\\\"\"]" } } }}
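Note that the failure shows up in three places in this record: the top-level `graphError`, the per-node `nodeError` code, and the `si_SIMULATION_ERROR` text in `turbineMessages`. A small sketch of screening such a result (DynamoDB-style typed JSON) before it poisons the iteration:

```python
def job_failed(result: dict) -> bool:
    """Return True if a job result like the one above reports an error."""
    # Top-level flag: "graphError": {"N": "1"}
    if result.get("graphError", {}).get("N", "0") != "0":
        return True
    # Per-node error codes: "nodeError": {"M": {"BFB": {"N": "21"}}}
    nodes = result.get("nodeError", {}).get("M", {})
    return any(node.get("N", "0") != "0" for node in nodes.values())


# With the record above, job_failed(...) returns True (graphError is 1,
# and node "BFB" reports error code 21).
```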
The main issue here (one failed run causing an entire iteration to fail) can be addressed with statistical techniques that generate new samples based on the results of the previous run. This does not guarantee success, but it reduces computation time while trying to maintain a space-filling design. The behavior itself isn't a bug, but rather a self-check to ensure that the samples are actually space-filling.
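One common recipe for this (a sketch, not something currently in FOQUS) is greedy maximin augmentation: draw random candidates inside the variable bounds and, one at a time, keep the candidate farthest from every point already in the design, so the surviving successful runs plus the replacements stay roughly space-filling.

```python
import numpy as np


def augment_maximin(existing, n_new, bounds, n_candidates=5000, seed=None):
    """Greedy maximin augmentation: add `n_new` points to an existing design,
    each chosen to maximize its minimum distance to all points kept so far."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T  # bounds: (dim, 2) pairs of (low, high)
    design = [np.asarray(p, dtype=float) for p in existing]
    new_points = []
    for _ in range(n_new):
        cands = rng.uniform(lo, hi, size=(n_candidates, lo.size))
        # distance from each candidate to its nearest point in the current design
        dists = np.linalg.norm(
            cands[:, None, :] - np.asarray(design)[None, :, :], axis=2
        )
        best = cands[np.argmax(dists.min(axis=1))]
        design.append(best)
        new_points.append(best)
    return np.asarray(new_points)


# e.g. replace 5 failed samples in a 2-D design on [0, 1] x [0, 1]
survivors = np.random.default_rng(0).random((94, 2))
print(augment_maximin(survivors, 5, [(0, 1), (0, 1)]))
```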
Consistent convergence errors could indicate problems with the bounds set on the initial parameters, and they could still occur even with the new statistical method in place; resolving them requires user input. Following issue #232, there should be, at a minimum, an error notification that tells the user:

- why FOQUS cancelled the optimization (too many failed iterations),
- where the results are stored (a CSV file in the working directory),
- what needs to be done to fix the issue (adjust the bounds on the selected parameters), and
- how the stored data can inform those bound adjustments.
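For concreteness, a sketch of what such a notification could look like (the message wording, file name, and parameter names are illustrative only):

```python
class OptimizationCancelledError(Exception):
    """Sketch of the proposed user-facing notification (see #232)."""

    def __init__(self, n_failed, n_total, results_csv, suspect_params):
        super().__init__(
            f"Optimization cancelled: {n_failed} of {n_total} runs in the last "
            f"iteration failed to converge.\n"
            f"Completed results were saved to: {results_csv}\n"
            f"Suggested fix: adjust the bounds on {', '.join(suspect_params)}, "
            f"using the failed sample values recorded in the CSV as a guide."
        )


# hypothetical usage
print(OptimizationCancelledError(1, 99, "working_dir/ouu_results.csv",
                                 ["BFBadsB.Dt", "BFBRGN.Lb"]))
```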
As a long-term goal we should consider incorporating a new sampling method that can take historical runs and intelligently add new samples to maintain a space-filling design. I don't think anyone currently has time for this, so it will probably not happen.
Given that the long-term solution described above by @CSRussell2319 isn't going to be implemented anytime soon, we're lowering the priority on this and removing it from the Feb release board.
This is a subset of #676
no progress, outdated
Discussion (motivated by #321) on which types of errors are re-submittable and which are not. Can the sample size be automatically reduced by throwing out the few jobs that failed?
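To make the question concrete, a triage sketch (the "fatal" strings appear in the turbineMessages log above; the "transient" strings are assumptions about what other failures might look like):

```python
# Strings indicating the model cannot solve at this sample point;
# resubmitting the identical job would just fail again.
FATAL = ("si_SIMULATION_ERROR", "Steady state solution failure")

# Assumed transient conditions (queueing/timeout), worth resubmitting as-is.
TRANSIENT = ("Maximum Wait Time", "timeout")


def triage(turbine_messages: list[str]) -> str:
    text = " ".join(turbine_messages)
    if any(s in text for s in FATAL):
        return "drop"      # discard (or resample elsewhere) and shrink the batch
    if any(s in text for s in TRANSIENT):
        return "resubmit"  # likely an infrastructure hiccup; rerun the same sample
    return "resubmit"


print(triage(["Real Run failed, runStatus=si_SIMULATION_ERROR"]))  # -> drop
```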
@CSRussell2319 will follow up with colleagues on the viability of throwing failed jobs out, or on methods for confirming we still have an accurate distribution.