NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Feature request: add a QC flag instead of crashing filter when an ensemble member is beyond the bounds set in the qceff_table.csv file. #681

Open hgrhgy opened 1 month ago

hgrhgy commented 1 month ago

Use case: Add a new QC flag to obs_seq.final when an ensemble member falls outside the bounds set in the qceff_table.csv file.

Is your feature request related to a problem? When some member values fall outside the bounds configured in qceff_table.csv, the filter program crashes with the error: Smallest ensemble member less than lower bound -3.100498191521694E-004 0.000000000000000E+000.

Describe your preferred solution: Instead of filter crashing, the observations involved in updating this member could be left unassimilated, with a new QC flag written to obs_seq.final to tell the user why they were not assimilated.

Describe any alternatives you have considered: Alternatively, add a control option in the namelist file that tells filter to keep running when the error occurs.

hkershaw-brown commented 1 month ago

Hi @hgrhgy, I think this may be a bug. Do you have a test case you can share that reproduces this error?

Also, can you let us know:

Do you have an input state that has 'out of bounds' values?

hkershaw-brown commented 1 month ago

There is a fix_bound_violations namelist option in probit_transform_nml:

&probit_transform_nml
   fix_bound_violations = .true.
/

Try this and see if it affects your run of filter.

fix_bound_violations corrects bounds violations in transform_to_probit, but only for small round-off errors. At first glance, however, -3.100498191521694E-004 appears to be a fairly large bound violation.
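To illustrate the distinction between a round-off undershoot and a genuine violation, here is a minimal Python sketch. It is not DART's implementation: the function name `fix_small_violation` and the tolerance `1e-12` are assumptions chosen purely for illustration.

```python
def fix_small_violation(value, lower_bound, tol=1e-12):
    """Clamp value to lower_bound if it undershoots by at most tol.

    Hypothetical helper: the name and the tolerance are illustrative
    assumptions, not taken from DART's probit_transform_mod.
    """
    if value >= lower_bound:
        return value
    if lower_bound - value <= tol:
        # Round-off sized undershoot: snap back onto the bound.
        return lower_bound
    # Anything larger is treated as a real bound violation.
    raise ValueError(
        f"value {value!r} is below lower bound {lower_bound!r} "
        f"by more than {tol!r}"
    )
```

Under these assumed settings, an undershoot such as -1e-15 against a lower bound of 0 would be snapped to 0, whereas the reported value of -3.100498191521694E-004 would still raise an error.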

hgrhgy commented 1 month ago

Thanks for replying, @hkershaw-brown. The DART version is v11.0.1, the compiler is Intel Fortran, the precision is r8, and the model is the GEOSChem carbon simulation. You may be able to reproduce the problem by setting the lower bound to a large value.

I tried setting fix_bound_violations = .true., but that did not solve the problem. The table CSV file is attached: qceff_table.csv

hgrhgy commented 1 month ago

If fix_bound_violations = .false., the error occurs in function bnrh_cdf in bnrh_distribution_mod.f90. If fix_bound_violations = .true., the error occurs in function fix_bounds in probit_transform_mod.f90.

In the qceff table file, I set the lower bound to zero for obs_error_info, probit_inflation, probit_state and obs_inc_info. I have checked that the negative number does not come from the observed states: negative values are discarded in the forward operator. So the negative value may come from the extended states, but the lower bound for the extended states is not set in the qceff table file.

hkershaw-brown commented 1 month ago

Thanks for the update, @hgrhgy.

I think if the negative values are discarded, this would be a failure in the forward operator, so that particular forward operator would not be part of the extended state (it would be skipped). We don't have the GeosChem model, but I think I can create an out-of-bounds forward operator with any model to take a closer look at what is going on.

I think either:

hkershaw-brown commented 1 month ago

Reproducer: https://github.com/hkershaw-brown/DART/tree/out-of-bounds-fwd (two observations, lorenz_96_tracer_advection, forward operator out of bounds).

hkershaw-brown commented 3 weeks ago

Hi @hgrhgy, the branch https://github.com/NCAR/DART/tree/qc-for-out-of-bounds-fwd-ops has a fix to catch any forward operators with out-of-bounds errors. It sets the QC to DARTQC_OUT_OF_BOUNDS (41).

Can you give this a try and let me know if this solves your problem?
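As a rough illustration of the idea of flagging rather than crashing, the following Python sketch assigns a QC value when any forward-operator ensemble member lies outside the configured bounds. Only the value 41 for DARTQC_OUT_OF_BOUNDS comes from this thread; the function name, its arguments, and the use of 0 as the "no problem" value are assumptions, and this is not the code on the branch.

```python
DARTQC_OUT_OF_BOUNDS = 41  # value quoted in this thread
QC_OK = 0                  # assumed placeholder for "no problem found"


def qc_for_forward_operator(ensemble_values, lower=None, upper=None):
    """Return a QC value for one observation's forward-operator ensemble.

    Hypothetical helper: a bound of None means "unbounded on that side".
    """
    for v in ensemble_values:
        if lower is not None and v < lower:
            return DARTQC_OUT_OF_BOUNDS
        if upper is not None and v > upper:
            return DARTQC_OUT_OF_BOUNDS
    return QC_OK
```

In this sketch, an observation whose ensemble contains a value such as -3.1e-4 with a lower bound of 0 would be given QC 41 and could then be skipped, instead of the out-of-bounds value reaching the probit transform.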

edit @hkershaw-brown double check bitwise on this

hgrhgy commented 3 weeks ago

Hi @hkershaw-brown, I have merged commit bcf41d1 into my own branch, but the problem is not solved. I debugged in detail with gdb; the stack is shown below. The program crashes in the same function, bnrh_cdf, for different reasons.

CASE 1: The inflation probit is out of bounds. [image]

CASE 2: I then disabled the lower bound condition for the inflation probit and debugged with the same breakpoint; the error changed to: [image]

I expected commit bcf41d1 to solve case 2, but it didn't. The new code is added at line 537 in assim_tools_module, and the crash occurs at line 500, before the added code is reached.
I don't know whether the Failed to converge for quantile warning has any impact on the errors. I am also confused about the difference between the inflation bound settings in input.nml and qceff_table.csv.

Let me know if any other information is needed.

The gdb logs: gdb_log_for_state_out_of_bound.txt, gdb_log_for_inflation_out_of_bound.txt

hkershaw-brown commented 2 weeks ago

The Failed to converge for quantile warning is not a good sign. Also, is this the same input (options and files) that gave the first reported out-of-bounds error? If so, then there may be other problems with your code. It is hard to tell without the code or the input files.

Is your code available on GitHub? If so, please provide the repository. It looks from the gdb_log output that you're using code from https://github.com/apmizzi/DART_Chem rather than DART v11.0.1.

Before going further into this, I'd like to make sure this is something that we can reproduce with DART.

hgrhgy commented 2 weeks ago

It's the same input that gave the first reported problem. The qceff_table.csv may have been changed by turning the lower bound on or off for each option (obs_error_info, probit_inflation, probit_state and obs_inc_info) to test which option caused the error.

The code was cloned from the DART tag v11.0.1, then some forward operators from https://github.com/apmizzi/DART_Chem and the GEOS-Chem model code were merged in, so some of the log output is in Arthur's log style.

The code is not currently on GitHub.

hkershaw-brown commented 2 weeks ago

Hi @hgrhgy We're limited in the support we can provide for private code. I'd recommend you check that your input data respects the bounds set for the QCEFF options.
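One simple way to run such a check, sketched here in Python under stated assumptions: the state values for a single ensemble member are assumed to have already been read into a flat list, and `find_bound_violations` is a hypothetical helper, not part of DART. The numbers in the example are made up, apart from the negative value reported earlier in this thread.

```python
def find_bound_violations(values, lower=None, upper=None):
    """Return (index, value) pairs that fall outside [lower, upper].

    Hypothetical helper: a bound of None means "unbounded on that side".
    """
    violations = []
    for i, v in enumerate(values):
        below = lower is not None and v < lower
        above = upper is not None and v > upper
        if below or above:
            violations.append((i, v))
    return violations


# Illustrative member with one negative entry and a lower bound of zero.
member = [4.1e-4, 0.0, -3.100498191521694e-04, 2.2e-4]
for index, value in find_bound_violations(member, lower=0.0):
    print(f"index {index}: {value} violates the lower bound")
```

Running the same check on every member and every bounded quantity before starting filter would show whether the out-of-bounds values are already present in the input or only appear during assimilation.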

For questions about the scientific options of QCEFF, dart@ucar.edu is the best place to ask.

hgrhgy commented 1 week ago

I understand the limits on support for private code. I have tried the branch qc-for-out-of-bounds-fwd-ops, and the problem can't be reproduced in the lorenz_96_tracer_advection model. I'll fully re-check my input data to ensure it respects the QCEFF bounds, and try to compare the filtering process between the two models. Thank you very much for your help, @hkershaw-brown.