Feature request: GPS obs metadata

kdraeder commented 2 years ago

Use case

It can be useful to identify some provenance of each GPS ob in obs_seq.out files.

Is your feature request related to a problem?

The current instructions for creating GPS obs_seq.out files are to use obs from all platforms (COSMIC, COSMIC2, tdx, ...) available for the desired dates on the CDAAC site. The preferred "mode" is "postProc" or "reprocessed", but in some cases only "near real time" (nrt) obs are available. In some cases those obs degrade the assimilation. The obs are not identified by platform in the obs_seq.out files, which makes it a lot of work to figure out which platform(s) are causing the problem. In addition, CDAAC occasionally adds obs to the existing data sets, which adds a layer of complexity to comparing old and new obs_seq.out files.

Describe your preferred solution

I'm opening this issue as a discussion because I don't know what the scope of the issue(s) should be. I'll describe the ideal solution here. Below I'll describe what I've done in the reanalysis branch, and some other possibilities.

Ideally there would be additional metadata with a very short description of the data source of each ob. It would be most useful if it has the platform name and the data "mode" (nrt, postProc, reprocessed, ...?) or abbreviations, as found in the CDAAC site. It might also be useful to have the date it was added to CDAAC, but that may be hard to provide, due to a lack of documentation on the CDAAC site and/or the need to extract that information from the raw files.

Describe any alternatives you have considered

In the reanalysis branch (not in github yet) I did the following:

Modified convert_cosmic_gps_cdf.f90 to replace the obs error variance with a constant (negative) value, which is different for each platform. Made it read that value from a new namelist variable.
Created an obs_seq.out file from each platform, each having a different platform flag. These can't be used in assimilations; they're just for learning the platform of an ob.

Then there were 2 uses.

Identify the obs in the assimilation which have the largest biases. Look those up in the platform specific obs_seq.out files.
If a new GPS obs_seq.out is causing problems that an old one didn't, compare the (time ordered) list of GPS obs used in the assimilation with the list of platform labeled obs, to see which are "extra" and may be causing the problems. To do this it was helpful to modify program obs_selection to save both the input obs that are found in the selector file, and those that aren't, in separate files.
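
The second use amounts to splitting one list of obs into those that also appear in a reference list and those that do not. Here is a minimal Python sketch of that idea. It is not the modified obs_selection program. It assumes each ob can be reduced to a hashable key (the `(days, seconds)` tuples and the function name `split_by_selector` are hypothetical).

```python
from collections import Counter

def split_by_selector(input_obs, selector_obs):
    """Partition input_obs into (selected, not_selected) lists.

    Keys are treated as a multiset: if two input obs share a key but the
    selector holds that key once, only the first one counts as selected.
    """
    remaining = Counter(selector_obs)
    selected, not_selected = [], []
    for ob in input_obs:
        if remaining[ob] > 0:
            remaining[ob] -= 1
            selected.append(ob)
        else:
            not_selected.append(ob)
    return selected, not_selected

# Hypothetical (days, seconds) keys for an old and a new obs list.
old = [(153000, 0), (153000, 600), (153000, 1200)]
new = [(153000, 0), (153000, 300), (153000, 600),
       (153000, 1200), (153000, 1200)]
common, extra = split_by_selector(new, old)
```

With these inputs, `extra` holds the obs that exist only in the new list, which are the candidates to look up in the platform-labeled files.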

I plan to commit the programs and scripts developed for this debugging somewhere, probably to the reanalysis branch first.

So I have questions about how to proceed.

Is it worth pursuing the ideal solution; adding metadata to GPS obs?
If not, is any of the kluge solution worth putting into the main branch?
Should we change the instructions for creating GPS obs_seq.out files to recommend doing a test with each platform (maybe just NRT?) before using it in a production obs_seq.out?
Is it worth removing "cosmic" and "COSMIC" from places where the obs are not COSMIC; metadata, program and script names and contents?
Should the metadata include the date of inclusion of the ob?

hkershaw-brown commented 2 years ago

just so I'm following this, explain to me mode and platform

preferred "mode" is "postProc" or "reprocessed", but in some cases only "near real time" (nrt)

platform COSMIC, COSMIC2

So you would want two pieces of information, e.g.: platform: COSMIC mode: reprocessed

nancycollins commented 2 years ago

note that all GPS obs already have 6 numbers of metadata plus a string. the kicker is that these numbers are never used in the standard forward operator. they were only needed for the version of the forward operator that interpolated along the ray. (we use the version that assigns the observation to the center point of the ray between the emitting and observing GPS satellite.) they are a huge waste of memory. i'm happy to support a change where those numbers are dropped and more useful metadata is added.

here's a proposal: we don't preallocate the ray integration arrays, we allocate them (and grow them) on demand if needed. the code can read only the initial string to distinguish between the variations of metadata. existing observations will have one of the two existing strings. the code can use the numbers for the integration version and discard them for the usual version. a new string could indicate there is additional metadata that describes the platform and file processing level. those arrays can grow on demand. we will have to figure out what happens for missing metadata, however.
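
A rough Python sketch of the dispatch described in this proposal, only to illustrate the control flow. The label strings (`LEGACY_A`, `LEGACY_B`, `WITH_SOURCE`), the field layout, and the choice of `None` for missing metadata are all assumptions made for illustration. The thread does not give the real strings or settle the missing-metadata question.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels; the actual strings in existing files are not
# stated in this thread.
LEGACY_LABELS = {"LEGACY_A", "LEGACY_B"}
SOURCE_LABEL = "WITH_SOURCE"

@dataclass
class GpsMetadata:
    label: str
    ray_numbers: Optional[List[float]] = None  # only kept on demand
    platform: Optional[str] = None
    mode: Optional[str] = None

def parse_gps_metadata(fields, keep_ray_numbers=False):
    """Read the leading string first, then decide what else to keep."""
    label = fields[0]
    if label in LEGACY_LABELS:
        # Existing obs carry 6 numbers; store them only if the
        # along-the-ray operator needs them.
        numbers = [float(x) for x in fields[1:7]]
        return GpsMetadata(label, ray_numbers=numbers if keep_ray_numbers else None)
    if label == SOURCE_LABEL:
        # Missing platform or mode becomes None (an assumption).
        platform = fields[1] if len(fields) > 1 else None
        mode = fields[2] if len(fields) > 2 else None
        return GpsMetadata(label, platform=platform, mode=mode)
    raise ValueError(f"unrecognized GPS metadata label: {label!r}")

legacy = parse_gps_metadata(["LEGACY_A", "1", "2", "3", "4", "5", "6"])
tagged = parse_gps_metadata(["WITH_SOURCE", "COSMIC2", "nrt"])
```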

the alternative is to make new observation types for the observations with different metadata. that will replicate code in the forward operator which is still needed for existing files. i don't know what is the cleanest here.

but here's probably the most important consideration. (sorry for burying the lede.) you'll have to think a bit about how you want the diagnostics to work. do you want to mix old and new gps obs? select only by platform? the diagnostics can't select by metadata so if you're interested in diagnostics that let you see summary statistics you'd be looking at making an observation type per platform. probably better to write a simple obs selection program based on obs_loop that lets you subset obs based on the metadata, and then run the result through obs_diag. it's worth thinking about the final usage here before making a decision on how to implement this.
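
The "subset by metadata, then run obs_diag" workflow could look roughly like the Python sketch below. It is not based on obs_loop. It assumes the obs have already been read into dictionaries with hypothetical `platform` and `mode` keys, and it only shows the selection step.

```python
def subset_obs(obs, platform=None, mode=None):
    """Keep obs whose metadata match every criterion that is not None."""
    return [
        ob for ob in obs
        if (platform is None or ob.get("platform") == platform)
        and (mode is None or ob.get("mode") == mode)
    ]

# Hypothetical obs records; real obs would carry values, times, locations.
obs = [
    {"platform": "COSMIC", "mode": "reprocessed"},
    {"platform": "COSMIC2", "mode": "nrt"},
    {"platform": "COSMIC2", "mode": "postProc"},
    {},  # an old ob with no source metadata
]
cosmic2_only = subset_obs(obs, platform="COSMIC2")
cosmic2_nrt = subset_obs(obs, platform="COSMIC2", mode="nrt")
```

Obs without source metadata never match a platform or mode criterion here. That is one possible answer to the question of mixing old and new GPS obs, not a recommendation.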

kdraeder commented 2 years ago

Platform = the satellite or instrument that took the measurement
Mode = the level of processing of the data by CDAAC after the measurement

near real time = minimal processing and error checking, to make it available as soon as possible.
post processed = more quality control
reprocessed = may have some reanalysis involved to do even more quality control and evaluation of the obs.

Platform would be the most useful. Mode would also be helpful for identifying the more suspect obs. The date the ob was included in the CDAAC file would be useful during debugging, but it's hard to judge yet whether it's worth the effort.

Nancy, good points about the backwards compatibility, final use, and the unused metadata.

I'm not in favor of defining a new obs_type for each new GPS platform. There seem to be a growing number and different selection every year, and some of them are short-lived. Most of the time users will just want to see all the GPS obs lumped together, on the assumption that they've been vetted before being included in the obs_seq.out files. During that vetting, or later when problems occasionally come up, it seems reasonable to go to the extra work to separate them by metadata in a separate step, especially if the work is minimized by having useful metadata in the files.

hkershaw-brown commented 2 years ago

note from Jeff at today's standup: there are people who are using the along the ray fwd operator.

Added back burner label since Kevin does not need this feature at the moment.

kdraeder commented 2 years ago

Jeff also reminded us that the obs_seq file architecture is designed to enable adding metadata. He sees no need to spend time on this at this time. So I propose committing the programs and scripts, which work around the lack of platform metadata in the obs_seq files, to the reanalysis branch and closing this issue. In case this is reopened in the future here are some notes describing them, and some other commands that were useful. No one needs to read below here at this time.

assimilation_code/programs/obs_selection/

obs_selection.f90
obs_selection.nml

Added the ability to separate all input obs into selected and not selected, in separate files. Previously, obs that were not selected were ignored. Added a namelist variable: the file name for the not selected obs.

observations/obs_converters/gps/

convert_cosmic_gps_cdf.f90
convert_cosmic_gps_cdf.nml
work/input.nml

Added a kluge to tag each ob with information about the platform or some other characteristic that is not included in the metadata. This is done by specifying a constant (negative) value for the obs error variance, so the resulting file cannot be used for assimilation. It can be used in the (modified) obs_selection tool.

observations/obs_converters/reanalysis/gps/shell_scripts/

kdr_multi_parallel.batch

Replaced provision of input.nml with the use of an input.nml.template. Added comments about cosmic2 observations in early 2020.

observations/obs_converters/reanalysis/gps/shell_scripts/

kdr_single_platform.batch

A new script to handle creating an obs_seq file which has a platform tag replacing the obs error variance. For use with the modified convert_cosmic_gps_cdf.f90.

uniq_times.csh

A new script to tally the number of obs at each unique time. TODO? This is slow. It should probably be a fortran program.
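
The tally itself is simple to express outside of csh. Here is a minimal Python sketch, not a replacement for uniq_times.csh. It assumes the input is a list of text lines with one ob time per non-blank line, and that two times are equal when their whitespace-separated fields match.

```python
from collections import Counter

def tally_times(lines):
    """Count how many obs share each unique time.

    Each non-blank line is split on whitespace so that differences in
    column padding do not create spurious "unique" times.
    """
    counts = Counter()
    for line in lines:
        fields = tuple(line.split())
        if fields:
            counts[fields] += 1
    return counts

# Hypothetical time lines with inconsistent padding.
lines = ["     0     153000", "0 153000", "", "   600     153000"]
counts = tally_times(lines)
for fields, n in counts.items():
    print(" ".join(fields), n)
```

A single pass with a hash-based counter avoids repeated scans of the file.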

platforms.csh

A script to extract the platforms (fake obs err var) of a list of times. The list of times can come from uniq_times.csh. The obs file needs to have platform tags in the obs err variance from convert_cosmic_gps_cdf.nml:use_constant_err=-$flag

Also useful: uniq_times.csh reads a file created from

set pre_day  = the day before the one we want to summarize
set post_day = the day after the one we want to summarize
grep -B 2 OBS obs_seq.out | \
grep -v -e OBS -e '^[ ]*-' -e '--' -e first -e $pre_day \
 -e $post_day -e COSMIC  >! obs_seq.out.times

Output from platforms.csh goes into obs_seq.out.times.uniq.platforms

$ wc -l obs_seq.out.times.uniq.platforms
       2610 obs_seq.out.times.uniq.platforms
$ grep -e '-6' obs_seq.out.times.uniq.platforms | wc -l
       2610

'-6' was the flag for COSMIC2 obs.
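
The same check can be sketched in Python by mapping each fake error variance back to a platform name. Only the `-6` to COSMIC2 pairing comes from this thread. The function names are hypothetical, and any other flag is reported as unknown until it is added to the table from the namelist settings that were actually used.

```python
from collections import Counter

# Only -6 is documented in this thread; extend this table to match the
# use_constant_err values used when the tagged files were created.
FLAG_TO_PLATFORM = {-6: "COSMIC2"}

def platform_from_err_var(err_var):
    """Translate a negative flag stored as the obs error variance."""
    if err_var < 0:
        return FLAG_TO_PLATFORM.get(round(err_var), "unknown")
    return "unknown"  # a real (non-negative) variance carries no tag

def count_platforms(err_vars):
    """Tally the number of obs per decoded platform."""
    return Counter(platform_from_err_var(v) for v in err_vars)

totals = count_platforms([-6.0, -6.0, 0.25, -3.0])
```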

hkershaw-brown commented 2 years ago

The kludges cause incorrect assimilations, correct? Because of the negative variance? If so I would recommend committing the kludges (any code that would cause a filter run to break) in a separate directory with notes on what the code does, e.g.

DART/gps_observation_problems
    | - convert_cosmic_gps_cdf.f90
    | - README

Or if you want to commit the files in place, include in the commit some lines in quickbuild.csh that stop you building filter:

 echo "DO NOT USE THIS, because ..."
 exit 666

kdraeder commented 2 years ago

Helen, good point. What do you think about a line in convert_cosmic_gps_cdf which changes the name of the output file to gpsro_out_file//'.neg_err_var' if the error variance has been set to a flag value (negative, non-MISSING_R8)? Then the user will know that the output files have that in them and shouldn't be used for assimilation. I've tested this and it works. And it allows convert_cosmic_gps_cdf to also function exactly as before, using either the kuo or gsi error variances if the namelist says to. What I've tested is in ~raeder/DART/reanalysis_git/observations/obs_converters/gps and /glade/scratch/raeder/gps_conversion/Test_neg_tsx/20200101.
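
The renaming rule can be stated compactly. This is a Python sketch of the logic only, not the Fortran that was tested. The value used for `MISSING_R8` is an assumption about DART's missing-data constant, and the function name is hypothetical.

```python
MISSING_R8 = -888888.0  # assumed value of DART's missing-data constant

def output_file_name(gpsro_out_file, constant_err_var):
    """Append '.neg_err_var' when the error variance is a platform flag.

    A flag is any negative value other than MISSING_R8; in that case the
    file is marked so it is not mistaken for one usable in assimilation.
    """
    is_flag = constant_err_var < 0.0 and constant_err_var != MISSING_R8
    if is_flag:
        return gpsro_out_file + ".neg_err_var"
    return gpsro_out_file

tagged_name = output_file_name("obs_seq.gpsro", -6.0)
normal_name = output_file_name("obs_seq.gpsro", MISSING_R8)
```

When no flag is set the name is returned unchanged, so existing workflows see the same output file as before.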

I think that preventing the build of filter (in a different work directory) would be excessive and unnecessary, since we'd only want to prevent it if the wrong obs_seq files were going to be used some time in the future.

hkershaw-brown commented 2 years ago

no strong opinion on this, whatever works for you.

kdraeder commented 2 years ago

I pushed the changes listed above to the reanalysis branch. Should I issue a PR to tie that to this issue?

hkershaw-brown commented 2 years ago

Nope. You're working on the reanalysis branch.
