GEOS-ESM / swell

Workflow system for coupled data assimilation applications
https://geos-esm.github.io/swell/
Apache License 2.0
14 stars 4 forks source link

`get_channels` utilities are fragile and throw uninformative errors #411

Closed ashiklom closed 3 weeks ago

ashiklom commented 1 month ago

When running Tier 1 tests, I frequently hit an error like this:

"/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/tasks/run_jedi_hofx_executable.py", line 117, in execute
    jedi_dictionary_iterator(jedi_config_dict, self.jedi_rendering, window_type,
  File "/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/utilities/run_jedi_executables.py", line 48, in jedi_dictionary_iterator
    jedi_dictionary_iterator(value, jedi_rendering, window_type, obs,
  File "/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/utilities/run_jedi_executables.py", line 75, in jedi_dictionary_iterator
    obs_dict = jedi_rendering.render_interface_observations(ob)
  File "/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/utilities/render_jedi_interface_files.py", line 206, in render_interface_observations
    avail_channels, active_channels = get_channels(self.observing_system_records_path,
  File "/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/utilities/get_channels.py", line 91, in get_channels
    active_channels_list = process_channel_lists(active_channels)
  File "/gpfsm/dnb33/ashiklom/swell/src/swell/swell_develop/src/swell/utilities/get_channels.py", line 28, in process_channel_lists
    if '-' in element:
TypeError: argument of type 'NoneType' is not iterable

Here's the relevant source:

https://github.com/GEOS-ESM/swell/blob/develop/src/swell/utilities/get_channels.py#L24-L33

This is because one or more entries in channel_list is None. But the underlying issue here is missing observations/channel files (which might be caused by issues in GEOSmksi). (For reference, the root cause of this particular error was that I was using the main branch of GEOSmksirather than thedevelop` branch.)

I think we need to:

  1. Catch and report this error more accurately upstream (e.g., "Missing channels for variable X. Confirm that you are using the right version of GEOSmksi").
  2. Write some tests that check these capabilities in isolation (or maybe just start running the existing missing_obs_test.py that @asewnath already wrote as part of our code_tests suite).
asewnath commented 1 month ago
  1. Write some tests that check these capabilities in isolation (or maybe just start running the existing missing_obs_test.py that @asewnath already wrote as part of our code_tests suite).

@ashiklom I'm not following what you mean by this

ashiklom commented 1 month ago
  1. Write some tests that check these capabilities in isolation (or maybe just start running the existing missing_obs_test.py that @asewnath already wrote as part of our code_tests suite).

@ashiklom I'm not following what you mean by this

Yeah, that wasn't very clear, sorry!

Basically, I would like to see a more thorough unit test of the JediConfigRendering class. For example:

  1. Simulate some test data matching the structure of what's in GEOS_mksi, use that simulated data to initialize the JediConfigRendering, and test the various methods to make sure that everything works as expected.
  2. Again, simulate some data but with some specific errors/issues that we expect to fail in specific ways (e.g., missing files that are needed). Then, the corresponding test should check that the correct function throws the expected error (rather than some unexpected error).
  3. More generally, a good way to generate unit test cases is based on recently addressed bugs. For example, here, we have a specific bug that fails in a predictable way in cases where the wrong version of GEOS_mksi is cloned. So, we can have a specific unit test that incorrectly clones GEOS_mksi and tests that we throw the informative error added in #414.

The hardest part of this is to write the unit tests in such a way that we can test just the JediConfigRendering class quickly and efficiently (in a few seconds), without having to spend minutes and many compute resources on running full Swell workflows as we do currently with the Tier 1 tests.

asewnath commented 1 month ago

Got it, thanks for the clarification! I have actually worked on some testing that bypasses any jedi rendering and directly calls get_channels. It's called get_active_channels_test.py. I don't think it was every officially added to the code tests so I can expand it to test for GEOS_mksi clones and add it.