NCAR / CUPiD

CUPiD is a “one stop shop” that enables and integrates timeseries file generation, data standardization, diagnostics, and metrics from all CESM components.
https://ncar.github.io/CUPiD/
Apache License 2.0
25 stars 24 forks source link

Adding a few rof notebooks #126

Closed nmizukami closed 1 week ago

nmizukami commented 3 months ago

Initial commits for ROF notebooks.

An ultimate set of the notebooks intend to mimic old ROF diagnostic plots

This PR is just starting with a few notebooks.

All Submissions:

New Feature Submissions:

  1. [x] Does your submission pass tests?
  2. [x] Have you lint your code locally prior to submission?

Changes to Core Features:

TeaganKing commented 3 months ago

Hey @nmizukami ! Thanks so much for adding these notebooks! I made those changes we discussed to cupid-run and the config file to include rof in running cupid and in the jupyter book table of contents.

One note: I think we'll need to provide math in cupid-analysis if you prefer to use that for sqrt rather than numpy in metrics.py? I don't think this should be a problem, but just let me know if you'd like me to do that?

Would you be able to pull these changes in locally, test running your notebook with cupid-run and make sure that things look as you expect?

nmizukami commented 3 months ago

Hi Teagan (@TeaganKing), The notebook almost ran with cupid-run -rof. One error was reading geopackage file (gis vector data) via geopanda.

You can see /glade/work/mizukami/CUPiD/examples/coupled_model/month_annual_flow.ipynb.

cupid prints on screen this:

RROR 1: PROJ: proj_create_from_database: /glade/u/apps/casper/23.10/spack/opt/spack/proj/8.2.1/gcc/12.2.0/7gif/share/proj/proj.db contains DATABASE.LAYOUT.VERSION.MINOR = 2 whereas a number >= 3 is expected. It comes from another PROJ installation.

I have seen this before. I don't fully understand the error, but this is coming from pyogrio package that came with geopandas. pyogrio is used internally in geopandas.

When I ran the notebook outside cupid, it runs fine but I activated Python [conda-env:cupid-analysis] environment in the jupyterhub. I see another one called cupid-analysis, which I believe cupid actually uses. I saw similar error when I use cupid-analysis. wondering what is the difference between Python [conda-env:cupid-analysis] and cupid-analysis?

hopefully I am simply setting something e.g., environment incorrectly...

TeaganKing commented 3 months ago

Hi @nmizukami , Sorry I let this slip! In the environment in which this was working, did you have a particular version of geopandas or pyogrio pinned? I could also add that to the environment yaml specification. Or when you previously ran into this error, did you have another solution?

This error may be because PROJ is already installed-- I'm not sure where at this point, but can look into that.

nmizukami commented 3 months ago

Hi @TeaganKing, some hint is that I can ran outside cupid-run, meaning I can run the notebook manually on jupyterhub with [conda-env:cupid-analysis] env on, but NOT with cupid-analysis on (get similar error on PROJ). You see two similar envs in Jupyter in image below. I believe the package versions should be ok. I can think about this more... I don't know what is the difference between [conda-env:cupid-analysis] and cupid-analysis

Screen Shot 2024-08-27 at 12 07 05 PM

TeaganKing commented 3 months ago

It sounds like there may be some issue related to the ipykernel installation. I think one of these might be the installation from ipykernel (a soft linked conda environment) and the other may be a conda environment found elsewhere (possibly an outdated cupid-analysis that doesn't include geopandas?). Mike mentioned that the ipykernel installation basically creates a softlink to an environment, which made me think that could be an inconsistency.

I had updated a test environment but not my actual cupid-analysis environment; I'm doing that now and will test your notebook out. This is probably not the most efficient workflow, but I wonder if it might also be worth removing your cupid-analysis environment, see if it's still listed as an option in JupyterHub, make sure that both versions are removed, and then re-install a clean version?

nmizukami commented 3 months ago

I did the following steps to remove cupid-analysis env and reinstall it on terminal.

mamba remove --name cupid-analysis --all
mamba env create -f environments/cupid-analysis.yml

It did not fix it. After removing cupid-analysis, jupyterhub still showed cupid-analysis, though [conda-env:cupid-analysis] was gone.

nmizukami commented 3 months ago

Hi @TeaganKing , trying to run conda list to see what packages are there in cupid-analysis env when running cupid-run. So unfortunately including conda list in the notebook cause error in cupid run:

SyntaxError: An error happened when checking the source code. 
:25:7: invalid syntax

conda list
nmizukami commented 3 months ago

casper-login1:/glade/work/mizukami/CUPiD/examples/coupled_model (main_adding_rof)> cupid-run -rof

/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py:455: UserWarning: 
========================================================================================= DAG render with warnings =========================================================================================
----------------------------------------------------------------- NotebookRunner: index -> File('computed_notebook...ucture/index.ipynb') ------------------------------------------------------------------
----------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/infrastructure/index.ipynb -----------------------------------------------------------------
These parameters are not used in the task's source code: 'CESM_output_dir', 'lc_kwargs', 'serial', and 'subset_kwargs'
----------------------------------------------------------- NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb') ------------------------------------------------------------
---------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/rof/month_annual_flow.ipynb -----------------------------------------------------------------
These parameters are not used in the task's source code: 'CESM_output_dir', 'lc_kwargs', 'serial', and 'subset_kwargs'
============================================================================================ Summary (2 tasks) =============================================================================================
NotebookRunner: index -> File('computed_notebook...ucture/index.ipynb')
NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb')
========================================================================================= DAG render with warnings =========================================================================================

  warnings.warn(str(warnings_))
Executing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.03cell/s]
Building task 'month_annual_flow':  50%|███████████████████████████████████████████████████████████████████                                                                   | 1/2 [00:02<00:02,  2.92s/itERROR 1: PROJ: proj_create_from_database: /glade/u/apps/casper/23.10/spack/opt/spack/proj/8.2.1/gcc/12.2.0/7gif/share/proj/proj.db contains DATABASE.LAYOUT.VERSION.MINOR = 2 whereas a number >= 3 is expected. It comes from another PROJ installation.
                                                                                                                                                                                                           /glade/u/apps/opt/conda/condabin/conda                                                                                                                                      | 5/69 [00:20<03:39,  3.44s/cell]
Executing:  90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍               | 62/69 [03:53<00:26,  3.76s/cell]
Building task 'month_annual_flow': 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [03:56<00:00, 118.06s/it]
Traceback (most recent call last):
  File "/glade/work/mizukami/conda-envs/cupid-dev/bin/cupid-run", line 8, in <module>
    sys.exit(run())
             ^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/CUPiD/cupid/run.py", line 290, in run
    dag.build()
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 557, in build
    report = callable_()
             ^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 662, in _build
    raise build_exception
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/dag/dag.py", line 591, in _build
    task_reports = self._executor(dag=self, show_progress=show_progress)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/work/mizukami/conda-envs/cupid-dev/lib/python3.11/site-packages/ploomber/executors/serial.py", line 203, in __call__
    raise DAGBuildError(str(exceptions_all))
ploomber.exceptions.DAGBuildError: 
============================================================================================= DAG build failed =============================================================================================
----------------------------------------------------------- NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb') ------------------------------------------------------------
---------------------------------------------------------------- /glade/work/mizukami/CUPiD/examples/nblibrary/rof/month_annual_flow.ipynb -----------------------------------------------------------------
---------------------------------------------------------------------------
Exception encountered at "In [24]":
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[24], line 2
      1 column_stat = []
----> 2 gauge_shp_all_case = gauge_shp.copy(deep=True)
      3 for case, grid_name in cases.items():
      4     gauge_shp_all_case = gauge_shp_all_case.merge(
      5         gauge_shp1[case][["id", f"{error_metric}_{grid_name}"]],
      6         left_on="id",
      7         right_on="id",
      8     )

NameError: name 'gauge_shp' is not defined

ploomber.exceptions.TaskBuildError: Error when executing task 'month_annual_flow'. Partially executed notebook available at /glade/work/mizukami/CUPiD/examples/coupled_model/computed_notebooks/quick-run/rof/month_annual_flow.ipynb
ploomber.exceptions.TaskBuildError: Error building task "month_annual_flow"
============================================================================================= Summary (1 task) =============================================================================================
NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb')
============================================================================================= DAG build failed =============================================================================================
nmizukami commented 3 months ago

Hi @TeaganKing, Small good new is that I got it run without the geopanda error. The trick is to add this os.environ['PROJ_LIB']='/glade/work/mizukami/conda-envs/cupid-analysis/share/proj' before loading geopandas. However, I don't think this is permanent solution. I still try to consult with CISL.

I was able to create /glade/work/mizukami/CUPiD/examples/coupled_model/computed_notebooks/quick-run/_build/html/index.html How do you usually open under HPC. I was trying to open firefox in derecho/casper, but it is very slow. Wonder if there is any other ways to look.

TeaganKing commented 3 months ago

Hi @nmizukami , I'm glad that is temporarily working (but of course we need this to work for any user's environment). Yes, I think this would be a good conversation to have with CISL.

Regarding looking at output, see the second section on this page for recommendations on NCAR machines.

TeaganKing commented 2 months ago

Hey @nmizukami , I added a PR to bring rof into run.py. And then I realized these changes are already in this PR... so apologies-- feel free to ignore that!

TeaganKing commented 2 months ago

To-do:

TeaganKing commented 2 months ago

Per conversation with @nmizukami , this is now working when he cloned the cupid package, installed both of the python environments, and ran create_conda_kernel under cupid-analysis, instead of python -m ipykernel install --user --name=cupid-analysis.

I also updated the readme to include rof and fixed a number of issues from pre-commit checks.

Could you please fix these two errors from pre-commit checks in examples/nblibrary/rof/scripts/metrics.py? If metrics.py is not currently being used, it would also be great to either 1) remove this script and add it in when the notebooks that are going to use it are included in CUPiD, OR 2) fix the scripts to make sure they work and include them in this PR.

examples/nblibrary/rof/scripts/metrics.py:363:26: F821 undefined name 'myCount'
examples/nblibrary/rof/scripts/metrics.py:424:26: F821 undefined name 'myCount'

After updating my conda environment, I ran into the following rof-related errors. I'll plan to look into these issues in more detail (and do a more thorough review of the notebooks + scripts) once jupyterhub is back up. It seems like they could be a permission-related error where the dictionaries are not being properly populated and thus the keys are incorrect.

--------------------------------------------------------------------------- NotebookRunner: month_annual_flow -> File('computed_notebook..._annual_flow.ipynb') ----------------------------------------------------------------------------
-------------------------------------------------------------- /glade/work/tking/cupid_project/other_cupids/naoki_cupid/CUPiD/examples/nblibrary/rof/month_annual_flow.ipynb ---------------------------------------------------------------
---------------------------------------------------------------------------
Exception encountered at "In [12]":
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[12], line 6
      3 gauge_plot[gname] = {}
      4 for case, _ in cases.items():
      5     gauge_ix, seg_index, seg_id = find_seg_id(
----> 6         site_id, reachID[case], gauge_reach_lnk[case]
      7     )
      8     gauge_plot[gname][case] = [gauge_ix, seg_index, seg_id]

KeyError: 'f09_f09_mg17_mosart'

ploomber.exceptions.TaskBuildError: Error when executing task 'month_annual_flow'. Partially executed notebook available at /glade/work/tking/cupid_project/other_cupids/naoki_cupid/CUPiD/examples/coupled_model/computed_notebooks/quick-run/rof/month_annual_flow.ipynb
ploomber.exceptions.TaskBuildError: Error building task "month_annual_flow"
---------------------------------------------------------------------------- NotebookRunner: ocean_discharge -> File('computed_notebook...an_discharge.ipynb') -----------------------------------------------------------------------------
--------------------------------------------------------------- /glade/work/tking/cupid_project/other_cupids/naoki_cupid/CUPiD/examples/nblibrary/rof/ocean_discharge.ipynb ----------------------------------------------------------------
---------------------------------------------------------------------------
Exception encountered at "In [16]":
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[16], line 34
     32     id_list = gauge_shp1[case]["route_id"].values
     33 else:
---> 34     id_list = gauge_shp1[case][gauge_shp1[case]["ocean"] == ocean_name][
     35         "route_id"
     36     ].values
     38 reach_index = get_index_array(reachID[case], id_list)
     39 seas_data_vector = seas_data[case][q_name].stack(seg=("lat", "lon"))

KeyError: 'f09_f09_mg17_mosart'
nmizukami commented 2 months ago

did I mess this up? I was rebasing NCAR/main and also nmizukami/main_adding_rof to get my local to up-to-date.

TeaganKing commented 2 months ago

Thanks for fixing that! These are now running smoothly on my end with those updated permissions!

nmizukami commented 2 months ago

In month_annual_flow.ipynb, I also have a few additional comments (but the file was too large to render/comment on particular lines): All sections: Can you please remove empty cells? I think the ‘go back to top’ sections could be removed if this is being run all at once?

Removed empty cells and removed ‘go back to top’ link.

Section 1. Are the lists of cases intended to be regularly used options? Is the yaml file that’s loaded going to be consistent, or is it intended to be updated by user?

Yes, case is CESM case, which CESM modeler put and this has to be provided by the user. I tried adding some comments what need to be provided here. Multiple cases are allowed to compare the results from different cases

2.1 – there’s a typo ‘Mmonthly’

Fixed.

2.6 - can you add a docstring to define what’s happening in this function?

This function is removed.

3.1 – I think this cell may have an empty slice error.

I think I know how this happens. I think this is due to the inconsistent observation data length across the sites (some sites provide the data for a short period). No observed flow data is available depending analysis (or simulation) periods. Can I fix this in another PR?

TeaganKing commented 2 months ago

Thanks for addressing some of these comments!

I made issue ticket #140 for the empty slice error-- that's fine by me to address in another PR as long as the notebook otherwise runs smoothly (which it does) and users are informed (by the issue ticket) of the error that needs to be fixed.

Ok, I wanted to make sure that we are using the cases values in the config.yml file and that users don't actually need to update anything in the notebook once the cell is tagged as a parameters cell.

mnlevy1981 commented 1 month ago

I started to look at this, and I have a lot of questions and suggestions -- @nmizukami and @TeaganKing could we try to find a time early next week to meet? I'd like to discuss a few things that might be tough to squeeze into an in-line review on this PR. Some initial comments, though (and maybe these changes will make it a little easier to go back through line-by-line asynchronously):

  1. It looks like this is failing the pre-commit style checks
  2. I noticed in month_annual_flow.ipynb you have a logical flag parallel that enables using PBSCluster when set to true. If you look at examples/nblibrary/ocn/ocean_surface.ipynb you'll see how CUPiD already passes a serial flag and uses a LocalCluster when that is set to false... I haven't looked at the other runoff notebooks, but we need to avoid casper- or derecho-specific blocks of code
  3. The notebooks in examples/nblibrary/rof should not have any output in them
nmizukami commented 1 month ago

Hi Mike (@mnlevy1981) and Teagan (@TeaganKing), yes, I can meet on Monday or Tuesday (my calendar is up-to date). I was wondering about PBSCluster (2nd point)

nmizukami commented 1 month ago

To-do

nmizukami commented 1 month ago

Updated key_metrics/config.yml and coupled_model/config.yml for rof modify two notebooks based on config changes so now they run.

Review is needed and some science questions came up (e.g., what to do if you plot for time period when no observation is available. Are the other notebooks comparing the model outputs with observations??)

TeaganKing commented 1 month ago

Hey @nmizukami , thanks for these updates.

Not all notebooks are comparing with observations, but you can see an example of an observational comparison in the glacier notebook & corresponding config.yml details.

I think that if you are plotting for a time period where observations are not available, perhaps a warning statement that the obs are unavailable would be useful?

TeaganKing commented 1 month ago

And I'll review after our discussion on Thursday.

nmizukami commented 1 month ago

Right now I am pointing to case /glade/campaign/cesm/development/cross-wg/diagnostic_framework/CESM_output_for_testing/b.e23_alpha16b.BLT1850.ne30_t232.054

The time period for this case is year 0001-0102, when for sure there is no observation for any components. So I thought this config is meant to compare the simulation with some base simulation, to see the model comparison or something like that, not meant to validate the model component with observations.

Just wanted to understand the context of this setup. just with current config, the rof notebooks look less interesting, but technically the notebook works now (I believe).

If config point to any CESM cases that use 20th-21st century, rof notebook automatically adds the observed streamflow to the plots, and compare the simulations with observations.

TeaganKing commented 1 month ago

The key setup here that's different from the coupled-model example is in the 'global params' section of the config file, where we have both a case name for the case you're looking at, as well as the base_case_name for a comparison case. The observations are defined separately in each individual notebook config section at this point.

TeaganKing commented 1 month ago

That sounds good that plots are generated without obs if obs do not exist.

TeaganKing commented 1 month ago

@nmizukami is planning to do the following:

Once these items are done, @TeaganKing can review.

nmizukami commented 1 month ago

@nmizukami is planning to do the following:

  • [x] implement comparison with base_case in addition to obs
  • [x] remove years that overwrite config file start/end years
  • [x] include analysis-period configuration parameter to specify e.g. 10 years so that users don't need to run 100 years unless they really want to do so.
  • [x] test cupid-run from key_metrics directory

Once these items are done, @TeaganKing can review.

Hi @TeaganKing, all are done!

TeaganKing commented 1 month ago

Hey @nmizukami , thanks for all your work on this! It's definitely getting much closer!

I made a few in-line comments above and also wanted to remind you of @mnlevy1981 's comment from above: "I noticed in month_annual_flow.ipynb you have a logical flag parallel that enables using PBSCluster when set to true. If you look at examples/nblibrary/ocn/ocean_surface.ipynb you'll see how CUPiD already passes a serial flag and uses a LocalCluster when that is set to false... I haven't looked at the other runoff notebooks, but we need to avoid casper- or derecho-specific blocks of code". This same structure is also currently used in ocean_discharge.ipynb. You can see details on the current serial vs parallel implementation in our documentation.

Also, this is a minor comment, but would you be able to rename your notebooks according to our new suggested naming convention: <region>_<variable>_<metric>_<comparisons>.ipynb (eg, something similar to Global_PSL_NMSE_compare_obs_lens.ipynb or Greenland_SMB_visual_compare_obs.ipynb)

TeaganKing commented 1 month ago

It looks like the notebooks ran in key_metrics in about 15 minutes. Both notebooks seem to be running smoothly. One additional comment regarding the figures-- in month_annual_flow.ipynb, there are a number of figures that are saved but not directly plotted in the notebook. For the purposes of generating the juptyerbook, I think it'd be best to show the plots directly in the notebook.

In the coupled_model example, in Cell 5, there was an error with no files to open. It may be best to leave the key_metrics example, and take out rof from the coupled_model example since we'll need to update that configuration file anyways and create an 'additional metrics' example.

nmizukami commented 1 month ago

Hey @nmizukami , thanks for all your work on this! It's definitely getting much closer!

I made a few in-line comments above and also wanted to remind you of @mnlevy1981 's comment from above: "I noticed in month_annual_flow.ipynb you have a logical flag parallel that enables using PBSCluster when set to true. If you look at examples/nblibrary/ocn/ocean_surface.ipynb you'll see how CUPiD already passes a serial flag and uses a LocalCluster when that is set to false... I haven't looked at the other runoff notebooks, but we need to avoid casper- or derecho-specific blocks of code". This same structure is also currently used in ocean_discharge.ipynb. You can see details on the current serial vs parallel implementation in our documentation.

I believe I have upadted both notebooks so they use LocaCluster now like this. Are you looking at the old ones??

client = None
if serial:
    cluster = LocalCluster(**lc_kwargs)
    client = Client(cluster)

client

Also, this is a minor comment, but would you be able to rename your notebooks according to our new suggested naming convention: <region>_<variable>_<metric>_<comparisons>.ipynb (eg, something similar to Global_PSL_NMSE_compare_obs_lens.ipynb or Greenland_SMB_visual_compare_obs.ipynb)

How about these?

month_annual_flow.ipynb -> global_discharge_gauge_comparison.ipynb
ocean_discharge.ipynb -> global_discharge_ocean_comparison.ipynb

For comparison part, it plots observations, but it is not available (because of analysis year), it is not plotted.

what does mean (is it different from ?)

nmizukami commented 1 month ago

It looks like the notebooks ran in key_metrics in about 15 minutes. Both notebooks seem to be running smoothly. One additional comment regarding the figures-- in month_annual_flow.ipynb, there are a number of figures that are saved but not directly plotted in the notebook. For the purposes of generating the juptyerbook, I think it'd be best to show the plots directly in the notebook.

15min-> did you run all the components? It seems to take too long. I put a 10 year period (rof_start_data and rof_end_date) in config.yml now, so should be read faster and reading netcdfs is a bottleneck now.

name               Ran?      Elapsed (s)    Percentage
-----------------  ------  -------------  ------------
index              True          9.39778       10.1186
month_annual_flow  True         55.4196        59.6702
ocean_discharge    True         28.0591        30.2112

Regarding no inline figures from cells - actually those cells are not running without observations available. I was not sure what to do with this situation. If there are observations, they should run and produce the figures.

In the coupled_model example, in Cell 5, there was an error with no files to open. It may be best to leave the key_metrics example, and take out rof from the coupled_model example since we'll need to update that configuration file anyways and create an 'additional metrics' example.

TeaganKing commented 1 month ago

Thanks for clarifying-- the way you are using LocalCluster works!

For the notebook names, would these work? Or if you prefer, maybe you could just take out _compare_obs if observations are only available sometimes?

month_annual_flow.ipynb -> global_discharge_gauge_compare_obs.ipynb
ocean_discharge.ipynb -> global_discharge_ocean_compare_obs.ipynb

RE plotting in line-- I think that's fine if the plots show up when observations are available. Since the default is save_figs=False, I think this is reasonable.

Did you want to take out the coupled_model config file changes, per our discussion on updating the config file to a similar format as the key_metrics config file?

TeaganKing commented 1 month ago

Thanks for making these changes, @nmizukami ! These notebook names are clearer now. I also removed the jupyter book build bit for the rof notebooks that are no longer being computed in the coupled model config file. Both notebooks run on the order of 1min now!

TeaganKing commented 1 month ago

I think the only outstanding comment I'm concerned about is the inclusion of personal directories-- which may need to be pushed to a future issue ticket when we have a more clear location for hosting observational datasets. @mnlevy1981 did you want to take a quick look at this, as well?

nmizukami commented 2 weeks ago

I spent some time trying to reduce the number of lines in metrics.py before realizing you never import it.

Thanks for looking at the scripts. Actually it is not imported. I was going to use Cell [19] where bias, RMSE, and corr are computed in global_discharge_gauge_compare_obs.ipynb.

I put this metrics.py even though they are not used. This computes error metrics, also streamflow metrics (high flow, low flow extreme events etc.), which might be useful (at least use for hydrology, river often).

I can leave this out if confusing now, and add it when (if) need later.

Also, global_discharge_gauge_compare_obs.ipynb ran successfully but

  1. Cell [14] (in 3.2. Annual cycle (at monthly step)) is the last cell that produces output; cells 15 - 21 have the %%time output but no plots or tables

Yes, Cell 15 through 21 are skipped if there are no observations available. For this case, the simulation period year 0001 - 0101, so it is outside the time period when observation is available. these cells compute error against observations.

in those cells, the second line if obs_available: -> obs_available is False if there is no observation available.

  1. Did some metadata cells get moved around? I see the following sections in order:

    3. Analysis for 24 large rivers
    3.1 Annual flow series
    3.2. Annual cycle (at monthly step)
    3.3. scatter plots of monthly flow
    3.4. scatter plots of annual flow
    4. Anaysis for Large 50 rivers
    4.1 Summary tables
    4.2. scatter plot of annual mean flow
    
    3. Anaysis for all 922 sites
    3.1 Compute metris at all the sites (no plots nor tables)
    4.2. Spatial metric map
    4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)

Yes, this needs to be updated. But i have been wondering if this is needed?? Also right now each section has a link to the corresponding cell. I don't think the link is not useful here?

(Maybe the reason I could delete metrics.py is because the cells that should call those functions aren't being invoked correctly?)

I haven't had a chance to look at colors.py or utility.py yet, and if it turns out metrics.py is supposed to be called I can go look closer to see if more functions can be replaced with existing numpy calls

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called.

So what I could do is to include the functions used only? These scrips are from the scripts I use for my other notebooks, so if these become useful, could be added later.

mnlevy1981 commented 2 weeks ago

2. Did some metadata cells get moved around? I see the following sections in order:

    3. Analysis for 24 large rivers
    3.1 Annual flow series
    3.2. Annual cycle (at monthly step)
    3.3. scatter plots of monthly flow
    3.4. scatter plots of annual flow
    4. Anaysis for Large 50 rivers
    4.1 Summary tables
    4.2. scatter plot of annual mean flow

    3. Anaysis for all 922 sites
    3.1 Compute metris at all the sites (no plots nor tables)
    4.2. Spatial metric map
    4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)

Yes, this needs to be updated. But i have been wondering if this is needed?? Also right now each section has a link to the corresponding cell. I don't think the link is not useful here?

I would definitely encourage you to have some markdown cells that describe what plots are being generated. I don't think they need to be organized like a paper with sections and subsections. I hadn't noticed the links from the top of the notebook to each specific section until you mentioned it... let me look at how cupid-build parses those links before we discuss keeping them or removing them.

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called.

So what I could do is to include the functions used only? These scrips are from the scripts I use for my other notebooks, so if these become useful, could be added later.

Yes, please remove functions that are not being used (and then I'm happy to review whatever is left). One reason to provide modules like utility.py and colors.py is to potentially share functions with other notebooks, but if these modules contain code that isn't used anywhere then there is no guarantee that they actually work as advertised.

nmizukami commented 2 weeks ago
  1. Did some metadata cells get moved around? I see the following sections in order:
  2. Analysis for 24 large rivers 3.1 Annual flow series 3.2. Annual cycle (at monthly step) 3.3. scatter plots of monthly flow 3.4. scatter plots of annual flow
  3. Anaysis for Large 50 rivers 4.1 Summary tables 4.2. scatter plot of annual mean flow
  4. Anaysis for all 922 sites 3.1 Compute metris at all the sites (no plots nor tables) 4.2. Spatial metric map 4.3 Boxplots of Error metrics (RMSE, %bias, and correlation coefficient)

Yes, this needs to be updated. But i have been wondering if this is needed?? Also right now each section has a link to the corresponding cell. I don't think the link is not useful here?

I would definitely encourage you to have some markdown cells that describe what plots are being generated. I don't think they need to be organized like a paper with sections and subsections. I hadn't noticed the links from the top of the notebook to each specific section until you mentioned it... let me look at how cupid-build parses those links before we discuss keeping them or removing them.

utility.py and colors.py are imported (but not all the functions), but metrics.py is not called. So what I could do is to include the functions used only? These scrips are from the scripts I use for my other notebooks, so if these become useful, could be added later.

Yes, please remove functions that are not being used (and then I'm happy to review whatever is left). One reason to provide modules like utility.py and colors.py is to potentially share functions with other notebooks, but if these modules contain code that isn't used anywhere then there is no guarantee that they actually work as advertised.

Ok, I will clean up the scripts (sorry for confusing you.) and the markdown cells.

nmizukami commented 2 weeks ago

Hi @mnlevy1981, I think the scripts are clean now, and fixed the errors you pointed out in the notebooks. Also markdown cells are update. The notebooks still runs ok, though I saw lots of exceptions from the dask distributed.

TeaganKing commented 1 week ago

Once the comments regarding ancillary data variable naming and personal directories are resolved, we can merge this.

nmizukami commented 1 week ago

Hi @TeaganKing and @mnlevy1981 , I have updated the notebooks and setup.yaml based on your comments. thanks!

TeaganKing commented 1 week ago

Great, thanks so much @nmizukami ! This looks good to me. I think merging is blocked due to Mike's change request-- @mnlevy1981 can you please check that this addressed your comments as well and then merge this in?