Ouranosinc / xclim

Library of derived climate variables, i.e. climate indicators, based on xarray.
https://xclim.readthedocs.io/en/stable/
Apache License 2.0

[PyOpenSci] Reviewer #2 comments #1335

Closed Zeitsperre closed 1 year ago

Zeitsperre commented 1 year ago

Originally posted by @jmunroe in https://github.com/pyOpenSci/software-submission/issues/73#issuecomment-1484520663

This issue is a tickbox summary of comments from the reviewer that seemed addressable in the near-term.

The foreword to the review:

Thank you so much for your patience with me. I have now gone through the package carefully following the PyOpenSci review instructions. This package is an incredibly thorough treatment of climate-based metrics and I found it a real pleasure to review. A few issues I identified are presented below, but none of them are serious and I recommend this package be accepted by PyOpenSci.

I hope that in acceptance by PyOpenSci (and hopefully by JOSS as well), xclim will continue to be extended and adopted by a wide international community of climate scientists and professionals. This is an important body of work that should empower those trying to understand, mitigate, and adapt to our changing climate.

Documentation

TJS: This is addressed in #1338

TJS: This is addressed in #1338

TJS: This is addressed in #1338

Functionality

TJS: This is occurring because we made significant changes to our xclim-testdata repository in recent versions. I realize now that this is breaking because we aren't tagging explicit versions/commits of the testdata that are guaranteed to work. I'm thinking that we might want to start doing that from now on, rather than always pointing at master. @aulemahal, what do you think? Update: This is addressed in #1339

TJS: pylint is configured but we do not currently pass those compliance checks (run with allowed failure). If the amount of effort to get us passing is reasonable, I'll attempt to get this working.


Review Comments

Installation notes
conda create -n my_xclim_env python=3.8 --file=environment.yml
conda activate my_xclim_env
pip install ".[dev]"

And there I hit my first issue:

CondaValueError: could not parse 'name: xclim' in: environment.yml

The fix (at least for conda 22.11.1) is that --file is an option to pass to conda env create and not conda create. This needs to be fixed in the install instructions.

TJS: This is addressed in #1338

I confess I tend to get confused when there is the option of using either environment.yml or requirements_*.txt files. So, I skipped the instructions following 'Extra Dependencies' in the documentation. I assume there must be situations when I should and should not install these extra dependencies, but as a new user of the package, I don't know what those situations are yet. Since these installation instructions are right near the top of the documentation, perhaps it would be better for the maintainers to make those choices for me? For example, I am now wondering "should I be installing flox?". Since it is 'highly recommended', would it not make more sense to have it as part of the default instructions?

TJS: This is addressed in #1338

Basic Usage
# ds = xr.open_dataset("your_file.nc")
ds = open_dataset("ERA5/daily_surface_cancities_1990-1993.nc")
ds.tas

My initial reading of this code made me think that this ERA5 dataset was something I needed to download locally first (I did not distinguish between xr.open_dataset and open_dataset at my very first glance at the code). After some review, I see now that there is a companion GitHub repo with testing data, and that the xclim.testing API automatically makes a locally cached copy of this file. I think it would be clearer if this very first example was written out as

# ds = xr.open_dataset("your_file.nc")
ds = xclim.testing.open_dataset("ERA5/daily_surface_cancities_1990-1993.nc")
ds.tas

so that it is clear that open_dataset is a utility method of the testing framework for xclim.

TJS: This is addressed in #1338

In the example of Health checks and metadata attributes there is a typo:

gdd = xclim.atmos.growing_degree_days(tas=ds6h.tas, thresh="10.0 degC", freq="MS")

should be

gdd = xclim.atmos.growing_degree_days(tas=ds6h.air, thresh="10.0 degC", freq="MS")

TJS: This is addressed in #1338

While in-code comments are generally fine, these last few examples on graphics feel tacked on given the strong narrative text established in the beginning of the Basic Usage section of the documentation.

TJS: This is addressed in #1338

Examples
Workflow Examples

Minor spelling error in the docs:

TJS: This is addressed in #1338

Usually, xclim users are encouraged to use the subsetting utilities of the clisops package. Here, we will reduce the size of our data using the methods implemented in xarray

This is confusing because, as this is the first example workflow, the user has not yet been shown how to use the clisops package. Should there be a sub-subsection immediately before, such as "Subsetting and selecting data with clisops", to demonstrate that recommended workflow?

TJS: This is addressed in #1338

# import plotting stuff
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (11, 5)

that leads to the warning

/tmp/ipykernel_7039/887583071.py:5: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use("seaborn")

I think the offending line should be changed to

plt.style.use("seaborn-v0_8")

(and elsewhere in the documentation where seaborn styles are used)

TJS: This is addressed in #1338
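For documentation that has to build against both older and newer Matplotlib releases, one option is to pick the style name at runtime instead of hard-coding it. The following is only a minimal sketch of that idea: the helper name pick_seaborn_style is hypothetical and is not part of xclim or Matplotlib, and it assumes the list of installed styles is passed in (with Matplotlib, that list is plt.style.available).

```python
def pick_seaborn_style(available_styles):
    """Return the first seaborn-like style name found in `available_styles`."""
    # Matplotlib 3.6+ ships the renamed "seaborn-v0_8" style; older releases
    # only know the plain "seaborn" name. Prefer the new name when present.
    for candidate in ("seaborn-v0_8", "seaborn"):
        if candidate in available_styles:
            return candidate
    # Neither name is installed: fall back to Matplotlib's "default" style.
    return "default"


# With Matplotlib installed, the notebook cell would become:
#     import matplotlib.pyplot as plt
#     plt.style.use(pick_seaborn_style(plt.style.available))
print(pick_seaborn_style(["classic", "seaborn-v0_8", "ggplot"]))
```

Preferring the new name first avoids the deprecation warning on releases where both names are still accepted.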

hw_before.sel(time="2010-07-01").plot(vmin=0, vmax=7)
plt.title("Resample, then run length")
plt.figure()
hw_after.sel(time="2010-07-01").plot(vmin=0, vmax=7)
plt.title("Run length, then resample")

TJS: This is addressed in #1338

# The tasmin threshold is 15°C for the northern half of the domain and 20°C for the southern half.
# (notice that the lat coordinate is in decreasing order : from north to south)
thresh_tasmin = xr.DataArray(
    [7] * 24 + [11] * 24, dims=("lat",), coords={"lat": ds5.lat}, attrs={"units": "°C"}
)
# The tasmax threshold is 16°C for the western half of the domain and 19°C for the eastern half.
thresh_tasmax = xr.DataArray(
    [17] * 24 + [21] * 24, dims=("lon",), coords={"lon": ds5.lon}, attrs={"units": "°C"}
)

The comments in this block don't appear to match the values used in the code. I assume the code comments just need to be updated.

PB: This is addressed in #1338
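To make the mismatch concrete, here is a small check of what the two list expressions from the notebook cell actually produce, written without xarray. The helper describe_halves is a hypothetical name used only for this illustration. It shows 7 and 11 for tasmin and 17 and 21 for tasmax, whereas the comments mention 15°C/20°C and 16°C/19°C.

```python
# Same list expressions as in the notebook cell, without the xr.DataArray wrapper.
tasmin_values = [7] * 24 + [11] * 24
tasmax_values = [17] * 24 + [21] * 24


def describe_halves(values):
    """Return the distinct values found in the first and second half of `values`."""
    middle = len(values) // 2
    return sorted(set(values[:middle])), sorted(set(values[middle:]))


print(describe_halves(tasmin_values))  # ([7], [11])
print(describe_halves(tasmax_values))  # ([17], [21])
```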

Ensemble-Reduction Techniques
# Create 2d xr.DataArray containing criteria values
crit = None
for h in ds_crit.horizon:
    for v in ds_crit.data_vars:
        if crit is None:
            crit = ds_crit[v].sel(horizon=h)
        else:
            crit = xr.concat((crit, ds_crit[v].sel(horizon=h)), dim="criteria")
crit.name = "criteria"

Is this "criteria" array effectively the equivalent of creating a feature matrix used in data science?

TJS: This is addressed in #1341

Ensemble-Reduction Techniques
Statistical Downscaling and Bias-Adjustment

A more complex example could have bias distribution varying strongly across months. To perform the adjustment with different factors for each months, one can pass group='time.month'. Moreover, to reduce the risk of sharp change in the adjustment at the interface of the months, interp='linear' can be passed to adjust and the adjustment factors will be interpolated linearly. Ex: the factors for the 1st of May will be the average of those for April and those for May.

TJS: Many typos and grammatical errors have been addressed in #1338
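The quoted passage's claim about the 1st of May can be illustrated with a simplified, standalone sketch of linear interpolation between monthly factors. This is not xclim.sdba's implementation: the function name interpolated_factor is hypothetical, and the sketch assumes each monthly factor is attached to the middle of its month, measured in fractional months, with the year wrapping around cyclically.

```python
import calendar
import math


def interpolated_factor(factors, month, day, year=2001):
    """Linearly interpolate monthly factors for a given calendar day.

    `factors` maps month number (1-12) to an adjustment factor. Each factor is
    assumed to sit at the middle of its month (position month + 0.5).
    """
    days_in_month = calendar.monthrange(year, month)[1]
    # Position in "fractional months": 5.0 is the very start of May.
    position = month + (day - 1) / days_in_month
    # Index of the month whose midpoint lies at or before `position`
    # (0 means the previous December).
    lower = math.floor(position - 0.5)
    weight = position - (lower + 0.5)  # 0 at the lower midpoint, 1 at the upper
    lower_month = (lower - 1) % 12 + 1
    upper_month = lower % 12 + 1
    return (1 - weight) * factors[lower_month] + weight * factors[upper_month]


# Illustrative factors only: 1.0 everywhere except May, which is 2.0.
example_factors = {m: 1.0 for m in range(1, 13)}
example_factors[5] = 2.0
print(interpolated_factor(example_factors, 5, 1))  # 1.5, the April/May average
```

Under these assumptions, the start of May sits halfway between the April and May midpoints, so both factors get a weight of 0.5, which matches the behaviour described in the documentation excerpt.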

The previous notebook covered the most common utilities of xclim.sdba for conventional cases

TJS: Many typos and grammatical errors have been addressed in #1338

aulemahal commented 1 year ago

Nice! I didn't read everything but indeed, back in the day, @tlvu warned us about using another git repo for the testing data. On one hand, I'm not convinced we need to be able to run tests on older versions. In theory, at the time of release, all tests were passing, no? On the other hand, I realize we use the testing data in the notebooks, and not being able to reproduce those seems more problematic to me.

Thus indeed, I guess that tagging a testdata version would help solve this! (Long, I tagged you in case you had any advice. This comment refers to the first box of the Functionality section above.)

tlvu commented 1 year ago

My previous worries about splitting the testdata from the code: https://github.com/Ouranosinc/xclim-testdata/pull/1#issuecomment-704281014

So tagging the testdata should solve this reproducibility issue.

But to ensure a smooth dev workflow, the code should allow overriding the tag with a branch name. During a dev cycle, both the testdata and the code would most probably move together. Without the override capability, we would have to continuously tag the testdata so it can be used with the code, and this can get tedious.

However, with this tag override capability, we must not forget to tag the final version of the testdata and bump that tag on the code side before merging. The tag should be the default value when no override is used.
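The override scheme described above could look roughly like the sketch below. All names here are hypothetical and chosen for illustration only: the TESTDATA_REF environment variable, the placeholder tag value, and the two helper functions are not taken from xclim's code base. The URL layout assumes files are fetched from GitHub's raw-content host.

```python
import os

# Hypothetical names for illustration only; not taken from xclim's code base.
DEFAULT_TESTDATA_TAG = "v1.0.0"    # placeholder for the tag pinned at release time
OVERRIDE_ENV_VAR = "TESTDATA_REF"  # branch or tag name used during development


def resolve_testdata_ref(environ=None):
    """Return the override ref when set and non-empty, otherwise the pinned tag."""
    environ = os.environ if environ is None else environ
    override = environ.get(OVERRIDE_ENV_VAR, "").strip()
    return override or DEFAULT_TESTDATA_TAG


def build_testdata_url(path, environ=None):
    """Build a raw-file URL for `path` at the resolved ref of the testdata repo."""
    ref = resolve_testdata_ref(environ)
    return f"https://raw.githubusercontent.com/Ouranosinc/xclim-testdata/{ref}/{path}"


# No override: the pinned tag is used, so released code stays reproducible.
print(build_testdata_url("ERA5/daily_surface_cancities_1990-1993.nc", environ={}))
# Override with a branch name while the code and the testdata evolve together.
print(resolve_testdata_ref({"TESTDATA_REF": "my-feature-branch"}))
```

Keeping the tag as the default means a forgotten override cannot leak into a release; only the pinned constant needs bumping before merging.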

Zeitsperre commented 1 year ago

@tlvu

Thanks for the suggestion on how best to proceed for this. We now have a testing data tagging scheme and some GitHub Actions to prevent us from accidentally breaking it during some development branch. It's nothing fancy, but we should be able to more easily test older versions of xclim going forward.

Zeitsperre commented 1 year ago

Except for #1342, we managed to address all major comments in under a work-week. Nicely done, team!