barronh / pseudonetcdf

PseudoNetCDF like NetCDF except for many scientific format backends
GNU Lesser General Public License v3.0
76 stars 35 forks source link

move xarray's pseudonetcdf backend here #137

Closed jhamman closed 1 month ago

jhamman commented 1 year ago

Xarray now has a plugin system for registering backends. For the more specialized backends (like pseudonetcdf), we'd like to work toward moving the backend entrypoint to 3rd party packages. For example, rioxarray now ships with its own xarray backend. Could the same be done here?

barronh commented 1 year ago

@jhamman - I'm on board, but it may take me a bit. First, I have a few clarifying questions.

A lot of my datasets are very large, so I suspect I'll need lazy loading. Each variable is essentially a subclass of an array, which means that it has attributes and special methods, but ultimately the data access depends on the array. That means that when I use memmap's for binary files the data is only retrieved from disk as it is indexed.

So, I want to summarize my understanding -- basically, I'll echo back what I think I understand from your docs. open_dataset returns a xarray.Dataset, which has data_vars, coords, etc. Each element of data_vars and coords is a xarray.DataArray, where the data element would be wrapped in xarray.indexing.LazilyIndexedArray.

Imagining that I code this (correctly) in a module called xarray_plugin within PseudoNetCDF, would the registry look like:

setuptools.setup(
    entry_points={
        "xarray.backends": ["pseudonetcdf=PseudoNetCDF.xarray_plugin:PseudoNetCDFBackend"],
    },
)

In terms of functionality, would this change anything operationally? Like, would the user need to import both xarray and PseudoNetCDF to get access to PseudoNetCDF as a backend? Anything like that? Or does the entry point automatically load with xarray?

Lastly, do I read the my_backend_option in open_dataset has a place holder for one or many options? That is my interpretation from the rioxarray. Such that I could add as many or use **kwds to allow for many? PseudoNetCDF is a reader of many formats, some of which require "special" keywords to correctly initialize (i.e., non-self-describing datasets). So, this point will be important for me.

Thanks for helping walk me through this. I see the value in taking this approach and I love your accessor approach too.

jhamman commented 1 year ago

You've got the details all basically right.

In terms of functionality, would this change anything operationally? Like, would the user need to import both xarray and PseudoNetCDF to get access to PseudoNetCDF as a backend?

No. We should be able to make this a seamless transition for users. No extra imports would be needed as the entrypoints are automatically picked up.

I hope this is mostly a fork-lift exercise from Xarray's repo to this one. I think you can keep existing lazy-loading behavior as is.

Lastly, do I read the my_backend_option in open_dataset has a place holder for one or many options? That is my interpretation from the rioxarray. Such that I could add as many or use **kwds to allow for many? PseudoNetCDF is a reader of many formats, some of which require "special" keywords to correctly initialize (i.e., non-self-describing datasets). So, this point will be important for me.

xr.open_dataset has **kwargs that are passed on to the backend engine. You should be able to use this route to pass any options on to pseudonetcdf.

jhamman commented 1 year ago

Quick bump on this. @barronh - have you had a chance to look into this at all?

barronh commented 1 year ago

Sorry, I'll make this happen by Apr 8.

barronh commented 1 year ago

@jhamman - If it really is a fork-lift, I'm really sorry it took me so long. I have started the process in https://github.com/barronh/pseudonetcdf/tree/xarray-backend. For now, I literally copied and pasted from xarray and then ran through my flake8 requirements, which are a little stricter on the line length.

Right now, I have a few questions.

  1. from __future__ import annotations seems like a 3.7+ python feature. This is a little problematic for me since I have been working in 3.6 and would prefer not to give up that backward compatibility. I'm not familiar with the annotations future feature. Do you know if/how it is being used in the current backend code in xarray or can I get rid of it?
  2. Not totally sure how to test this. Since pseudonetcdf is already a backend, how do I know that the new implementation is being picked up. a. My current thought is to install xarray and pseudonetcdf where the backend name has been changed (e.g., pnc). b. Then, I could run tests with the backend set to the old name (pseudonetcdf) and the new name (pnc)

So, to test, I started with a fresh venv with Python3.8. I installed numpy, scipy, netcdf4, pandas, matplotlib, and xarray with pip using the --prefer-binary option. Then, I installed PseudoNetCDF from the branch where I changed the backend name from pseudonetcdf to pnc in the setup.py.

import xarray as xr

f1 = xr.open_dataset('TestData/MCIP/GRIDCRO2D.12US1.35L.160101.nc')
f2 = xr.open_dataset('TestData/MCIP/GRIDCRO2D.12US1.35L.160101.nc', engine='pseudonetcdf')
f3 = xr.open_dataset('TestData/MCIP/GRIDCRO2D.12US1.35L.160101.nc', engine='pnc')

To make sure the test was sound, I also tried changing the engine from 'pnc' to 'xc' and got the following error:

ValueError: unrecognized engine xc must be one of: ['netcdf4', 'scipy', 'pnc', 'pseudonetcdf', 'store']

This felt good because pnc and pseudonetcdf were separately listed. I've also set it up such that the xarray_plugin module is only loaded if explicitly imported, so pseudonetcdf can still be used without xarray.

So, if you agree this is working, I'd like to figure out if the from __future__ import annotations option is really necessary.

Assuming this was merged, what else would be necessary on my side?

jhamman commented 1 year ago

Sounds like great progress @barronh. I'm glad to hear it turned out to be a fairly straightforward process for you.

from future import annotations seems like a 3.7+ python feature. This is a little problematic for me since I have been working in 3.6 and would prefer not to give up that backward compatibility. I'm not familiar with the annotations future feature. Do you know if/how it is being used in the current backend code in xarray or can I get rid of it?

I would first experiment with leaving this out of your implementation and see if your type checker complains.

If that doesn't work, Xarray is only supporting Python 3.9+ so if you have completely isolated this import from the rest of pseudonetcdf, this may need to be feature only supported by modern Python versions.

Not totally sure how to test this. Since pseudonetcdf is already a backend, how do I know that the new implementation is being picked up.

@snowman2 or @headtr1ck may have some ideas. https://github.com/pydata/xarray/pull/7523 recently allowed you to refresh the list of current backends so you may be able to do something like the test you explain above in your test suite.

Assuming this was merged, what else would be necessary on my side?

We would just add a deprecation warning on the Xarray side and then remove the engine in a few months (like we just did here). User's wouldn't notice the difference though so long as we keep the engine name as pseudonetcdf.

headtr1ck commented 1 year ago

I think it should be easy to make this backend python 3.6 compatible. You probably will have to change some typing, but it is not really important since the backend is usually wrapped by xarrays again. The typing of the backend was mostly meant for internal consistency, so feel free to even remove it completely.

If there is an external backend it takes precedence over an internal one. How one can test this, I'm not sure. Maybe call xarrays list_engines and test if the module of the backend is pseudonetcdf and not xarray?

For testing, I think the best approach would be to copy all tests related to this backend as well, but ofc xarray requires python >=3.9, therefor also these tests.

barronh commented 1 month ago

This has been completed in v3.4.0.