eurec4a / eurec4a-intake

Intake catalogue for EUREC4A field campaign datasets
17 stars 19 forks source link

Tests for netcdf plugins #9

Closed RobertPincus closed 4 years ago

RobertPincus commented 4 years ago

I should have started with a branch instead of committing to master directly, my apologies.

The CI tests appear to be failing during item.discover(). This works on my local installation but the CI is complaining about a missing module "aiohttp". I may try to add this next.

d70-t commented 4 years ago

I think this commit in intake-xarray could break your files. It is supposed to speed up the access to remotely stored netcdf files by downloading and caching them, but that doesn't work this way for opendap resources.

Which version of intake-xarray are you using on your system?

We are using the most recent master of intake-xarray as earlier versions only allowed to access opendap when authenticated via ESGF, but not by other means. But it seems like the current version also broke the possibility of accessing opendap via the netcdf driver. Personally, I didn't realise that it was possible to access OPeNDAP via the netcdfdriver, but it is documented and kind of makes a lot of sense. On the other hand, I see that caching files is also a desireable feature of intake-xarray. And I think it is not possible to guess if a dataset should be accessed by downloading a netcdf file or via opendap, only by looking at the URL.

leifdenby commented 4 years ago

Looking at this a bit more I think the issue is something to do with accessing local files... https://github.com/eurec4a/eurec4a-intake/runs/977998243#step:5:251

This would need a bit more digging, but if it's working locally for you then it should work on the CI platform. It probably is a package inconsistency somewhere...

RobertPincus commented 4 years ago

@d70-t @leifdenby I manage my local environment within conda:

intake                    0.6.0                      py_0
intake-xarray             0.3.1                      py_0

If we are too bleeding-edge with the catalog or the CI tests we run the risk of making the catalog less useful. Aeris no longer requires authentication - might we revert to the current distribution of intake-xarray? Or is the issue elsewhere?

leifdenby commented 4 years ago

thanks @RobertPincus. I've been looking into to this a bit more. It appears that the netcdf intake "driver" provided by intake-xarray should work for fetching netCDF data via OPeNDAP, if that OPeNDAP server doesn't require authentication (list of intake plugins and the drivers they provide). I didn't realise this - it's pretty confusing that netcdf driver is for local netCDF files and for OPeNDAP without authentication, and opendap is for when an OPeNDAP server requires authentication... Anyway, that's why I made a pull-request for changes to intake-xarray to make it possible to provide authentication information (otherwise intake-xarray only works with the ESGF and URS servers which are defined in pydap).

The above means we don't need to use my fork of intake-xarray, but we should replace all use of the opendap driver with netcdf when accessing data from non-authenticated OPeNDAP backend. Unfortunately @d70-t, this doesn't appear to work with your server that is serving the SPECMACS data at the moment. You check this with these examples:

>> import xarray as xr
>> ds = xr.open_dataset('http://macsserver.physik.uni-muenchen.de/products/dap/eurec4a/cloudmask/EUREC4A_HALO_specMACS_cloud_mask_20200119T092900-20200119T184859_v1.1.nc')
...
OSError: [Errno -68] NetCDF: I/O failure: b'http://macsserver.physik.uni-muenchen.de/products/dap/eurec4a/cloudmask/EUREC4A_HALO_specMACS_cloud_mask_20200119T092900-20200119T184859_v1.1.nc'
# try with example from xarray docs
>> ds = xr.open_dataset("http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods", decode_times=False)
>> len(ds)
4

Also @RobertPincus I'm getting an error 400 for the URLs mentioned in this pull-request (e.g. http://psl.noaa.gov/thredds/dodsC/Datasets/ATOMIC/data/p3/AXBT/Level_3/P3_AXBT_Level_3_v0.7.nc), maybe you moved the files since making this pull-request?

I suggest that

1) @d70-t you get your OPeNDAP server working with xarray, and then we 2) remove my fork and use the netcdf driver for non-authenticated OPeNDAP access (which will be all data sources now that AERIS doesn't require authentication) and put this on master (I can do this once you give me the go-ahead @d70-t ), 3) @RobertPincus if you could check that the URLs you're provide are still valid and then you could merge in master from this repo once we've removed all use of the opendap driver 4) we merge your pull-request @RobertPincus and get the P3 files into the catalog.

Sound ok?

leifdenby commented 4 years ago

@d70-t I started a branch to replace use of the opendap driver with netcdf here: https://github.com/leifdenby/eurec4a-intake/commit/f6480443ca366b9ef7c3ca56eef4434723b7a4a2

RobertPincus commented 4 years ago

@leifdenby Thanks this sounds terrific. Many URLs are likely to change today and tomorrow so this timing is perfect. Hoping @d70-t can get the SpecMACS server working well.

d70-t commented 4 years ago

One thing which still bothers me is that I think that the've broken the ability to open OPeNDAP resources with the netcdf driver of intake-xarray with the commit mentioned above, as that uses fsspec to cache the files, which does not seem to be compatible with OPeNDAP. My fear is that while we should access the datasets using netcdf in the current version of intake-xarray, the upcoming version might introduce a breaking change which would force us to use opendap in stead.

I'll have a look at my server anyways.

RobertPincus commented 4 years ago

@d70-t I worry about this too. It seems to me that one should be able to use either OpenDAP or netcdf plugins with remote files. I moved to using netcdf because the opendap plugin was failing because I hadn't set ESGF authentication credentials.

d70-t commented 4 years ago

@leifdenby while you should use https on the macsserver, I still can not reproduce the error from above on my machine:

In [1]: import xarray as xr 
   ...: ds = xr.open_dataset('https://macsserver.physik.uni-muenchen.de/products/dap/eurec4a/cloudmask/EUREC4A_HALO_specMACS_cloud_mask_20200119T092900-20200119T184859_v1.1.nc')                          

In [2]: len(ds)                                                                                                                                                                                            
Out[2]: 9
d70-t commented 4 years ago

I've filed an issue in intake-xarray. Let's see if the've got a suggestion.

leifdenby commented 4 years ago

@leifdenby while you should use https on the macsserver, I still can not reproduce the error from above on my machine:

Thanks for checking this. I think you're right. I've found out that opening the dataset directly with xarray using the URL to the opendap server works on my laptop (linux), but not on my workstation (macos). I've been trying to dig into this, but I think for now we should move ahead with switching to using the netcdf driver.

@leifdenby Thanks this sounds terrific. Many URLs are likely to change today and tomorrow so this timing is perfect. Hoping @d70-t can get the SpecMACS server working well.

Yes, I was hoping we could get this fixed in time for the ATOMIC hackathon. @d70-t should I make a pull-request from the branch I made changing the driver to netcdf? Or will you make a pull-request we can merge?

d70-t commented 4 years ago

This is a bit of a troublesome situation. Yes, it would be awesome to have this sorted out for the ATOMIC hackathon, but I am a bit hesitant of pushing the catalog with netcdf drivers into master as long as we don't know if that behaviour will stay available for longer.

From my perspective, the pro arguments for changing to netcdf are:

The counter arguments against it are:

I think I would prefer to make this problematic situation explicit to the user scripts and create a separate branch which contains references to the netcdf based catalogs. In case of us updating the opendap catalogs, the old scripts will remain functional and in case of the users updating intake-xarray, that could remind them of the situation such that it may be easier to fix this properly as soon as a more permanent solution is available.

RobertPincus commented 4 years ago

@d70-t @leifdenby Very few people from the ATOMIC hackathon will use the intake catalog so don't feel pressure to merge on that account.

d70-t commented 4 years ago

I've diverted this PR to the netcdf_driver branch (which is based on Leif's changes) for now. I assume that github will pick this up later and will run the tests on the P3 datasets in an environment which provides a fair chance of succeeding.

I've another comment: the PSL server is redirecting all HTTP requests to HTTPS, so I'd suggest to directly use https:// for all URLs in stead.

RobertPincus commented 4 years ago

@leifdenby @d70-t Wow, now the CI is in trouble from the very start. I don't know how to interpret this, I'm sorry.

d70-t commented 4 years ago

@RobertPincus this is actually not to bad... first it tells you that there is only one failing test. This is because it can't find a sub catalog. That is the WSRA, which is still spelled inconsistently as mentioned above.