Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Incomplete Data Reading from URL #836

Open sgdecker opened 6 years ago

sgdecker commented 6 years ago

Example Code

import matplotlib.pyplot as plt
import numpy as np
import netCDF4

url = ('http://thredds-test.unidata.ucar.edu/thredds/dodsC/'
       'satellite/goes16/GRB16/ABI/CONUS/Channel02/current/'
       'OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc')

# Open the remote dataset over OPeNDAP
ds = netCDF4.Dataset(url)

# Read the full radiance field and count how many pixels hold the value
# that shows up where data appears to be missing
values = ds.variables['Rad'][:,:]
print(np.count_nonzero(values==-20.289911))

plt.imshow(values)
plt.show()

Problem

I am not sure if this is an issue with netCDF4, the server, or my machine, but I am getting a value on the order of 40 million printing out, and the plot shows incomplete data (see the attached plot).

Each time I run the example code, I get slightly different results. The last three runs indicated 42159004, 38724464, and 44907954 missing pixels, respectively.

I would expect either a value of 0 printing out (and no missing data), or some sort of error message indicating there was trouble retrieving the data.

I am using netCDF4 1.4.0.

dopplershift commented 6 years ago

So I ran your code just now and got a complete image with no missing data. It's possible the data file on the server just wasn't complete. Was this the most recent file when you tried?

sgdecker commented 6 years ago

Originally, it was the most recent file, but I just reran the code and got 44037412 missing pixels.

When I try the code on a different machine (with 1.3.1), I see the same as you: No missing data.

I tried creating a new barebones conda environment (brings in netCDF4 1.4.0 by default) and I still got missing pixels. Then I tried an environment with netCDF4 1.3.1, since that worked on the other machine, and, lo and behold, there are no missing data!

Which version of netCDF4 are you running? I could try 1.4.1, which I see has been released (through a manual install, I guess), if that would be useful.

dopplershift commented 6 years ago

My results were with 1.4.1 from conda-forge on my mac.

dopplershift commented 6 years ago

Works for me with 1.4.0 on Python 3.6

sgdecker commented 6 years ago

OK, I just tried a conda-forge-based environment with 1.4.1, and I am getting missing pixels. Here is the output of conda list for this environment:

# packages in environment at /home/decker/local/anaconda3/envs/nctest3:
#
# Name                    Version                   Build  Channel
blas                      1.1                    openblas    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.8.13            ha4d7672_0    conda-forge
certifi                   2018.8.13                py36_0    conda-forge
cftime                    1.0.1            py36h7eb728f_0    conda-forge
curl                      7.61.0               h93b3f91_1    conda-forge
cycler                    0.10.0                     py_1    conda-forge
dbus                      1.13.0               h3a4f0e9_0    conda-forge
expat                     2.2.5                hfc679d8_1    conda-forge
fontconfig                2.13.0               h65d0f4c_5    conda-forge
freetype                  2.9.1                h6debe1e_1    conda-forge
gettext                   0.19.8.1                      0    conda-forge
glib                      2.55.0               h464dc38_2    conda-forge
gst-plugins-base          1.12.5               hde13a9d_0    conda-forge
gstreamer                 1.12.5               h61a6719_0    conda-forge
hdf4                      4.2.13                        0    conda-forge
hdf5                      1.10.2               hc401514_1    conda-forge
icu                       58.2                 hfc679d8_0    conda-forge
jpeg                      9c                   h470a237_0    conda-forge
kiwisolver                1.0.1            py36h2d50403_2    conda-forge
krb5                      1.14.6                        0    conda-forge
libffi                    3.2.1                hfc679d8_4    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libgfortran               3.0.0                         1    conda-forge
libiconv                  1.15                 h470a237_2    conda-forge
libnetcdf                 4.6.1                he6cff42_7    conda-forge
libpng                    1.6.35               ha92aebf_0    conda-forge
libssh2                   1.8.0                h5b517e9_2    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
libuuid                   2.32.1               h470a237_0    conda-forge
libxcb                    1.13                 h470a237_2    conda-forge
libxml2                   2.9.8                h422b904_3    conda-forge
matplotlib                2.2.3            py36h8e2386c_0    conda-forge
ncurses                   6.1                  hfc679d8_1    conda-forge
netcdf4                   1.4.1            py36h62672b6_0    conda-forge
numpy                     1.15.1          py36_blas_openblashd3ea46f_0  [blas_openblas]  conda-forge
openblas                  0.2.20                        8    conda-forge
openssl                   1.0.2o               h470a237_1    conda-forge
pcre                      8.41                 h470a237_2    conda-forge
pip                       18.0                     py36_1    conda-forge
pthread-stubs             0.4                  h470a237_1    conda-forge
pyparsing                 2.2.0                      py_1    conda-forge
pyqt                      5.6.0            py36h8210e8a_6    conda-forge
python                    3.6.6                h5001a0f_0    conda-forge
python-dateutil           2.7.3                      py_0    conda-forge
pytz                      2018.5                     py_0    conda-forge
qt                        5.6.2                hf70d934_9    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
setuptools                40.0.0                   py36_1    conda-forge
sip                       4.18                     py36_1    conda-forge
six                       1.11.0                   py36_1    conda-forge
sqlite                    3.24.0               h2f33b56_0    conda-forge
tk                        8.6.8                         0    conda-forge
tornado                   5.1              py36h470a237_1    conda-forge
wheel                     0.31.1                   py36_1    conda-forge
xorg-libxau               1.0.8                h470a237_6    conda-forge
xorg-libxdmcp             1.1.2                h470a237_7    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge

The environment that works on my machine:

# packages in environment at /home/decker/local/anaconda3/envs/nctest2:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl  
bzip2                     1.0.6                h14c3975_5  
ca-certificates           2018.03.07                    0  
certifi                   2018.8.13                py36_0  
curl                      7.61.0               h84994c4_0  
cycler                    0.10.0                   py36_0  
dbus                      1.13.2               h714fa37_1  
expat                     2.2.5                he0dffb1_0  
fontconfig                2.13.0               h9420a91_0  
freetype                  2.9.1                h8a8886c_0  
glib                      2.56.1               h000015b_0  
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
hdf4                      4.2.13               h3ca952b_2  
hdf5                      1.10.1               h9caa474_1  
icu                       58.2                 h9c2bf20_1  
intel-openmp              2018.0.3                      0  
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.0.1            py36hf484d3e_0  
libcurl                   7.61.0               h1ad7b7a_0  
libedit                   3.1.20170329         h6b74fdf_2  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 8.2.0                hdf63c60_1  
libgfortran-ng            7.3.0                hdf63c60_0  
libnetcdf                 4.5.0                hc9a163a_7  
libpng                    1.6.34               hb9fc6fc_0  
libssh2                   1.8.0                h9cfc8f7_4  
libstdcxx-ng              8.2.0                hdf63c60_1  
libuuid                   1.0.3                h1bed415_2  
libxcb                    1.13                 h1bed415_1  
libxml2                   2.9.8                h26e45fe_1  
matplotlib                2.2.3            py36hb69df0a_0  
mkl                       2018.0.3                      1  
mkl_fft                   1.0.4            py36h4414c95_1  
mkl_random                1.0.1            py36h4414c95_1  
ncurses                   6.1                  hf484d3e_0  
netcdf4                   1.3.1            py36h73d701b_2  
numpy                     1.15.0           py36h1b885b7_0  
numpy-base                1.15.0           py36h3dfced4_0  
openssl                   1.0.2p               h14c3975_0  
pcre                      8.42                 h439df22_0  
pip                       10.0.1                   py36_0  
pyparsing                 2.2.0                    py36_1  
pyqt                      5.9.2            py36h22d08a2_0  
python                    3.6.6                hc3d631a_0  
python-dateutil           2.7.3                    py36_0  
pytz                      2018.5                   py36_0  
qt                        5.9.6                h52aff34_0  
readline                  7.0                  ha6073c6_4  
setuptools                40.0.0                   py36_0  
sip                       4.19.8           py36hf484d3e_0  
six                       1.11.0                   py36_1  
sqlite                    3.24.0               h84994c4_0  
tk                        8.6.7                hc745277_3  
tornado                   5.1              py36h14c3975_0  
wheel                     0.31.1                   py36_0  
xz                        5.2.4                h14c3975_4  
zlib                      1.2.11               ha838bed_2  

sgdecker commented 6 years ago

I don't have a Mac handy, but trying a Windows 10 machine (the other tests were on CentOS 7 machines) makes matters even more confusing: I have missing pixels with both 1.4.1 and 1.3.1.

sgdecker commented 6 years ago

As a final test for now, I took the same Windows 10 machine home, so it's on my home network rather than the campus network, and there was no change in the results.

dopplershift commented 6 years ago

So we have all this data...and I have no clue what conclusion to draw from it. 😆

sgdecker commented 6 years ago

My conclusion: Use a Mac!

Interestingly, I found a different conda environment on CentOS 7 that has missing pixels even with 1.3.1, but it has libnetcdf 4.6.1 (in common with my other broken environment). So my current working hypothesis is that the Python module isn't the issue; libnetcdf is. 4.5.0 is OK (on CentOS anyway), but 4.6.1 is not.
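For anyone else narrowing this down, a quick way to confirm which libnetcdf a given environment's module is actually linked against (a minimal sketch using module-level version attributes exposed by netCDF4-python):

import netCDF4

print('netCDF4-python:', netCDF4.__version__)            # version of the Python wrapper
print('libnetcdf:', netCDF4.__netcdf4libversion__)       # version of the underlying netCDF C library
print('hdf5:', netCDF4.__hdf5libversion__)               # version of HDF5 it was built against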

sgdecker commented 6 years ago

The combination netcdf4 1.3.1 and libnetcdf 4.5.0 also works on Windows 10. It appears it is libnetcdf 4.6.1 that is at fault.

dopplershift commented 6 years ago

Unfortunately for that theory (which might not be completely wrong), 4.6.1 is working fine for me here.

What happens if instead of ds.variables['Rad'][:,:] you plot ds.variables['Rad'][::4, ::4]? That will download 1/16 of the data.
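For reference, that tweak applied to the original example is just one changed read (a sketch):

# Strided read: every 4th row and column, i.e. 1/16 of the pixels
values = ds.variables['Rad'][::4, ::4]
print(np.count_nonzero(values == -20.289911))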

sgdecker commented 6 years ago

OK, I made that tweak. Running the revised program on CentOS with 1.4.1/4.6.1 results in a hang. Even Control-C didn't give me a prompt back, but using kill did.

sgdecker commented 6 years ago

I ran the program a second time, and after 14 minutes it finally crashed with:

syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!DOCTYPE^ HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><html><head><title>503 Service Unavailable</title></head><body><h1>Service Unavailable</h1><p>The server is temporarily unable to service yourrequest due to maintenance downtime or capacityproblems. Please try again later.</p></body></html>
Traceback (most recent call last):
  File "sat_data_missing_2.py", line 11, in <module>
    values = ds.variables['Rad'][::4,::4]
  File "netCDF4/_netCDF4.pyx", line 4105, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 5016, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1745, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: DAP server error

sgdecker commented 6 years ago

However, I see the server is indeed down, so the above may be a red herring.

sgdecker commented 6 years ago

Now that the server is back up, I can report the program hangs on Windows 10 with 1.4.1/4.6.1 as well.

ocefpaf commented 6 years ago

@sgdecker I just read this issue quickly, and it seems that the issues are all on Windows, right? I'll try to get a Windows machine and test it.

Note that netcdf 4.6.1 does not build properly on Windows* so we had to add tons of patches and they may be the cause of the problems you are experiencing.

* We had issues building it for Python 2.7, backporting patches that fixed OPeNDAP problems, etc.

sgdecker commented 6 years ago

@ocefpaf no, I am also having the issue on CentOS, which exhibits the same behavior for me as Windows 10: 1.3.1/4.5.0 is fine, but 1.3.1/4.6.1 and 1.4.1/4.6.1 are not.

rsignell-usgs commented 6 years ago

I think this might help us narrow this problem down.

I have two different linux machines, each with identical versions of netcdf4, libnetcdf and curl from conda-forge:

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
libnetcdf                 4.6.1                h5e45101_3    conda-forge
netcdf4                   1.4.0            py36h62672b6_1    conda-forge
curl                      7.61.0               h93b3f91_1    conda-forge

One works fine, the other displays the problem! https://gist.github.com/rsignell-usgs/f3e4452b0bd7445178291e0551aa7bea

@DennisHeimbigner does netcdf4 access of OPeNDAP URLs depend on any other packages or could there be some difference caused by internet connectivity/timeout?

I tried setting HTTP.TIMEOUT=25 in both ~/.daprc and ~/.dodsrc per this conversation but it didn't seem to have any impact.

sgdecker commented 6 years ago

I just got my hands on a Mac, and, contrary to @dopplershift's experience, I am seeing the same behavior (with the original test program) as with the other machines I've tried: 1.4.1/4.6.1 has missing data, but 1.3.1/4.5.0 is fine.

dopplershift commented 6 years ago

This makes me think some kind of network issue is involved--I'm cheating because I'm sitting next to the data. 😁

I will say that I do see a problem with the slicing I suggested. It's not technically a hang; it just took > 45 minutes to error out with an IndexError. Which is nuts, because with the plain slices it took < 45 minutes and completed without error (or missing data) for me. I'm pretty sure that's not how server-side subsetting is supposed to work. 😳

Addendum: Oh W.T.F. 😡 (cc @lesserwhirls @jrleeman )

128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7904] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7908] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7912] HTTP/1.1" 200 220 "-" "oc4.6.1" 4
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7916] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7920] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7924] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7928] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7932] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7936] HTTP/1.1" 200 220 "-" "oc4.6.1" 4
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7940] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7944] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7948] HTTP/1.1" 200 220 "-" "oc4.6.1" 4
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7952] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7956] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7960] HTTP/1.1" 200 220 "-" "oc4.6.1" 5
165.230.171.64 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[192][4700] HTTP/1.1" 200 220 "-" "oc4.5.0" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7964] HTTP/1.1" 200 220 "-" "oc4.6.1" 3
128.117.144.149 - - [23/Aug/2018:13:11:52 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad[628][7968] HTTP/1.1" 200 220 "-" "oc4.6.1" 10

The client is requesting individual grid points, so it's making a full round-trip request to download... 3... bytes. 😱 That seems...like something that should be improved.

ocefpaf commented 6 years ago

This makes me think some kind of network issue is involved

I just tested many versions of the conda package and pip, and I could not find a pattern other than this: when I get the whole data it takes a while, but when I get the incomplete version it is very fast. So yeah, it may be network related, but why some of the conda versions are "OK" returning zeroes instead of continuing to wait for the data puzzles me. Maybe something changed in the latest curl; I'll keep digging...

DennisHeimbigner commented 6 years ago

Is someone in a position to run a packet sniffer on this, in particular to look for a dropped connection? This may be related to https://github.com/Unidata/netcdf-c/issues/1074

dopplershift commented 6 years ago

If @sgdecker (or anyone seeing the missing data) can send me their IP address (here or email), I can look at the server logs and see if I see anything.

sgdecker commented 6 years ago

I should be able to run tcpdump on my machine, but I've never used it before, so I don't know how exactly I should invoke it to get the data that would be useful. If that is what you have in mind, and there is a recipe I could follow, I can try it out.

sgdecker commented 6 years ago

My CentOS machine is 165.230.171.64

DennisHeimbigner commented 6 years ago

Sorry, I have not used a packet sniffer in a long time, so I can't help.

dopplershift commented 6 years ago

Opened #838 for the issue about the strided access taking forever.

dopplershift commented 6 years ago

So I've been able to reproduce on my system, without changing my environment, by running from home (even though I've got a 1Gb/s connection). Here's a log of my OPeNDAP download attempts:

8.42.77.33 - - [23/Aug/2018:17:39:00 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 119494799 "-" "oc4.6.1" 10010
8.42.77.33 - - [23/Aug/2018:17:43:44 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.5.0" 13751
8.42.77.33 - - [23/Aug/2018:17:47:28 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 232728623 "-" "oc4.6.1" 10016
8.42.77.33 - - [23/Aug/2018:17:51:18 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.5.0" 9032
8.42.77.33 - - [23/Aug/2018:17:51:39 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.5.0" 8587
8.42.77.33 - - [23/Aug/2018:17:55:53 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.6.1" 8549
8.42.77.33 - - [23/Aug/2018:17:56:13 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 193928279 "-" "oc4.6.1" 10005
8.42.77.33 - - [23/Aug/2018:18:00:16 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 233694335 "-" "oc4.6.1" 10012

The interesting thing to glean is that the request should return 240000223 bytes. For netcdf 4.5.0, this amount of data is always returned. For 4.6.1, the full amount only seems to come back when the connection time is under roughly 10000 (ms?). This time effect is even more prevalent in @sgdecker's downloads:

165.230.171.64 - - [23/Aug/2018:09:54:51 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 54211031 "-" "oc4.6.1" 10000
165.230.171.64 - - [23/Aug/2018:09:55:44 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.5.0" 27760
165.230.171.64 - - [23/Aug/2018:10:09:41 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 240000223 "-" "oc4.5.0" 28281
165.230.171.64 - - [23/Aug/2018:10:29:06 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 84729167 "-" "oc4.6.1" 9999
165.230.171.64 - - [23/Aug/2018:13:39:18 -0600] "GET /thredds/dodsC/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/OR_ABI-L1b-RadC-M3C02_G16_s20182341637323_e20182341640096_c20182341640138.nc.dods?Rad.Rad HTTP/1.1" 200 61339295 "-" "oc4.6.1" 10000

Here we see that with 4.6.1 his connections always stop right around 10000, whereas with 4.5.0 they run past 20k and the correct amount of data is returned. On my machine, I can confirm 4.6.1 is problematic both from conda-forge and from Homebrew; 4.5.1 seems to have no problem. This is in a conda environment where the only thing changing is libnetcdf (no changes to netcdf4-python or libcurl).

@DennisHeimbigner did something about opendap/curl change in 4.6.1? (cc @WardF )

DennisHeimbigner commented 6 years ago

I do not think so, but I will have to review the pull requests. We recently saw an effect on URL escape characters when switching the Tomcat version. Is it possible that there is some new timeout? Are the Tomcat/Apache versions in these tests the same?

dopplershift commented 6 years ago

@DennisHeimbigner This is all using the same running TDS 5.0 instance on http://thredds-test.unidata.ucar.edu .

efiring commented 6 years ago

Maybe a related problem, maybe different, but the following (as a script):

from netCDF4 import Dataset
d = Dataset('http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/monthly_v2')
u = d.variables['uwnd'][:]

consistently fails for me with conda-forge libnetcdf 4.6.0 and 4.6.1, but not with 4.5.0-3. Output is:

(nc3) ~/Desktop $ python CCMP_failure.py
Traceback (most recent call last):
  File "CCMP_failure.py", line 3, in <module>
    u = d.variables['uwnd'][:]
  File "netCDF4/_netCDF4.pyx", line 4060, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 4950, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1733, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: DAP failure

This is on both Mac and Linux. With the URL above, the request works for some time slices, but not for others. With a related URL, I can't even get a single time slice; the following fails:

from netCDF4 import Dataset
d = Dataset('http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2')
print(d.variables['uwnd'].shape)
u = d.variables['uwnd'][0]

ocefpaf commented 6 years ago

@dopplershift at this point I'm inclined to revert all of conda-forge's builds to libnetcdf 4.5.0 :-(

It will be quite painful to re-build everything but at least users will get a working environment again. (Note that wheels are still building with 4.4.x, due to Python 2.7 on Windows.)

DennisHeimbigner commented 6 years ago

Can someone try this experiment? Modify the URL to look like this: "[log][show=fetch]http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2". This assumes that Python does not attempt to do anything with the URL other than pass it to nc_open.
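As a minimal sketch of that experiment applied to the earlier script (assuming the bracketed prefix is consumed by the netCDF-C DAP client and the rest of the URL is unchanged):

from netCDF4 import Dataset

# The [log][show=fetch] prefix turns on the DAP client's fetch logging
d = Dataset('[log][show=fetch]http://apdrc.soest.hawaii.edu:80/dods/'
            'public_data/satellite_product/CCMP/6hourly_v2')
print(d.variables['uwnd'].shape)
u = d.variables['uwnd'][0]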

efiring commented 6 years ago

(py36) ~/Desktop $ python CCMP2.py
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dds
Log:fetch complete: 0.008 secs
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.das
Log:fetch complete: 0.002 secs
Log:prefetchable: lat=628
Log:prefetchable: lon=1440
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dds
Log:fetch complete: 0.002 secs
(42212, 628, 1440)
Log:prefetch: lat
Log:prefetch: lon
Log:prefetch.final: lat*[0:627/628],lon*[0:1439/1440]
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dods?lat,lon
Log:fetch complete: 0.003 secs
Log:prefetch.complete
Log:prefetch.vars:  lat lon

Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dods?uwnd.uwnd[0][0:627][0:1439]
Log:curl error: Timeout was reached
Log:fetch complete: 10.004 secs
Traceback (most recent call last):
  File "CCMP2.py", line 6, in <module>
    u = d.variables['uwnd'][0]
  File "netCDF4/_netCDF4.pyx", line 4105, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 5016, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1745, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: DAP failure
(py36) ~/Desktop $

DennisHeimbigner commented 6 years ago

There may be two problems here.

  1. the curl timeout is set at 10 secs (that may be my fault).
  2. the server is taking as much as 10 sec to compute the response: is that normal?

Let me suggest another experiment. WRT #1, create a file in your home directory called .daprc (or .dodsrc) with this line in it: HTTP.TIMEOUT=time, where time is the timeout in seconds. Zero is supposed to mean never time out, but I would suggest using an actual large integer such as 1000.
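For example, a minimal ~/.daprc with the 1000-second value suggested above:

# ~/.daprc (or ~/.dodsrc)
HTTP.TIMEOUT=1000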

Meanwhile, I will see if I am the one setting the timeout to 10 seconds.

efiring commented 6 years ago

OK, now it looks like I am waiting for 1000 s to elapse.

DennisHeimbigner commented 6 years ago

That is odd. It implies that the server is taking a very long time to complete its response. I suppose one could set the HTTP.TIMEOUT value to something horrific like 100000 (seconds) to see if the server ever finishes its response.

In any case, I do set the default timeout to 10 seconds. Should I change that to zero so that the connection will never timeout?

DennisHeimbigner commented 6 years ago

On the other hand, the uwnd variable is 152692623360 bytes (~152 gig), so I guess it should take a long time to download.

DennisHeimbigner commented 6 years ago

As an aside, I have a fix in to allow setting the curl download buffer size (currently something like 16k by default?). It will allow setting it to its maximum of 512k. This will probably help some, but with such a large variable its effect may be limited.

efiring commented 6 years ago

But I was downloading only one time slice, so it should have been quick.

It took over 500 s. It looks like at least part of the problem here is server-side. I'm trying this url now with the earlier libnetcdf, and it is still taking a long time. I'm in contact with the people running the server, so we can look into that. Was the default timeout changed for 4.6?

(py36) ~/Desktop $ python CCMP2.py
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dds
Log:fetch complete: 0.045 secs
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.das
Log:fetch complete: 0.006 secs
Log:prefetchable: lat=628
Log:prefetchable: lon=1440
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dds
Log:fetch complete: 0.005 secs
(42212, 628, 1440)
Log:prefetch: lat
Log:prefetch: lon
Log:prefetch.final: lat*[0:627/628],lon*[0:1439/1440]
Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dods?lat,lon
Log:fetch complete: 0.012 secs
Log:prefetch.complete
Log:prefetch.vars:  lat lon

Log:fetch: http://apdrc.soest.hawaii.edu:80/dods/public_data/satellite_product/CCMP/6hourly_v2.dods?uwnd.uwnd[0][0:627][0:1439]
Log:fetch complete: 578.212 secs

efiring commented 6 years ago

As you can see above, the download was only 628x1440 floats, so unless they had to be collected from a huge number of files it should have been very fast. I would be surprised if a single time slice is not in a single file.

efiring commented 6 years ago

I'm beginning to think that the problem I originally reported might be an interaction between the timeout value and a badly-behaving (or at least oddly-behaving) server, since the test above took slightly longer (611 s) just now with libnetcdf 4.4.1.1.

dopplershift commented 6 years ago

@DennisHeimbigner I'm not sure about @efiring's server, since it looks like it's a straight DODS server, but the original problem was from our own TDS running on http://thredds.ucar.edu. I would say that 10 seconds is way too quick. If that's the time limit for the whole connection to finish, it's going to have to allow for download time. Speaking as someone on hotel wifi right now, we cannot assume that users have a performant network connection.

So if that means setting it to something large like 1800 (half an hour) or just disabling it altogether, I leave that up to the netCDF team or other interested parties to decide. But I can say that with netCDF 4.5, the connections were finishing in 15-25 seconds, so the new 10 seconds is (relative to real-world) ridiculously short.

DennisHeimbigner commented 6 years ago

I am going to set the default timeout to a much larger value. I dislike setting it to infinity. Probably something like 1/2 hour should cover essentially all cases.

The slow response time from the server is baffling. Some possible causes (all unlikely):

  1. the server is taking a long time to compute and assemble the slice ready for shipping.
  2. the http protocol is taking a long time to ship the data (maybe the http chunk size is small).
  3. there is some kind of proxy through which the server is sending data.

Looking at the server logs might tell you if #1 is the issue.

ocefpaf commented 6 years ago

@DennisHeimbigner while you fix this upstream I'm looking for a short term solution to the users of the packages we are building on conda-forge. My options are:

  1. rebuilding everything against netcdf 4.4.x, which is the last version that "just works" and we have packages on Windows and *nix;
  2. rebuild *nix with 4.5.x and Windows with 4.4.x so we don't break nco users that need 4.5.x features on *nix;
  3. rebuild 4.6.1 with a higher timeout value.

Both 1 and 2 are a lot of work and will take some time. So my question is: is there a way to do 3 with a flag or something? Or, if you point me to a PR when you address it, I can try to backport it.

DennisHeimbigner commented 6 years ago

You can also use .daprc to set the timeout as I indicated in a previous comment. If you want to modify the code to use a longer timeout, then do this: in libdispatch/dauth.c, about line 33, change "HTTP.TIMEOUT","10", /*seconds */ to "HTTP.TIMEOUT","1800", /*seconds */. That will give a 30-minute timeout and is the value I plan to change it to (unless you have a preferable value). If you change the value to zero, then the timeout should be infinite, but IMO that is not a good idea. Let me know what you choose.

ocefpaf commented 6 years ago

You also can use .daprc to set the timeout as I indicated in a previous comment.

My guess is that a .daprc is something for the user space, no?

If you want to modify the code to use a longer timeout, then do this: In libdispatch/dauth.c, about line 33, change "HTTP.TIMEOUT","10", /*seconds */ to "HTTP.TIMEOUT","1800", /*seconds */

I'll give that a try. Thanks.

DennisHeimbigner commented 6 years ago

My guess is that a .daprc is something for the user space, no?

Yes, each user would have to set this up.

ocefpaf commented 6 years ago

The dataset mentioned in https://github.com/Unidata/netcdf4-python/issues/836#issue-353050569 seems to be gone, but I tested the longer-timeout PR with @efiring's example and, after a long wait, I get the correct data:

Here are the results for the same test under other versions; sometimes I get all zeroes, sometimes I get the proper data:

marekmarcos commented 4 years ago

I am looking for a script to extract the variable T3P1_IntegralProtonFlux from the files and create a CSV or SQLite file from them. The files are here:

https://data.ngdc.noaa.gov/platforms/solar-space-observing-satellites/goes/goes16/l1b/seis-l1b-sgps/2020/01/

Ultimately, I want to combine data from files from 1/1/2019 to today.
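A minimal sketch of one way to start on that with this library (assuming the .nc files have already been downloaded locally; the variable name comes from the comment above, but the glob pattern and the dimension layout of the SGPS files are assumptions, so adjust the flattening and add the matching time coordinate for the real files):

import csv
import glob

import numpy as np
import netCDF4

rows = []
# Hypothetical: all downloaded SGPS files sitting in the current directory
for path in sorted(glob.glob('*.nc')):
    with netCDF4.Dataset(path) as ds:
        flux = ds.variables['T3P1_IntegralProtonFlux'][:]
        # Replace masked/fill values with NaN so they are easy to spot in the CSV
        flux = np.ma.filled(flux, np.nan)
        for value in np.ravel(flux):
            rows.append((path, float(value)))

with open('proton_flux.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['source_file', 'T3P1_IntegralProtonFlux'])
    writer.writerows(rows)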