Unidata / thredds

THREDDS Data Server v4.6
https://www.unidata.ucar.edu/software/tds/v4.6/index.html

HDF5 DataBTree Issue #518

Closed dblodgett-usgs closed 8 years ago

dblodgett-usgs commented 8 years ago

This issue supersedes #501.

I have found that I can reliably reproduce the stack trace below with the following workflow:

  1. Request data from an ncml union aggregation [1] of OPeNDAP source data and a local netcdf file.
  2. Request data from a data source that has 2-D coordinate axes. [2], [3].

After some period of time (tens of minutes), this issue goes away for 2-D coordinate axis datasets that have not been accessed while producing the error. If the offending aggregation is accessed and then the working dataset is accessed, it breaks like the others. That is, the DataBTree error is being cached in some way.

[1] http://cida-test.er.usgs.gov/thredds/dodsC/stageiv/ncml/stageivRT.ncml.html (Grib collection: http://cida-test.er.usgs.gov/thredds/catalog/stageiv/01hRT/catalog.html ; the .ncml union can be downloaded here: http://cida-test.er.usgs.gov/thredds/catalog/stageiv/ncml/catalog.html )
[2] http://cida-test.er.usgs.gov/thredds/dodsC/UpperDeschutes.html (if the data behind this is needed, it is visible here: http://cida-test.er.usgs.gov/thredds/catalog/stageiv/symlinked_files/deschutes/catalog.html )
[3] http://cida-test.er.usgs.gov/thredds/dodsC/stageiv/symlinked_files/stageiv_archive.nc.html (this is a single NetCDF4 file, here: http://cida-test.er.usgs.gov/thredds/catalog/stageiv/symlinked_files/catalog.html?dataset=examples/symlinked_files/stageiv_archive.nc )

2016-04-06T13:35:44.315 -0500 [    408330][      44] ERROR - thredds.server.opendap.OpendapServlet - path2= /ncep_stageiv.ascii
java.lang.NullPointerException
    at ucar.unidata.io.RandomAccessFile.read_(RandomAccessFile.java:760) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.unidata.io.RandomAccessFile.readBuffer(RandomAccessFile.java:520) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.unidata.io.RandomAccessFile.seek(RandomAccessFile.java:508) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.<init>(DataBTree.java:170) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.DataBTree$DataChunkIterator.<init>(DataBTree.java:135) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.DataBTree.getDataChunkIteratorFilter(DataBTree.java:61) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.H5tiledLayoutBB.<init>(H5tiledLayoutBB.java:110) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:162) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:141) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2009) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.Variable.reallyRead(Variable.java:913) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.Variable._read(Variable.java:898) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.Variable.read(Variable.java:709) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:557) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.dataset.VariableDS._read(VariableDS.java:537) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.Variable.read(Variable.java:709) ~[cdm-4.6.5.jar:4.6.5]
    at ucar.nc2.Variable.read(Variable.java:683) ~[cdm-4.6.5.jar:4.6.5]
    at thredds.server.opendap.NcSDArray.read(NcSDArray.java:115) ~[NcSDArray.class:?]
    at opendap.servlet.AsciiWriter.writeAsc(AsciiWriter.java:107) ~[opendap-4.6.5.jar:4.6.5]
    at opendap.servlet.AsciiWriter.toASCII(AsciiWriter.java:68) ~[opendap-4.6.5.jar:4.6.5]
    at thredds.server.opendap.OpendapServlet.doGetASC(OpendapServlet.java:315) ~[OpendapServlet.class:?]
    at thredds.server.opendap.OpendapServlet.doGet(OpendapServlet.java:220) [OpendapServlet.class:?]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) [servlet-api.jar:?]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:728) [servlet-api.jar:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at thredds.servlet.filter.CookieFilter.doFilter(CookieFilter.java:55) [CookieFilter.class:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at thredds.servlet.filter.RequestQueryFilter.doFilter(RequestQueryFilter.java:118) [RequestQueryFilter.class:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at thredds.servlet.filter.RequestCORSFilter.doFilterInternal(RequestCORSFilter.java:49) [RequestCORSFilter.class:?]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) [spring-web-4.1.6.RELEASE.jar:4.1.6.RELEASE]
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:344) [spring-web-4.1.6.RELEASE.jar:4.1.6.RELEASE]
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:261) [spring-web-4.1.6.RELEASE.jar:4.1.6.RELEASE]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at thredds.servlet.filter.RequestPathFilter.doFilter(RequestPathFilter.java:94) [RequestPathFilter.class:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at thredds.server.RequestBracketingLogMessageFilter.doFilter(RequestBracketingLogMessageFilter.java:81) [RequestBracketingLogMessageFilter.class:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) [catalina.jar:7.0.39]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) [catalina.jar:7.0.39]
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222) [catalina.jar:7.0.39]
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123) [catalina.jar:7.0.39]
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) [catalina.jar:7.0.39]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171) [catalina.jar:7.0.39]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99) [catalina.jar:7.0.39]
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:947) [catalina.jar:7.0.39]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) [catalina.jar:7.0.39]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408) [catalina.jar:7.0.39]
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1009) [tomcat-coyote.jar:7.0.39]
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589) [tomcat-coyote.jar:7.0.39]
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310) [tomcat-coyote.jar:7.0.39]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
dblodgett-usgs commented 8 years ago

For the record, I've modified how I'm doing this workflow so that I no longer hit the feature collection in a joinExisting aggregation. My new approach uses the NetCDF Subset Service against the GRIB feature collection to create real-time-updated NetCDF files for the real-time content. I then use NcML to strip out the bad x/y coordinates and join the real-time files to the archive content. So far, this is not causing the problems we were seeing.

cwardgar commented 8 years ago

@dblodgett-usgs I got a chance to work on this now and would like to recreate the issue, first on 4.6.4, then possibly on 4.6.3 and 4.6.5. Can you provide all of the catalogs, ncml, data, etc? Unfortunately, the links you gave above are now broken.

dblodgett-usgs commented 8 years ago

OK, @cwardgar - I changed our QA server to version 4.6.5 and put back together all those links from when I was testing; I had blown them all away in my haste to get things figured out. I am able to reproduce the issue with the series of requests I describe... no idea if there are other ways to reproduce. I would guess the OPeNDAP + local netcdf file is a red herring, but I am tapped out on trying to test this stuff.

Thanks for looking into it!!

cwardgar commented 8 years ago

I'm confused: so you no longer have the original catalogs, ncml, and data that you used in the above workflow?

cwardgar commented 8 years ago

Oh, I see now that you edited the OP 22 hours ago. Most of the links are 404s or blank pages, unfortunately.

dblodgett-usgs commented 8 years ago

Sorry, I inadvertently overwrote the datasetScan when I deployed some other changes. I will put it back in place later tonight.


dblodgett-usgs commented 8 years ago

I put the datasetScan back in place. Really sorry for the run around. The links should all work now.

cwardgar commented 8 years ago

I'm having trouble reproducing this one as well. Here's my setup:

[cwardgar@lenny ~/dev/data/gh-518]
$ tree
.
├── ncep
│   ├── ST4.2016041921.01h
│   ├── ST4.2016041921.01h.gbx9
│   ├── ST4.2016041922.01h
│   ├── ST4.2016041922.01h.gbx9
│   ├── ST4.2016041923.01h
│   ├── ST4.2016041923.01h.gbx9
│   └── stageiv_01hRT.ncx3
├── ncml
│   ├── lat_lon.nc
│   ├── scan
│   │   ├── stageiv_archive.ncml
│   │   └── stageiv_weekly.ncml
│   ├── stageivRT.ncml
│   └── stageiv_combined.ncml
└── symlinked_files
    ├── deschutes
    │   └── UpperDeschutes_t4_snowfall.nc
    └── stageiv_archive.nc

5 directories, 14 files

stageiv.xml catalog:

<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:xlink="http://www.w3.org/1999/xlink" name="CIDA THREDDS Server.">
    <service name="all" base="" serviceType="compound">
        <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/"/>
        <service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/"/>
        <service name="HTTPServer" serviceType="HTTPServer" base="/thredds/fileServer/"/>
    </service>
    <dataset name="National Stage IV Quantitative Precipitation Estimate Mosaic Real Time" ID="cida.usgs.gov/thredds/stageiv_01hRT/">
        <featureCollection featureType="GRIB1" name="stageiv_01hRT" path="stageiv/01hRT">
            <metadata inherited="true">
                <serviceName>all</serviceName>
                <dataType>Grid</dataType>
            </metadata>
            <!-- <collection name="stageiv_01hRT" spec="/opt/tomcat/data/qpe/ncep/ST4.*01h$"/> -->
            <collection name="stageiv_01hRT" spec="/Users/cwardgar/dev/data/gh-518/ncep/ST4.*01h$"/>
            <update startup="test" rescan="0 0/5 * * * ? *"/>
            <gribConfig datasetTypes="Best"/>
        </featureCollection>
        <dataset name="all" ID="cida.usgs.gov/thredds/stageiv_combined/" serviceName="all" urlPath="stageiv/stageiv_combined">
            <!-- <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="/opt/tomcat/data/qpe/ncml/stageiv_combined.ncml"/> -->
            <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="/Users/cwardgar/dev/data/gh-518/ncml/stageiv_combined.ncml"/>
        </dataset>
    </dataset>
    <!-- <datasetScan name="example" ID="example" path="stageiv" location="/opt/tomcat/data/qpe/"> -->
    <datasetScan name="example" ID="example" path="stageiv" location="/Users/cwardgar/dev/data/gh-518/">
        <metadata inherited="true">
            <serviceName>all</serviceName>
        </metadata>
    </datasetScan>
</catalog>

After starting the server, these are my requests, in order:

http://localhost:8080/thredds/dodsC/stageiv/ncml/stageivRT.ncml.ascii?lon[0:10:1120][0:10:880]
http://localhost:8080/thredds/dodsC/stageiv/symlinked_files/deschutes/UpperDeschutes_t4_snowfall.nc.ascii?lon[0:1:0][0:1:0]
http://localhost:8080/thredds/dodsC/stageiv/symlinked_files/stageiv_archive.nc.ascii?lon[0:10:1120][0:10:880]

All requests succeed for me. Am I missing something?

One big difference between your setup and mine is that I don't have a comparable endpoint for http://cida-test.er.usgs.gov/thredds/dodsC/UpperDeschutes.html. Is that an aggregation of the files in this directory? Can you provide the NcML and/or catalog entries for that? It could be the problem.

dblodgett-usgs commented 8 years ago

OK, I bet it has to be at scale... let me see what I can work up that isn't too onerous for you to get set up.

The UpperDeschutes is just a union of those files. I'll get back with you.

dblodgett-usgs commented 8 years ago

I am also struggling to reproduce this with any regularity now. It seems to be due to server load and scale. @jiwalker-usgs and I will try to track down exactly what is being overloaded to cause the issue, but reproducing it locally may be difficult. We'll do what we can.

dblodgett-usgs commented 8 years ago

Alright, @jiwalker-usgs is reproducing this on a third environment, but here's the procedure.

Data artifacts:

1) There is a zipped package of data here: ftp://ftpext.usgs.gov/pub/er/wi/middleton/dblodgett/rfc_qpe_example/data.zip
2) The catalog.xml I used is here: ftp://ftpext.usgs.gov/pub/er/wi/middleton/dblodgett/rfc_qpe_example/catalog.xml (note that you'll need to change the paths in the catalog)
3) There are two scripts, here: ftp://ftpext.usgs.gov/pub/er/wi/middleton/dblodgett/rfc_qpe_example/hammer_rfc.sh and here: ftp://ftpext.usgs.gov/pub/er/wi/middleton/dblodgett/rfc_qpe_example/hammer_coop.sh, that I use to load up the server.

Once you have the catalog set up and the data wired up correctly, you'll have three service URLs:

1) http://localhost:8080/thredds/dodsC/UpperDeschutes_t4p10_swemelt.nc
2) http://localhost:8080/thredds/dodsC/stageiv_archive.nc
3) http://localhost:8080/thredds/dodsC/stageiv_combined.ncml

To reproduce the DataBTree issue:

1) Kick off hammer_coop.sh. It will run in the background for a minute or two.
2) While that's running, wget or otherwise download the coordinate variables of the second service: http://localhost:8080/thredds/dodsC/stageiv_archive.nc.dods?time[0:1:999],lon[0:1:1120][0:1:880],lat[0:1:1120][0:1:880]
3) Immediately after making that request, kick off hammer_rfc.sh, running it at least twice concurrently.

Watching logs, during this procedure's climax, I see various errors all seemingly related to the HDF5 IOSP.

jiwalker-usgs commented 8 years ago

I'm going to try to do a bit of digging to see what pieces are interacting here to get to this error. @cwardgar I'll let you know what I find.

cwardgar commented 8 years ago

Good news: I reproduced the problem! I updated the shell scripts to make it even easier to demonstrate.

barf.sh:

#!/bin/bash
./hammer_coop.sh &
echo "Started hammer_coop.sh in the background (PID = $!)."

echo "Sleeping for 15 seconds."
sleep 15

echo "Downloading coordinate variables."
wget -q http://localhost:8080/thredds/dodsC/stageiv_archive.nc.dods?time[0:1:999],lon[0:1:1120][0:1:880],lat[0:1:1120][0:1:880] -O temp2.ascii

./hammer_rfc.sh &
echo "Started hammer_rfc.sh in the background (PID = $!)."

./hammer_rfc.sh &
echo "Started hammer_rfc.sh in the background (PID = $!)."

wait  # Wait for background processes to finish.

hammer_coop.sh:

#!/bin/bash
urlBase=http://localhost:8080/thredds/dodsC/UpperDeschutes_t4p10_swemelt.nc
var=.ascii?UpperDeschutes_t4p10_swemelt

for i in $(seq 8394); do
  # See http://stackoverflow.com/a/24520924/3874643
  if ! kill -0 $PPID 2>/dev/null; then
    echo "[hammer_coop.sh]: Parent process (PPID = $PPID) no longer running. Aborting."
    exit 1
  fi

  wget -q $urlBase$var[$i:1:$i][150:1:155][150:1:155] -O temp1.ascii
  status=$?

  if [ $status -ne 0 ] ; then
    echo "[hammer_coop.sh]: wget returned $status. Aborting."
    break
  fi
done

hammer_rfc.sh:

#!/bin/bash
urlBase=http://localhost:8080/thredds/dodsC/stageiv_archive.nc
var=.ascii?Total_precipitation_surface_1_Hour_Accumulation

for i in $(seq 999); do
  # See http://stackoverflow.com/a/24520924/3874643
  if ! kill -0 $PPID 2>/dev/null; then
    echo "[hammer_rfc.sh]: Parent process (PPID = $PPID) no longer running. Aborting."
    exit 1
  fi

  wget -q $urlBase$var[$i:1:$i][500:1:550][700:1:750] -O temp3.ascii
  status=$?

  if [ $status -ne 0 ] ; then
    echo "[hammer_rfc.sh]: wget returned $status. Aborting."
    break
  fi
done
[cwardgar@lenny ~/dev/data/gh-518]
$ ./barf.sh 
Started hammer_coop.sh in the background (PID = 13039).
Sleeping for 15 seconds.
Downloading coordinate variables.
Started hammer_rfc.sh in the background (PID = 16108).
Started hammer_rfc.sh in the background (PID = 16109).
[hammer_rfc.sh]: wget returned 8. Aborting.
[hammer_rfc.sh]: wget returned 8. Aborting.
[hammer_coop.sh]: wget returned 8. Aborting.

I'm hardly a bash scripting expert, but I managed to get the child scripts to abort when there's a server error and also when the parent process has been terminated (like with Ctrl+C).

I'm getting several different errors:

java.lang.IllegalStateException: java.util.zip.ZipException: incorrect header check
    at ucar.nc2.iosp.LayoutBBTiled.hasNext(LayoutBBTiled.java:134)
        ...
java.lang.AssertionError: shape[2] (385) >= pt[2] (390)
    at ucar.nc2.iosp.hdf5.Tiling.tile(Tiling.java:106)
        ...

And others. Interestingly, I haven't seen the exception in the OP. In any event, I strongly suspect H5iosp. I want to see what happens when I bypass it.

cwardgar commented 8 years ago

I have good news! Well, kind of.

I changed THREDDS to open NetCDF-4 files using ucar.nc2.jni.netcdf.Nc4Iosp (which delegates to the native C library) instead of ucar.nc2.iosp.hdf5.H5iosp (which is pure Java). After doing so, my barf.sh script completed without error! So, as I expected, the problem is somewhere in the HDF5 Java layer.
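
For reference, here is a minimal standalone sketch of the kind of switch described above. This is my illustration, not the actual TDS change; it assumes the netCDF-4 C library is installed, that Nc4Iosp can be registered this way, and the library path/name and file name below are placeholders.

import ucar.nc2.NetcdfFile;
import ucar.nc2.jni.netcdf.Nc4Iosp;

public class ReadViaNativeLib {
  public static void main(String[] args) throws Exception {
    // Tell JNA where to find the netCDF C library (placeholder path and name).
    Nc4Iosp.setLibraryAndPath("/usr/local/lib", "netcdf");

    // Register Nc4Iosp so it is tried before the built-in pure-Java H5iosp.
    NetcdfFile.registerIOProvider(Nc4Iosp.class);

    // Open a NetCDF-4 file; it should now be serviced by the JNI-backed IOSP.
    NetcdfFile ncfile = NetcdfFile.open("stageiv_archive.nc");
    System.out.println("file type: " + ncfile.getFileTypeId());  // quick sanity check of who handled the open
    ncfile.close();
  }
}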

We've experienced several issues with HDF5 lately, and they're proving very difficult to fix. The implementation is already massively complex (just try to wrap your head around H5header) and the HDF Group keeps moving the goal posts on us as well. John Caron made a valiant attempt at reimplementation, but I think it's time we gave up on the pure-Java dream.

So, in the short term (i.e. v4.6.6), my plan is to add a configuration option to THREDDS that allows users to read NetCDF-4 files using the native library. In the long term (i.e. v5.0.0), that will be the only option. Yes, it places an additional burden on users to install the library, but I think it's the only sensible choice we have for maintainability.

@lesserwhirls, @DennisHeimbigner, @ethanrd, @dopplershift, @JohnLCaron: Any thoughts?

dopplershift commented 8 years ago

Well, one question: where would that leave HDF5 support in netcdf-java? Does that also call for a move to the HDF Group's JNI wrappers for HDF5?

JohnLCaron commented 8 years ago

Are we talking about HDF 1.10 with all the problems?

If there are 1.8 problem files, could you keep them all some place and I will fix them at some point.

Keeping the java-only option for versions < 1.10 is probably worth doing. The cost of compiling HDF5 is non-trivial.

There's something bad about the tyranny of a single library for a data format. For a long time, one could only read HDF-EOS files with an unwieldy library that some defense contractor wanted $$$$$ to maintain. That locked up those files for casual users. I think it could happen again with HDF5.


lesserwhirls commented 8 years ago

Hi @JohnLCaron - the DataBTree problem is with 1.8.x. The IOSP will not even think about opening a 1.10 HDF5 file, as the default Superblock version was bumped and the IOSP does not know how to read those. I have not looked into what the differences are between Superblock 2 and 3 - maybe it's trivial?

I keep going back and forth in my mind about the HDF5 IOSP, and whether to keep it pure Java or go with calling into the C code at some level (whether we make the calls directly, or use the hdf5-java API). Apparently, with 1.10, you can load custom filters into the C library quite easily. If someone does this (and they have), then they will also need to implement the filter in Java for it to be used in the pure-Java IOSP. Calling into the C library would make life easier for those using custom filters, but is that a good enough reason to make the jump to calling C for HDF5 >= 1.10?

Perhaps this would be better discussed in a separate issue.

cwardgar commented 8 years ago

@dopplershift

where would that leave HDF5 support in netcdf-java?

Good point; H5iosp is also responsible for reading generic HDF5 files.

Does that also call for a move to the HDF Group's JNI wrappers for HDF5?

Yes, and then H5iosp would become a thin wrapper around that.

@JohnLCaron

If there are 1.8 problem files, could you keep them all some place and I will fix them at some point.

Sure, I've created an "Area: HDF" label and applied it to the relevant issues. I was only able to locate 2, but I swear we've seen others recently. I'll keep a look out.

Keeping the java-only option for versions < 1.10 is probably worth doing.

Fair enough, but this issue in particular was gnarly enough to make me throw my hands up in the air. Having written the code, I'm sure you'll have an easier time locating the problem than me, but just know that you won't be able to use the debugger to help.

JohnLCaron commented 8 years ago

but just know that you won't be able to use the debugger to help

I dont understand?


JohnLCaron commented 8 years ago

I understand that Unidata doesn't have the personnel to support the HDF5 Java IOSP, and using the C library is a safe thing to do. (However, don't underestimate the trouble with JNI and the very poor error messages from the C layers, both HDF5 and netCDF-4. You will need a specialist to maintain this, too.)

My intention is to keep the HDF5/netCDF Java library viable for < 1.10. I think it's a modest incremental effort. It will be a volunteer, best-effort sort of thing. I don't know why you wouldn't use it if the C library isn't available.

One aspect of some of these bugs is due to poor/non-existent file format documentation. I think Robert Schmunk tracked the last one down, after a year's wait for requested documentation. So the viability of the effort depends on cooperation from THG. If you are talking with them, it would be good to bring that up. It's really a strong benefit to them to have an independent implementation built from their docs.

As for custom filters, anyone using those has a non-standard, non-interoperable file. Sort of the antithesis of what netCDF is about. OTOH, some of those custom filters should/could become part of the standard, and then we would want to support them.


cwardgar commented 8 years ago

but just know that you won't be able to use the debugger to help

I dont understand?

In short, I could reproduce the problem when running THREDDS normally, but not when running it in the debugger.

The problem is load-dependent, requiring a large number of concurrent requests in order to manifest. I assume that since the JIT compiler isn't optimizing code as much when debugging is enabled, the requests aren't being processed fast enough to demonstrate the concurrency issues in the I/O. Just a theory.

JohnLCaron commented 8 years ago

This doesn't look like the usual HDF5 IOSP can't-read-a-file problem, but something very arcane. It may or may not be in the HDF5 IOSP; if it's not, you may see the problem even when using the C library.

@cwardgar can you make all your scripts available? Also, what version of THREDDS did you use? Is it reproducible on both 4.6 and 5.0?

cwardgar commented 8 years ago

Like I said, this one's gnarly.

All of my scripts are here. Just execute barf.sh after starting the server. I reproduced on the latest 4.6; didn't try it on 5.0.

Also, as I mentioned here, the problem does not manifest when I modify NetcdfFile to use Nc4Iosp to open the files instead of H5iosp. I still think that the HDF5 IOSP is the culprit.

cwardgar commented 8 years ago

#558 enabled a workaround for this, but the problem remains. I'll leave it open in case John wants to work on it some more.

JohnLCaron commented 8 years ago

OK, I'm guessing that this is a multithreading problem somewhere in the tangle of NcML and caching, where the same open file is being accessed in 2 threads. The reason that the C library "fixes" the problem is that it's single-threaded, so it serializes access even when called from multiple Java threads.

The other possibility is that the H5iosp code is not thread safe. If that were true, we wouldn't need a THREDDS server and NcML to reproduce.

Finally, maybe there's a bug in the H5iosp reading code, but it's hard to imagine why only a multithreading test finds it.

Let me see if I can reproduce it first, then we can try to track it down.

JohnLCaron commented 8 years ago

OK, I can reproduce on 5.0 running in IntelliJ in debug mode. Adding some info to the error log message that may be able to confirm whether the raf pointer is getting clobbered.

JohnLCaron commented 8 years ago

OK, here's my log output when it fails:

java.lang.IllegalStateException: DataBTree doesnt start with TREE err=parent=-1 address=16593 getFileOffset=16593 raf.getFilePointer=58405914

code looks like:

  raf.order(RandomAccessFile.LITTLE_ENDIAN); // header information is in le byte order
  long pos = h5.getFileOffset(address);
  raf.seek( pos);
  this.address = address;

  String magic = raf.readString(4);
  if (!magic.equals("TREE")) {
    String err = String.format("parent=%d address=%d getFileOffset=%d raf.getFilePointer=%d", parent, address, pos, raf.getFilePointer());
    throw new IllegalStateException("DataBTree doesnt start with TREE err=" + err);
  }

As you see, we position at pos, but then when we read, the file pointer has been munged (it should be at pos+4). I think that means that the raf is being used by another thread.
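
(A side note for anyone unfamiliar with the failure mode: the following is a minimal, self-contained sketch using plain java.io.RandomAccessFile and a temporary scratch file, not the CDM classes, showing how a seek followed by a read on a file object shared by two threads can leave the file pointer somewhere other than pos + 4, which is exactly the symptom in the log above.)

import java.io.File;
import java.io.RandomAccessFile;

public class SharedRafRace {
  public static void main(String[] args) throws Exception {
    File f = File.createTempFile("race", ".bin");
    try (RandomAccessFile w = new RandomAccessFile(f, "rw")) {
      w.setLength(1024 * 1024);  // scratch content; the values don't matter
    }
    final RandomAccessFile raf = new RandomAccessFile(f, "r");  // ONE instance shared by both threads

    Runnable reader = () -> {
      byte[] buf = new byte[4];
      try {
        for (int i = 0; i < 100000; i++) {
          long pos = (i * 17) % 1000000;
          raf.seek(pos);                 // this thread positions the shared pointer...
          raf.readFully(buf);            // ...but the other thread may have re-seeked in between
          long now = raf.getFilePointer();
          if (now != pos + 4) {
            System.out.printf("clobbered: expected %d, got %d%n", pos + 4, now);
            return;
          }
        }
      } catch (Exception e) {
        System.out.println("I/O failure from concurrent use: " + e);
      }
    };

    Thread t1 = new Thread(reader);
    Thread t2 = new Thread(reader);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    raf.close();
    f.delete();
  }
}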

JohnLCaron commented 8 years ago

I thought that the NcML aggregation would be to blame, but AFAIU it looks like you are just hammering 2 different files, apparently calling them asynchronously (not waiting for the previous call to finish).

It's funny that it always happens at the same place in DataBTree, but maybe that's the vulnerable spot where an I/O stall would switch to another thread.

Anyway, it points to it not being the aggregation (thank god).

1) It could be raf caching.

2) There is this unneeded wrapping in NcML:

<dataset name="UpperDeschutes_t4p10_swemelt.nc" ID="contrived.example.1" serviceName="dap" urlPath="UpperDeschutes_t4p10_swemelt.nc">
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="D:/work/blodgett/data/UpperDeschutes_t4p10_swemelt.nc"/>
</dataset>
<dataset name="stageiv_archive.nc" ID="contrived.example.3" serviceName="dap" urlPath="stageiv_archive.nc">
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" location="D:/work/blodgett/data/stageiv_archive.nc"/>
</dataset>

Could remove that and see if it reproduces; if it still fails, then it's probably raf caching.

If it no longer fails, maybe it's the interaction of NcML and raf caching.

JohnLCaron commented 8 years ago

Here's the full stack trace, for the record:

2016-06-22T21:33:18 [54491] [412] ERROR thredds.server.opendap.OpendapServlet: request= ReqState: serverClassName: 'null' dataSet: 'stageiv_archive.nc' requestSuffix: 'ascii' CE: 'Total_precipitation_surface_1_Hour_Accumulation[34:1:34][500:1:550][700:1:750]' compressOK: false InitParameters:

java.lang.IllegalStateException: DataBTree doesnt start with TREE err=parent=16593 address=58405906 getFileOffset=58405906 raf.getFilePointer=58405914
    at ucar.nc2.iosp.hdf5.DataBTree$Node.<init>(DataBTree.java:210) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.first(DataBTree.java:277) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$DataChunkIterator.<init>(DataBTree.java:170) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree.getDataChunkIteratorFilter(DataBTree.java:94) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5tiledLayoutBB.<init>(H5tiledLayoutBB.java:110) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:166) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:145) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2037) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.reallyRead(Variable.java:876) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable._read(Variable.java:861) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:567) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS._read(VariableDS.java:545) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:646) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at thredds.server.opendap.NcSDArray.read(NcSDArray.java:115) ~[classes/:5.0.0-SNAPSHOT]
    at opendap.servlet.AsciiWriter.writeAsc(AsciiWriter.java:107) ~[edu.ucar-opendap-5.0.0-SNAPSHOT.jar:?]
    at opendap.servlet.AsciiWriter.toASCII(AsciiWriter.java:68) ~[edu.ucar-opendap-5.0.0-SNAPSHOT.jar:?]
    at thredds.server.opendap.OpendapServlet.doGetASC(OpendapServlet.java:287) ~[classes/:5.0.0-SNAPSHOT]
    at thredds.server.opendap.OpendapServlet.doGet(OpendapServlet.java:195) [classes/:5.0.0-SNAPSHOT]
    at sun.reflect.GeneratedMethodAccessor50.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_73]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[?:1.8.0_73]
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:222) [spring-web-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:137) [spring-web-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:110) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:814) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:737) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:85) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:959) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:893) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:622) [servlet-api.jar:?]
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846) [spring-webmvc-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:729) [servlet-api.jar:?]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:291) [catalina.jar:8.0.26]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.26]
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) [tomcat-websocket.jar:8.0.26]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239) [catalina.jar:8.0.26]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.26]
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:316) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at thredds.servlet.filter.RequestBracketingLogMessageFilter.doFilter(RequestBracketingLogMessageFilter.java:81) [classes/:5.0.0-SNAPSHOT]
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at thredds.servlet.filter.RequestCORSFilter.doFilterInternal(RequestCORSFilter.java:81) [classes/:5.0.0-SNAPSHOT]
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107) [spring-web-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at thredds.servlet.filter.RequestQueryFilter.doFilter(RequestQueryFilter.java:120) [classes/:5.0.0-SNAPSHOT]
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at thredds.servlet.filter.HttpHeadFilter.doFilter(HttpHeadFilter.java:46) [classes/:5.0.0-SNAPSHOT]
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:330) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:213) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:176) [spring-security-web-4.0.3.RELEASE.jar:4.0.3.RELEASE]
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:346) [spring-web-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:262) [spring-web-4.2.3.RELEASE.jar:4.2.3.RELEASE]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239) [catalina.jar:8.0.26]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.26]
    at org.apache.logging.log4j.web.Log4jServletFilter.doFilter(Log4jServletFilter.java:71) [log4j-web-2.4.1.jar:2.4.1]
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:239) [catalina.jar:8.0.26]
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) [catalina.jar:8.0.26]
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219) [catalina.jar:8.0.26]
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:106) [catalina.jar:8.0.26]
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:502) [catalina.jar:8.0.26]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:142) [catalina.jar:8.0.26]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79) [catalina.jar:8.0.26]
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:616) [catalina.jar:8.0.26]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88) [catalina.jar:8.0.26]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:518) [catalina.jar:8.0.26]
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1091) [tomcat-coyote.jar:8.0.26]
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:673) [tomcat-coyote.jar:8.0.26]
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1526) [tomcat-coyote.jar:8.0.26]
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1482) [tomcat-coyote.jar:8.0.26]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-util.jar:8.0.26]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
JohnLCaron commented 8 years ago

changed catalog to:

<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="THREDDS Server Default Catalog : You must change this to fit your server!" xsi:schemaLocation="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0            http://www.unidata.ucar.edu/schemas/thredds/InvCatalog.1.0.6.xsd">
    <datasetRoot path="bloodyHell" location="D:/work/blodgett/data" />
    <service name="dap" base="" serviceType="compound">
        <service name="odap" serviceType="OpenDAP" base="/thredds/dodsC/"/>
    </service>
    <dataset name="UpperDeschutes_t4p10_swemelt.nc" ID="contrived.example.1" serviceName="dap" urlPath="bloodyHell/UpperDeschutes_t4p10_swemelt.nc" />
    <dataset name="stageiv_archive.nc" ID="contrived.example.3" serviceName="dap" urlPath="bloodyHell/stageiv_archive.nc" />
    <dataset name="stageiv_combined.ncml" ID="contrived.example.4" serviceName="dap" urlPath="bloodyHell/stageiv_combined.ncml" />
</catalog>

error still happens

JohnLCaron commented 8 years ago

Disable raf caching in threddsConfig.xml:

  <RandomAccessFile>
    <maxFiles>0</maxFiles>
  </RandomAccessFile>

fails immediately, in a different way:

java.lang.NullPointerException
    at ucar.unidata.io.RandomAccessFile.read_(RandomAccessFile.java:761) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.unidata.io.RandomAccessFile.readBuffer(RandomAccessFile.java:521) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.unidata.io.RandomAccessFile.seek(RandomAccessFile.java:509) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.<init>(DataBTree.java:204) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
...

This looks like a different bug with raf caching, maybe not involving multithreading.

JohnLCaron commented 8 years ago

OK, I see a (the?) problem. It is in H5iosp, which is caching the raf, which gets clobbered when the file is released and reacquired. It's going to take me a while on this one.

cwardgar commented 8 years ago

Makes sense; we had a similar issue in #435 where a cached RandomAccessFile was being used instead of the one in AbstractIOServiceProvider, which gets properly rehydrated when the dataset is reacquired.

dblodgett-usgs commented 8 years ago

Thanks for taking the time to work through this @JohnLCaron - I would really rather not switch to using the C library on our production systems if this could be fixed. I've tested 4.6.6 with the C library enough to know it isn't quite as robust. It seems that aggregation cache scans are a bit slower, which is a problem since aggregation caching is still not hitting (using the cache) after restarts, or after time/server usage has cleared whatever non-file cache is in place.

We are going to stick with 4.5.5 until we've rewritten a bunch of data to avoid joinExisting scans and tested 4.6.6 under load a bit more.

JohnLCaron commented 8 years ago

OK, I'm making a pull request on 5.0. It's simple, so it should be easy to backport to 4.6.

Turns out the RAF caching wasn't the problem, though it was masking/mitigating it. When you turn caching off:

 <RandomAccessFile>
    <maxFiles>0</maxFiles>
  </RandomAccessFile>

Then the problem happens as soon as you reacquire an RAF from the NetcdfFile cache, because the original RAF was closed (since there's no RAF caching), but H5header had a reference to it.

I would like to refactor, but for now a simple fix that's easily backported is in order.
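
(To spell out the pattern being described, here is a hypothetical sketch; the class and method names are made up for illustration and are not the actual H5header/H5iosp code. The bug is an object pinning the RandomAccessFile it saw at construction time instead of asking its owner for the current one after a close/reacquire cycle.)

// Hypothetical illustration of the stale-reference bug, not CDM source code.
class FileOwner {
  private java.io.RandomAccessFile raf;                       // swapped out when the file is reacquired

  java.io.RandomAccessFile currentRaf() { return raf; }
  void reacquire(java.io.RandomAccessFile fresh) { raf = fresh; }
}

class HeaderLikeObject {
  private final FileOwner owner;
  private final java.io.RandomAccessFile pinnedRaf;           // BUG: captured once, never refreshed

  HeaderLikeObject(FileOwner owner) {
    this.owner = owner;
    this.pinnedRaf = owner.currentRaf();                      // reference survives a close()/reacquire cycle
  }

  long readBroken() throws java.io.IOException {
    return pinnedRaf.readLong();                              // may hit a closed or recycled file
  }

  long readFixed() throws java.io.IOException {
    return owner.currentRaf().readLong();                     // always go through the owner's current raf
  }
}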

cwardgar commented 8 years ago

This is still failing for me on both 4.6 and 5.0, even with RandomAccessFile maxFiles set to 0.

[cwardgar@MacBook-Pro ~/dev/data/gh-518]
$ ./barf.sh 
Started hammer_coop.sh in the background (PID = 43909).
Sleeping for 15 seconds.
Downloading coordinate variables.
Started hammer_rfc.sh in the background (PID = 45998).
Started hammer_rfc.sh in the background (PID = 45999).
[hammer_coop.sh]: wget returned 8. Aborting.

In threddsServlet.log:

java.lang.AssertionError: shape[2] (385) >= pt[2] (390)
    at ucar.nc2.iosp.hdf5.Tiling.tile(Tiling.java:106) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.order(Tiling.java:118) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.compare(Tiling.java:133) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.first(DataBTree.java:271) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    ...
cwardgar commented 8 years ago

Also, that AssertionError should read shape[2] (385) < pt[2] (390).

JohnLCaron commented 8 years ago

Adding some notes here to remind myself how I'm running:

1) Run the TDS with:

-ea -server -Xmx2g -Dunidata.testdata.path=D: -Dtds.content.root.path=D:/work/blodgett/content

2) cd d:/work/blodgett

$ cat barf.sh

#!/bin/bash

./hammer_coop.sh &
echo "Started hammer_coop.sh in the background (PID = $!)."

echo "Sleeping for 15 seconds."
sleep 15

echo "Downloading coordinate variables."
wget -q http://localhost:8080/thredds/dodsC/stageiv_archive.nc.dods?time[0:1:999],lon[0:1:1120][0:1:880],lat[0:1:1120][0:1:880] -O temp2.ascii

./hammer_rfc.sh &
echo "Started hammer_rfc.sh in the background (PID = $!)."

./hammer_rfc.sh &
echo "Started hammer_rfc.sh in the background (PID = $!)."

wait  # Wait for background processes to finish.

3) $ cat hammer_coop.sh

#!/bin/bash

urlBase=http://localhost:8080/thredds/dodsC/UpperDeschutes_t4p10_swemelt.nc
var=.ascii?UpperDeschutes_t4p10_swemelt

for i in $(seq 8394); do
  # See http://stackoverflow.com/a/24520924/3874643
  if ! kill -0 $PPID 2>/dev/null; then
    echo "[hammer_coop.sh]: Parent process (PPID = $PPID) no longer running. Aborting."
    exit 1
  fi

  wget -q $urlBase$var[$i:1:$i][150:1:155][150:1:155] -O temp1.ascii
  status=$?

  if [ $status -ne 0 ] ; then
    echo "[hammer_coop.sh]: wget returned $status. Aborting."
    break
  fi
done

4) $ cat hammer_rfc.sh

#!/bin/bash

urlBase=http://localhost:8080/thredds/dodsC/stageiv_archive.nc
var=.ascii?Total_precipitation_surface_1_Hour_Accumulation

for i in $(seq 999); do
  # See http://stackoverflow.com/a/24520924/3874643
  if ! kill -0 $PPID 2>/dev/null; then
    echo "[hammer_rfc.sh]: Parent process (PPID = $PPID) no longer running. Aborting."
    exit 1
  fi

  wget -q $urlBase$var[$i:1:$i][500:1:550][700:1:750] -O temp3.ascii
  status=$?

  if [ $status -ne 0 ] ; then
    echo "[hammer_rfc.sh]: wget returned $status. Aborting."
    break
  fi
done

Must run with Cygwin, not MinGW, for wget to be available.

JohnLCaron commented 8 years ago

OK, finally got the exception after 8 minutes / 10K requests:

2016-07-30T18:24:06.622 -0600 [ 903161][ 10123] INFO - threddsServlet - Remote host: 0:0:0:0:0:0:0:1 - Request: "GET /thredds/dodsC/UpperDeschutes_t4p10_swemelt.nc.ascii?UpperDeschutes_t4p10_swemelt[8087:1:8087][150:1:155][150:1:155] HTTP/1.1"
2016-07-30T18:24:06.623 -0600 [ 903162][ 10123] ERROR - thredds.server.opendap.OpendapServlet - request= ReqState: serverClassName: 'null' dataSet: 'UpperDeschutes_t4p10_swemelt.nc' requestSuffix: 'ascii' CE: 'UpperDeschutes_t4p10_swemelt[8087:1:8087][150:1:155][150:1:155]' compressOK: false InitParameters:

java.lang.AssertionError: shape2 >= pt2
    at ucar.nc2.iosp.hdf5.Tiling.tile(Tiling.java:106) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.order(Tiling.java:118) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.compare(Tiling.java:133) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.first(DataBTree.java:271) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$DataChunkIterator.<init>(DataBTree.java:168) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree.getDataChunkIteratorFilter(DataBTree.java:92) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5tiledLayoutBB.<init>(H5tiledLayoutBB.java:112) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:166) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:145) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2048) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.reallyRead(Variable.java:876) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable._read(Variable.java:861) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:567) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS._read(VariableDS.java:545) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:646) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at thredds.server.opendap.NcSDArray.read(NcSDArray.java:115) ~[classes/:5.0.0-SNAPSHOT]

JohnLCaron commented 8 years ago

If I turn off raf caching:

  <RandomAccessFile>
    <maxFiles>0</maxFiles>
  </RandomAccessFile>

It should happen sooner (if the same thing is happening as last time)...

Noticing that the exception now being seen, java.lang.AssertionError: shape2 >= pt2, is different than what I saw before (but cwardgar reported it in this thread).

It happens again at exactly the same call:

2016-07-30T18:36:33.970 -0600 [ 445242][ 10094] INFO - threddsServlet - Remote host: 0:0:0:0:0:0:0:1 - Request: "GET /thredds/dodsC/UpperDeschutes_t4p10_swemelt.nc.ascii?UpperDeschutes_t4p10_swemelt[8087:1:8087][150:1:155][150:1:155] HTTP/1.1"
2016-07-30T18:36:33.971 -0600 [ 445243][ 10094] ERROR - thredds.server.opendap.OpendapServlet - request= ReqState: serverClassName: 'null' dataSet: 'UpperDeschutes_t4p10_swemelt.nc' requestSuffix: 'ascii' CE: 'UpperDeschutes_t4p10_swemelt[8087:1:8087][150:1:155][150:1:155]' compressOK: false InitParameters:

java.lang.AssertionError: shape2 >= pt2
    at ucar.nc2.iosp.hdf5.Tiling.tile(Tiling.java:106) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.order(Tiling.java:118) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.Tiling.compare(Tiling.java:133) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.first(DataBTree.java:271) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree$DataChunkIterator.<init>(DataBTree.java:168) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.DataBTree.getDataChunkIteratorFilter(DataBTree.java:92) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5tiledLayoutBB.<init>(H5tiledLayoutBB.java:112) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:166) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.iosp.hdf5.H5iosp.readData(H5iosp.java:145) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2048) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.reallyRead(Variable.java:876) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable._read(Variable.java:861) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS.reallyRead(VariableDS.java:567) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.dataset.VariableDS._read(VariableDS.java:545) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]
    at ucar.nc2.Variable.read(Variable.java:672) ~[edu.ucar-cdm-5.0.0-SNAPSHOT.jar:?]

This is for sure a different problem than the raf caching problem.

JohnLCaron commented 8 years ago

commit bc880ec

Added to the pull request - should be trivial to backport; just remove the line

  assert shape[i] >= pt[i] : String.format("shape[%s]=(%s) >= pt[%s]=(%s)", i, shape[i], i, pt[i]);

in ucar.nc2.iosp.hdf5.Tiling.

HDF5 consistency check on chunking index not always true:

// 7/30/2016 jcaron. Apparently in some cases, at the end of the array, the index can be greater than the shape.
// eg cdmUnitTest/formats/netcdf4/UpperDeschutes_t4p10_swemelt.nc
// Presumably to have even chunks. Could try to calculate the last even chunk.
// For now im removing this consistency check.
// assert shape[i] >= pt[i] : String.format("shape[%s]=(%s) should not be less than pt[%s]=(%s)", i, shape[i], i, pt[i]);

JohnLCaron commented 8 years ago
message type = SimpleDataspace(1);  ndims=3 flags=1 type=1  length=(8395,781,385,) max=(8395,781,385,)
message type = Layout(8);  type= 2 (chunked) storageSize = (1,30,30,8) dataSize=0 dataAddress=395837

Level 3: Tile want 2838607
length=           (8395,781,385,)
 Entry=  0 offset [   0,  0,  0,0,]: Tile order       0-177427
 Entry=  1 offset [ 505,390, 90,0,]: Tile order  177427-354867
 Entry=  2 offset [1011,  0,180,0,]: Tile order  354867-532294
 Entry=  3 offset [1516,390,270,0,]: Tile order  532294-709735
 Entry=  4 offset [2022, 30,  0,0,]: Tile order  709735-887038
 Entry=  5 offset [2527,120,270,0,]: Tile order  887038-1064404
 Entry=  6 offset [3032,390, 90,0,]: Tile order 1064404-1241844
 Entry=  7 offset [3538,  0,180,0,]: Tile order 1241844-1419271
 Entry=  8 offset [4043,390,270,0,]: Tile order 1419271-1596712
 Entry=  9 offset [4549, 30,  0,0,]: Tile order 1596712-1774015
 Entry= 10 offset [5054,120,270,0,]: Tile order 1774015-1951381
 Entry= 11 offset [5559,390, 90,0,]: Tile order 1951381-2128821
 Entry= 12 offset [6065,  0,180,0,]: Tile order 2128821-2306248
 Entry= 13 offset [6570,390,270,0,]: Tile order 2306248-2483689
 Entry= 14 offset [7076, 30,  0,0,]: Tile order 2483689-2660992
 Entry= 15 offset [7581,120,270,0,]: Tile order 2660992-2838358
 Entry= 16 offset [8086,390, 90,0,]: Tile order 2838358-2946931
 Entry= 17 offset [8395,630,390,8,]: Tile order 

java.lang.AssertionError: shape[2] (385) >= pt[2] (390)
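
(A quick sanity check of the numbers in the dump above, added for clarity; the arithmetic is mine, not part of the original comment. With shape[2] = 385 and a chunk length of 30 along that dimension, 13 chunks are needed, covering indices 0..389, so a boundary chunk offset of 390 is legitimate even though it exceeds the shape, which is what the removed assertion tripped on.)

// Minimal arithmetic sketch of why a chunk-grid boundary offset can exceed the shape.
public class ChunkBoundary {
  public static void main(String[] args) {
    int shape = 385;                             // extent of dimension 2 (from the Dataspace message above)
    int chunk = 30;                              // chunk length along that dimension (from the Layout message above)
    int nChunks = (shape + chunk - 1) / chunk;   // 13 chunks needed to cover 385 points
    int boundary = nChunks * chunk;              // 390: offset just past the last chunk
    System.out.printf("shape=%d nChunks=%d boundary=%d%n", shape, nChunks, boundary);
    // boundary (390) > shape (385), so "assert shape[i] >= pt[i]" can fail on a perfectly valid file.
  }
}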
cwardgar commented 8 years ago

bc880ec9457eede48a77096c15bb91089fdc68c5 is now on the 4.6 and 5.0 branches and, with it, my barf.sh script completes successfully. I believe this issue is fixed. @dblodgett-usgs Let me know if you discover otherwise.

dblodgett-usgs commented 8 years ago

Will give it a whirl. Thanks for digging into this @cwardgar and @JohnLCaron !!!

dblodgett-usgs commented 8 years ago

Initial testing indicates that this fixed the issue. Will continue to bang on it and see how it goes.