Unidata / tds

THREDDS Data Server
https://docs.unidata.ucar.edu/tds/5.0/userguide/index.html
BSD 3-Clause "New" or "Revised" License

Constant but random java.lang.IllegalStateExceptions #520

Closed mkatgert-marin closed 1 month ago

mkatgert-marin commented 2 months ago

We are running Thredds-docker:5.4 and encounter a lot of exceptions when requesting data. The majority are:

java.lang.IllegalStateException: DataBTree doesnt start with TREE
    at ucar.nc2.iosp.hdf5.DataBTree$Node.<init>(DataBTree.java:168) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.iosp.hdf5.DataBTree$Node.first(DataBTree.java:246) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.iosp.hdf5.DataBTree$DataChunkIterator.<init>(DataBTree.java:130) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.iosp.hdf5.DataBTree.getDataChunkIteratorFilter(DataBTree.java:67) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5tiledLayoutBB.<init>(H5tiledLayoutBB.java:97) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:222) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:200) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2122) ~[cdm-core-5.5.3.jar:5.5.3]

but we also get the following, along with about 10 other different exceptions:

java.lang.IllegalStateException: java.util.zip.ZipException: invalid distance too far back
java.io.IOException: Invalid argument
java.io.IOException: Negative seek offset
java.lang.IllegalStateException: DataBTree must be type 1

Most of the time, it is sufficient to re-request the data, sometimes up to three or four times, after which the server returns the correct response. Sometimes the problematic response seems to be cached, and the problem is solved by restarting the server.
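The re-request workaround described above can be sketched on the client side as a small retry wrapper. This is only an illustration of the workaround, not a fix for the underlying server bug; the function name `fetch_with_retry`, its parameters, and the choice of exception types are assumptions made for this example and are not part of TDS or any client library.

```python
import time


def fetch_with_retry(fetch, attempts=4, delay_s=0.5):
    """Call fetch() until it succeeds or `attempts` calls have failed.

    `fetch` is any zero-argument callable that performs one data request
    (hypothetical: in practice it would wrap an OPeNDAP read).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except (IOError, RuntimeError) as err:
            # Assumption: server-side failures surface to the client as
            # I/O or runtime errors; adjust to the client library in use.
            last_error = err
            if attempt < attempts:
                # Linear backoff so a busy server is not hit again at once.
                time.sleep(delay_s * attempt)
    raise last_error
```

With `attempts=4` this mirrors the "up to three or four times" pattern from the report. It would not help in the case where the bad response appears to be cached until the server restarts.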

This is the dataset definition we're requesting:

<service name="ncdods" serviceType="OPENDAP" base="/thredds/dodsC/"/>
<dataset name="OWI" urlPath="owi/owi.nc">
    <serviceName>ncdods</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <aggregation dimName="time" type="joinExisting">
            <scan location="/owi" suffix=".nc" subdirs="false"/>
        </aggregation>
    </netcdf>
</dataset>
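For readers reproducing the problem, the following sketch shows how the OPeNDAP access URL follows from this definition: the service `base` is concatenated with the dataset `urlPath`. The server address `http://localhost:8080`, the wrapping `<catalog>` element, and the helper name `opendap_url` are assumptions for illustration; real catalogs normally declare an XML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Trimmed copy of the definition above, wrapped in a root element
# (the wrapper is an assumption so the fragment parses on its own).
CATALOG_FRAGMENT = """
<catalog>
  <service name="ncdods" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="OWI" urlPath="owi/owi.nc">
    <serviceName>ncdods</serviceName>
  </dataset>
</catalog>
"""


def opendap_url(server, catalog_xml, dataset_name):
    """Build the access URL as server + service base + dataset urlPath."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to its base path.
    bases = {s.get("name"): s.get("base") for s in root.iter("service")}
    for ds in root.iter("dataset"):
        if ds.get("name") == dataset_name:
            service = ds.findtext("serviceName")
            return urljoin(server, bases[service] + ds.get("urlPath"))
    raise KeyError(dataset_name)


# Hypothetical server address, used only for this example.
print(opendap_url("http://localhost:8080", CATALOG_FRAGMENT, "OWI"))
```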

All related issues I found should have already been solved in earlier versions (for example https://github.com/Unidata/thredds/issues/518).

The server is heavily used, with ~10-100 requests per second.

tdrwenski commented 2 months ago

We fixed a similar issue in https://github.com/Unidata/netcdf-java/pull/1122 which is in our TDS 5.5 release. Can you test with 5.5 to see if that resolves the issue for you?

mkatgert-marin commented 1 month ago

Hi, thanks for the reply. We've thoroughly tested version 5.5 and now have it running in our production environment, since it seems to solve all the bugs we were encountering. Thanks!