Unidata / tds

THREDDS Data Server
https://docs.unidata.ucar.edu/tds/5.0/userguide/index.html
BSD 3-Clause "New" or "Revised" License

WMS service is not using cache for S3 objects #311

Closed: WeatherGod closed this issue 11 months ago

WeatherGod commented 2 years ago

Using v5.4 (with the sts patch).

Been hunting down performance issues serving WMS layers for certain datasets served from S3 for the past couple of days (both memory usage and response times are far too high). I added some logging configs for awssdk and discovered that, to load a single moment in time, the same file on S3 was being fetched 360 times for one of my datasets, and 11,696 times for another, higher-resolution dataset!

WeatherGod commented 2 years ago

I can confirm that the catalog itself is getting cached and used, although a possibly sub-optimal behavior still exists there: if a WMS client like OpenLayers hits thredds with an initial burst of GetMap requests, the WMS servlet finds that the catalog doesn't exist in the cache for the first 10 or so requests and builds it 10 times before one finally lands in the cache for the rest of the requests to use.
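
To illustrate how that window could be closed (a minimal sketch only, not the actual TDS servlet code; the Catalogue class here is just a stand-in for the real WmsCatalogue), something like ConcurrentHashMap.computeIfAbsent would let that initial burst of requests share one build:

  import java.util.concurrent.ConcurrentHashMap;

  // Minimal sketch only: a ConcurrentHashMap's computeIfAbsent runs the
  // expensive build at most once per key, so a burst of first-time GetMap
  // requests for the same dataset path would share a single catalogue
  // instead of each building its own.
  public class CatalogueCacheSketch {
    // stand-in for thredds.server.wms.WmsCatalogue
    static class Catalogue {
      final String path;
      Catalogue(String path) { this.path = path; }
    }

    private final ConcurrentHashMap<String, Catalogue> cache = new ConcurrentHashMap<>();

    Catalogue getOrCreate(String path) {
      return cache.computeIfAbsent(path, p -> {
        System.out.println("Creating WMS Catalog for " + p); // happens only once per path
        return new Catalogue(p);
      });
    }

    public static void main(String[] args) {
      CatalogueCacheSketch sketch = new CatalogueCacheSketch();
      sketch.getOrCreate("ndvi/terra/ndvi_mean_2500m");
      sketch.getOrCreate("ndvi/terra/ndvi_mean_2500m"); // second call is a cache hit
    }
  }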

WeatherGod commented 2 years ago

Also, I don't know for sure if this is limited to S3. I haven't seen WMS requests result in much data landing in the cache directories, whether the data was served from S3 or from the local filesystem.

tdrwenski commented 2 years ago

Thanks for reporting this. We recently made a change to the WMS cache which may be causing these issues; we will have a look!

WeatherGod commented 2 years ago

How recently? Because this is happening while using a slightly modified v5.4 dockerhub image. Please let me know what else I can do to help debug and fix this as we are taking a huge performance hit trying to serve up high-res NDVI data.

WeatherGod commented 2 years ago

In case it matters, the datasets in question are joinNew aggregations.

tdrwenski commented 2 years ago

I think the change that probably caused this was merged shortly before the STS fix (https://github.com/Unidata/tds/pull/294). I will try to have a look as soon as possible and will let you know if I need more info!

WeatherGod commented 2 years ago

Looks like that was about the catalog cache. According to my logging, that is working (modulo the sub-optimal, mostly benign race condition I noted above). The issue here seems to be that the individual files that make up the aggregation aren't getting properly cached (either the cache isn't being built or it isn't being read).

WeatherGod commented 2 years ago

I can't find any good guidance on how to turn on logging of the netcdf-java components. In particular, I think it would be useful to know whether or not open_dataset() or acquire_dataset() is being used. As I understand it, the latter utilizes the library's caching mechanism. I tried some naive settings for the log4j2.xml file, but I didn't get any output from netcdf-java.
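
For reference, the sort of naive entries I tried looked roughly like this (the ucar.nc2 and ucar.unidata logger names are my guesses at the relevant packages; I'm not claiming this particular config works):

    <File name="netcdfJava" fileName="${tds.log.dir}/netcdfJava.log" append="false">
      <PatternLayout pattern="%d %c: %m%n"/>
    </File>
    <logger name="ucar.nc2" level="debug" additivity="false">
      <appender-ref ref="netcdfJava"/>
    </logger>
    <logger name="ucar.unidata.io" level="debug" additivity="false">
      <appender-ref ref="netcdfJava"/>
    </logger>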

WeatherGod commented 2 years ago

(sorry, openDataset() vs. acquireDataset(), not quite used to camel-casing...)

tdrwenski commented 2 years ago

You can try changing the log level from "info" to "debug" in log4j2.xml, but it won't necessarily give more info, depending on what we decided to write to the log. There's some info about logs here: https://docs.unidata.ucar.edu/tds/current/userguide/tds_logs.html

You can also see what files are in the caches in the admin service: https://docs.unidata.ucar.edu/tds/current/userguide/remote_management_ref.html#tds-remote-debugging

A few questions:

WeatherGod commented 2 years ago

Yes, there is a scan element. To be clear, the problem isn't with the indexing, as that is fairly quick since I use a dateFormatMark. I have not tried any other service yet. I'll get things spun back up shortly to try out some more logging.

To confirm what's going on with the catalog cache, I used the following patch and compiled the v5.4 release of the code.

diff --git a/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java b/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
index b6df960409..f610f9c26a 100644
--- a/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
+++ b/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
@@ -40,6 +40,8 @@ import java.util.HashMap;
 import java.util.Map;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 import thredds.core.TdsRequestedDataset;
 import ucar.nc2.NetcdfFile;
 import ucar.nc2.dataset.NetcdfDataset;
@@ -61,6 +63,7 @@ import ucar.nc2.dataset.NetcdfDataset;
 public class ThreddsWmsServlet extends WmsServlet {

   private static final Map<String, WmsCatalogue> catalogueCache = new HashMap<>();
+  static private final Logger logger = LoggerFactory.getLogger(ThreddsWmsServlet.class);

   @Override
   @RequestMapping(value = "**", method = {RequestMethod.GET})
@@ -82,8 +85,10 @@ public class ThreddsWmsServlet extends WmsServlet {
     String removePrefix = null;
     TdsRequestedDataset tdsDataset = new TdsRequestedDataset(httpServletRequest, removePrefix);
     if (catalogueCache.containsKey(tdsDataset.getPath())) {
+      logger.info("Found " + tdsDataset.getPath() + " in the cache");
       catalogue = catalogueCache.get(tdsDataset.getPath());
     } else {
+      logger.info("Did not find " + tdsDataset.getPath() + " in the cache");
       NetcdfFile ncf = TdsRequestedDataset.getNetcdfFile(httpServletRequest, httpServletResponse, tdsDataset.getPath());
       NetcdfDataset ncd;
       if (TdsRequestedDataset.useNetcdfJavaBuilders()) {
@@ -109,6 +114,7 @@ public class ThreddsWmsServlet extends WmsServlet {
       if (netcdfFilePath == null) {
         throw new EdalLayerNotFoundException("The requested dataset is not available on this server");
       }
+      logger.info("Creating WMS Catalog for " + tdsDataset.getPath());
       catalogue = new ThreddsWmsCatalogue(ncd, tdsDataset.getPath());
       catalogueCache.put(tdsDataset.getPath(), catalogue);
     }

And I added appropriate entries to log4j2.xml. To see the extraneous S3 fetches, add the following to the log4j2.xml file (I know you know this already, but I'm including it here for completeness):

    <File name="awssdk" fileName="${tds.log.dir}/awssdk.log" append="false">
      <PatternLayout pattern="%d %c: %m%n"/>
    </File>
    <logger name="software.amazon.awssdk" level="debug" additivity="false">
      <appender-ref ref="awssdk"/>
    </logger>

    <logger name="software.amazon.awssdk.request" level="debug" additivity="false">
      <appender-ref ref="awssdk"/>
    </logger>

Here is one of my entries. Unfortunately, the bucket isn't publicly available:

  <dataset name="MODIS Terra NDVI 2.5km" ID="terra_ndvi_2500m" urlPath="ndvi/terra/ndvi_2500m">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataType>Grid</dataType>
    </metadata>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <variable name="ndvi">
        <remove name="_ChunkSizes" type="attribute"/>
      </variable>
      <aggregation dimName="time" type="joinNew" recheckEvery="15 min" >
        <variableAgg name="ndvi"/>
        <scan location="cdms3:aer-h2o-data?test/gimms/ndvi/#delimiter=/"
            dateFormatMark="GMOD09Q1.A#yyyyDDD" subdirs="false"
            regExp="GMOD09Q1.A[0-9]{7}.*\.nc$" />
      </aggregation>
    </netcdf>
  </dataset>

When I get the server spun back up, I'll post some of my logging results.

WeatherGod commented 2 years ago

Counting the number of times the same file gets fetched for a set of GetMap calls to display a single moment in time:

root@8bb99281bafb:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | tail -n 1
2022-10-27 21:05:09,566 software.amazon.awssdk.request: Sending Request: DefaultSdkHttpFullRequest(httpMethod=GET, protocol=https, host=aer-h2o-data.s3.amazonaws.com, port=443, encodedPath=/test/gimms/ndvi/GMOD09Q1.A2022249.08d.latlon.global.061.NDVI.subsampled.nc, headers=[amz-sdk-invocation-id, Range, User-Agent, x-amz-te], queryParameters=[])
root@8bb99281bafb:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
8749

(It still isn't done, and it has been several minutes).

Contents of the cache:

root@8bb99281bafb:/usr/local/tomcat# du -hsc content/thredds/cache/*
0   content/thredds/cache/agg
0   content/thredds/cache/aggNew
9.7M    content/thredds/cache/catalog
0   content/thredds/cache/cdm
5.9M    content/thredds/cache/collection
26M content/thredds/cache/edal-java
0   content/thredds/cache/grib
0   content/thredds/cache/ncss
0   content/thredds/cache/wcs
41M total
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/catalog/
total 9872
-rw-r----- 1 tomcat tomcat       28 Oct 27 20:53 catTracker.dat
-rw-r----- 1 tomcat tomcat 10100736 Oct 27 20:54 chronicle.datasets.dat.1
-rw-r----- 1 tomcat tomcat        0 Oct 27 20:53 datarootTracker.dat
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/collection/
total 6040
-rw-r----- 1 tomcat tomcat 6184960 Oct 27 20:53 GridDatasetInv.dat
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/edal-java/
total 0
drwxr-x--- 3 tomcat tomcat 16 Oct 27 20:53 epsg

WeatherGod commented 2 years ago

And as proof that the catalogue cache is being found and used, here is a snippet for a similar map layer:

root@f1b2af09dbaa:/usr/local/tomcat# grep ThreddsWmsServlet content/thredds/logs/threddsServlet.log 
2022-10-27T21:21:55.929 +0000 [    338180][      33] INFO  - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:55.954 +0000 [    338205][      34] INFO  - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [    338265][      35] INFO  - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [    338265][      37] INFO  - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [    338265][      36] INFO  - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:57.579 +0000 [    339830][      34] INFO  - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [    339830][      37] INFO  - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [    339830][      33] INFO  - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [    339830][      35] INFO  - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [    339830][      36] INFO  - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:22:04.371 +0000 [    346622][      39] INFO  - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.374 +0000 [    346625][      40] INFO  - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.385 +0000 [    346636][      41] INFO  - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.387 +0000 [    346638][      42] INFO  - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.388 +0000 [    346639][      43] INFO  - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache

Notice the sub-optimal behavior I pointed out before. Several catalogues are created in the initial flurry of requests, but eventually the cached one is found.

WeatherGod commented 2 years ago

As for the question about whether this problem is limited to WMS: it looks like other services aren't hit as badly, but it might still be happening. In the following greps of my logs, I count how many times a particular file shows up in awssdk.log; the third file comes up zero, indicating that I hadn't made a request for it yet. I then did a single WCS GetCoverage call for a bounding box of the data on that day, and ran the same greps again.

root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
12282
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022241 | wc -l
6711
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022233 | wc -l
0
... Do Single WCS request ...
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
12323
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022241 | wc -l
6711
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022233 | wc -l
679

I wonder if I might be misinterpreting the httpMethod=GET entries as fetches of the entire file when they are actually fetches of chunks. I don't know the details of this. I will point out that this WCS GetCoverage call was significantly faster (~5 seconds) than OpenLayers building a map of the same area (5+ minutes). Also notice that the count for doy 249 increased after the WCS call. That particular day is the penultimate day in our dataset, so that might be from some sort of indexing action?

WeatherGod commented 2 years ago

Continuing my debugging, I have added some logging to the S3-related classes in netcdf-java. One thing I noticed is that just querying for the Ncml document for the catalog results in a new scan of the directories in S3. I'm still trying to figure out what code actually triggers the S3 object fetch. I would have thought it would be something in the MFileS3 class, but my logs do not indicate that that is used to actually fetch the data.

WeatherGod commented 2 years ago

So, I have found a couple of bugs in netcdf-java. Fixing them doesn't solve our problem here, but I'll submit the fixes shortly anyway.

tdrwenski commented 2 years ago

Thanks for looking into this! We always appreciate help with bug fixes!

I think I see a problem: the S3RandomAccessFile is sometimes passed the default_buffersize = 8092 from NetcdfFiles.java, whereas in other cases it uses its own much more reasonable default of defaultRemoteFileBufferSize = 262144 (256 KB). This might be the reason for the insanely large numbers of S3 requests! I hope to have a fix soon and would definitely appreciate your help testing it.
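
A back-of-the-envelope illustration (my own arithmetic, not a measurement from the TDS) of why the buffer size matters for a sequential read:

  public class BufferSizeSketch {
    public static void main(String[] args) {
      // Rough arithmetic only: a straight sequential read needs about
      // size/buffer ranged GET requests (ignoring caching, read-ahead, and
      // any request coalescing).
      long objectBytes = 80L * 1024 * 1024;   // e.g. an ~80 MB NetCDF object (made-up size)
      int smallBuffer = 8092;                 // default_buffersize from NetcdfFiles.java
      int remoteBuffer = 262144;              // defaultRemoteFileBufferSize (256 KB)
      System.out.println("8 KB buffer:   ~" + (objectBytes / smallBuffer) + " requests");   // ~10366
      System.out.println("256 KB buffer: ~" + (objectBytes / remoteBuffer) + " requests");  // ~320
    }
  }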

WeatherGod commented 2 years ago

I opened https://github.com/Unidata/netcdf-java/pull/1105 with the fixes for the bugs I noticed. One of them impacts the client cache, which might help us slightly here.

Your theory on the block sizes makes sense and is definitely worth testing out. It doesn't explain everything, but given that each request has to go through authentication for me, there is definitely a lot of overhead with each of those requests. Separately, we still need to figure out why thredds keeps wanting to re-download the same file over and over. Hopefully, that will be easier to figure out with some cleaner behavior elsewhere in the toolchain.

tdrwenski commented 1 year ago

The fix that you tested from https://github.com/Unidata/netcdf-java/pull/1106 is in the latest snapshot and docker image.

Now that the performance is somewhat improved, are you still having issues with WMS or should we close this issue?

WeatherGod commented 1 year ago

Things are better, but caches are still not being utilized and memory usage still grows as more requests are made. I am currently doing some more testing and logging to determine if the problem is in netcdf-java, tds, or ncWMS.

It is possible that there are two inter-related issues. One being a caching issue in netcdf-java, and the other being an issue with ncWMS trying (and failing) to manage its own cache (the code isn't quite clear at the moment). So, right now, I'm trying to constrain the problem by exercising the WCS endpoint since it is relatively simple compared to WMS and therefore easier to track down certain problems.

WeatherGod commented 1 year ago

Turned up the logging in FileCache.java for netcdf-java, and there is definitely something weird going on:

[2022-11-03T17:41:35.568+0000] FileCache NetcdfFileCache  acquire cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/
[2022-11-03T17:41:35.570+0000] FileCache NetcdfFileCache  add to cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/
[2022-11-03T17:41:35.570+0000] CacheElement add to cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ NetcdfFileCache 
[2022-11-03T17:41:35.577+0000] FileCache NetcdfFileCache  release cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/; hash= 194111868
[2022-11-03T17:41:35.708+0000] FileCache NetcdfFileCache  acquire from cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ failed: cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter= (No such file or directory)
[2022-11-03T17:41:35.708+0000] FileCache NetcdfFileCache  remove cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/

Notice that the final / is missing from the IOException's message. I'm still tracing things down, but my suspicion is that there is code that naively treats the location as a filesystem path and strips the final slash, thinking it indicates a directory, maybe? Of course, the bigger question is why it thinks it is a file path and not a URL with query parameters...
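
As a small illustration of that suspicion (this only shows that naive path handling drops the trailing slash, not that this is where it actually happens in netcdf-java):

  import java.io.File;

  public class TrailingSlashSketch {
    public static void main(String[] args) {
      String loc = "cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/";
      // java.io.File treats the string as a filesystem path and normalizes
      // away the trailing separator, which matches the message in the
      // IOException above:
      System.out.println(new File(loc).getPath());
      // prints (on Linux): cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=
    }
  }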

tdrwenski commented 1 year ago

Thanks for the extra debugging help! I had a quick look and indeed it looks like the NetcdfFileCache is not working correctly for S3. It does put the S3 file in the cache but cannot use it again. It looks to me like this is because the cache tries to reacquire by using a RandomAccessFile, which only works for local files, and which is throwing the IOException you mentioned. Instead, it should probably use NetcdfFiles::getRaf to create the appropriate type of random access file, which would work for S3. I will have a look at how to test and fix this!

WeatherGod commented 1 year ago

Could the solution be to implement the reacquire() method in S3RandomAccessFile or even RemoteRandomAccessFile? Sorry if this doesn't make sense, my understanding of polymorphism is rooted in python.

WeatherGod commented 1 year ago

As for testing, I noticed that the FileCache object keeps miss and hit counts. So one could write up a test that fetches a particular S3 Dataset twice, and checks the number of hits and misses after that.
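
Something along these lines is what I have in mind (a very rough sketch: the bucket/key is made up, and the assertions are left as comments because I don't know exactly how the hit/miss counters are exposed):

  import ucar.nc2.dataset.DatasetUrl;
  import ucar.nc2.dataset.NetcdfDataset;
  import ucar.nc2.dataset.NetcdfDatasets;

  public class S3CacheReuseTest {
    public static void main(String[] args) throws Exception {
      NetcdfDatasets.initNetcdfFileCache(10, 20, 0);   // enable the dataset cache

      // hypothetical S3 object; any small test dataset would do
      DatasetUrl durl = DatasetUrl.findDatasetUrl("cdms3:some-bucket?some/key.nc#delimiter=/");

      try (NetcdfDataset ds = NetcdfDatasets.acquireDataset(durl, null)) { }  // expect: cache miss
      try (NetcdfDataset ds = NetcdfDatasets.acquireDataset(durl, null)) { }  // expect: cache hit

      // Expectation: one miss (first open) and one hit (second open), e.g.
      // assertEquals(1, cache.hits()); assertEquals(1, cache.misses());
    }
  }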

WeatherGod commented 1 year ago

Oh, I think I'm starting to see the flow. From my logging of method calls and getClass() results:

FileCache.acquireCacheOnly() calls NetcdfFile.reacquire()
NetcdfFile.reacquire() calls H5iospNew.reacquire()
H5iospNew.reacquire() calls super.reacquire()
super refers to AbstractIOServiceProvider, I believe
AbstractIOServiceProvider.reacquire() calls RandomAccessFile.reacquire() directly, without knowing anything about the original object that triggered the reacquire.

Without fully understanding what the IOServiceProvider is supposed to do, perhaps the NetcdfFile object's iosp is where we could place some logic that recognizes which reacquire() (and other related methods) to call?

WeatherGod commented 1 year ago

Oh, wait, slightly incorrect:

AbstractIOServiceProvider.reacquire() calls a one-parameter version of reacquire(location), which isn't implemented in RandomAccessFile. I can't find this anywhere...

WeatherGod commented 1 year ago

Gosh, sorry, my eyes are getting crossed... AbstractIOServiceProvider.reacquire() calls RandomAccessFile.acquire(location) (acquire vs. reacquire).

WeatherGod commented 1 year ago

Ok, a possible design limitation, maybe?

In RandomAccessFile.java

  public static RandomAccessFile acquire(String location) throws IOException {
    if (cache == null) {
      log.debug("No cache, building RandomAccessFile from {}", location);
      return new RandomAccessFile(location, "r");
    } else {
      log.debug("There is a cache. Acquiring using factory {} for location {} from {}", factory, location, cache);
      return (RandomAccessFile) cache.acquire(factory, DatasetUrl.create(null, location));
    }
  }

(logging is mine)

This is a static method utilizing a static factory member, and that factory's FileFactory.open() always returns a plain RandomAccessFile. Meanwhile, RemoteRandomAccessFile doesn't override this factory instance, nor does it override acquire(). So even if we could get RemoteRandomAccessFile.acquire() to be called, it would still be working with a regular RandomAccessFile instance.

WeatherGod commented 1 year ago

I think I'm at my limit of understanding how the caching mechanism is supposed to work, so I'm going to back off now. Let me know if there is anything you'd like me to try out or suggestions for additional logging, or anything else that would help you in figuring out a fix for this.

WeatherGod commented 1 year ago

One thing that is a nagging question in my mind is why the cache is even doing acquisitions in the first place. The NetcdfFile instance is stored in and retrieved from the cache, so I don't understand what more needs to be done.

WeatherGod commented 1 year ago

Looking at this again with fresh eyes, there might be some sort of semi-recursion or something going on, if I am reading this right. The following is for a NetcdfDatasets.acquireFile() call after the initial one done to inventory the coordinates. So, there is a cache hit at first.

The only thing different the second time through (actually, third time, but who's counting?) is that for some reason, it is a cache miss.

After the cache miss, it then goes and does a straight-up open, but it uses the RandomAccessFile as the factory class instead of NetcdfDatasets.StandardFileFactory.

So, it seems like there are two problems... maybe even three:

  1. Why does FileCache.acquireCacheOnly() end up calling itself in a round-about way? Could this get short-circuited?
  2. Why is it a cache miss when acquireCacheOnly() is recursing?
  3. Maybe the cache should store the factory within the FileCacheable, so that if an acquire() after a hit fails, it can properly rebuild the object using the factory?
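
To sketch what I mean in (3), with completely made-up names (this is not how FileCache is structured today):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  // Idea: keep the factory that built each cache entry next to the entry, so
  // that when re-validating a hit fails, the cache can rebuild the object the
  // same way it was originally built instead of falling back to a plain local
  // RandomAccessFile.
  public class FactoryAwareCacheSketch {
    interface Resource { boolean stillUsable(); }                       // stands in for FileCacheable/reacquire()
    interface Factory { Resource open(String location) throws IOException; }

    private static final class Entry {
      Resource value;
      final Factory factory;
      Entry(Resource value, Factory factory) { this.value = value; this.factory = factory; }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    synchronized Resource acquire(String location, Factory factory) throws IOException {
      Entry entry = cache.get(location);
      if (entry == null) {
        entry = new Entry(factory.open(location), factory);
        cache.put(location, entry);
      } else if (!entry.value.stillUsable()) {
        // cache hit, but the backing file/connection is gone: rebuild with
        // the original factory instead of assuming a local file
        entry.value = entry.factory.open(location);
      }
      return entry.value;
    }
  }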

WeatherGod commented 1 year ago

To answer my earlier question about why one would need to do a re-acquire in the first place even though we have the NetcdfFile object: it looks like it is necessary to re-establish the file object that backs the NetcdfFile object. If I comment out the acquire() that is called for "rehydrating" the NetcdfFile object, I'll eventually get a NullPointerException:

java.lang.NullPointerException: null
    at ucar.nc2.iosp.IospHelper.readData(IospHelper.java:116) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.iosp.IospHelper.readDataFill(IospHelper.java:58) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readDataPrimitive(H5iospNew.java:576) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:390) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:264) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:200) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2124) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.Variable.reallyRead(Variable.java:797) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.Variable._read(Variable.java:736) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.Variable.read(Variable.java:614) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.internal.ncml.AggProxyReader.reallyRead(AggProxyReader.java:37) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.dataset.VariableDS._read(VariableDS.java:432) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.Variable.read(Variable.java:614) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.dataset.CoordinateAxis1D.readValues(CoordinateAxis1D.java:860) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.dataset.CoordinateAxis1D.doRead(CoordinateAxis1D.java:757) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.dataset.CoordinateAxis1D.getCoordValue(CoordinateAxis1D.java:249) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.ft2.coverage.adapter.DtCoverageAdapter.makeCoordAxis(DtCoverageAdapter.java:218) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.ft2.coverage.adapter.DtCoverageAdapter.factory(DtCoverageAdapter.java:64) ~[cdm-core-5.5.3.jar:5.5.3]
    at ucar.nc2.ft2.coverage.CoverageDatasetFactory.openNcmlString(CoverageDatasetFactory.java:142) ~[cdm-core-5.5.3.jar:5.5.3]
    at thredds.core.DatasetManager.openCoverageDataset(DatasetManager.java:429) ~[classes/:5.4]

The raf variable was null.

This makes me wonder how caching would work for an S3-backed NetcdfFile. If the code is aggressively closing the NetcdfFile object, thereby dropping its (what I assume is an) S3RandomAccessFile instance, then fixing this reacquire() problem would only get us partway there, because the data itself still isn't cached and would have to be re-fetched from S3 (assuming I am understanding everything correctly, which is a huge assumption). How does this work for a local file? Is reacquire() re-establishing the file object and re-reading data from the disk?

WeatherGod commented 1 year ago

Could the solution to this problem be to have a "cdms3://" version of NetcdfFileProvider, much like the CdmRemoteNetcdfFileProvider class and supporting classes? I'm not fully understanding what exactly it is used for, but if one existed and was registered for cdms3:// addresses, then it would intercede in the file acquisition process before FileCache.acquireCacheOnly(). Maybe it could provide a proper spiObject instead of null?

WeatherGod commented 1 year ago

Actually, that idea of using NetcdfFileProvider seems to be an abuse of the system. I'm now looking back to what you originally suggested, using NetcdfFiles::getRaf(). I think I figured out the right location for it, in AbstractIOServiceProvider::reacquire().

WeatherGod commented 1 year ago

Ugh, there's a snag. ncfile is of type NetcdfFile, which has a private getRaf() method, which we can't call from AbstractIOServiceProvider::reacquire(). NetcdfFiles (plural) has a public one, but I don't think it would get us the proper RAF anyway. In fact, I'm trying to figure out how the proper RAF gets determined in the first place.

WeatherGod commented 1 year ago

Oh, wait! NetcdfFiles::getRaf() is a static public method! I could use that regardless of what class ncfile is!
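
Concretely, the change I have in mind is roughly this in AbstractIOServiceProvider (paraphrased, so the exact getRaf signature and buffer-size argument are my assumptions; raf and location are the provider's existing fields):

  public void reacquire() throws IOException {
    // Instead of RandomAccessFile.acquire(location), which only knows about
    // local files, let the location string decide whether a local, HTTP, or
    // S3 RandomAccessFile gets rebuilt.  A non-positive buffer size meaning
    // "use the default" is my assumption.
    raf = NetcdfFiles.getRaf(location, -1);
  }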

WeatherGod commented 1 year ago

That worked! The cache is a bit better behaved now. The memory seems to grow a bit more slowly, and we are getting far fewer cache misses. It is still re-fetching the S3 data at about the same rate, as I suspected a few posts ago, but I consider this another step forward. I'll see if I can design a unit test and submit a PR.

WeatherGod commented 1 year ago

Opened https://github.com/Unidata/netcdf-java/pull/1111

WeatherGod commented 1 year ago

With https://github.com/Unidata/netcdf-java/pull/1111 merged in, we should start seeing better cache utilization in the WMS (and other) components. My next target for better cache utilization is to see if we can get WMS to use the enhanced dataset object that netcdf-java has already put together, rather than building it again for its own version of the gridded dataset class. This would cut down significantly on extraneous reads, especially when the dataset requires reading the coordinate data.

Another possible area of improvement is to see if WMS really needs to keep opening and closing the files while building the dataset. It will open a file (requires a HEAD operation for S3 and a fetch of the first block every time), read a single coordinate variable (another fetch, usually), then close the file. Then open the file again (new HEAD and fetch), and read the next coordinate variable (usually in the same block as the other, but that's gone since closing the file drops the remote read cache). Another way to approach this problem is to allow the remote read cache to persist between reacquires.
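
To make that open/read/close churn concrete, here is roughly the pattern I'm describing versus reading all the coordinates inside a single open (simplified sketch with netcdf-java-style calls; the variable names and location are made up):

  DatasetUrl durl = DatasetUrl.findDatasetUrl("cdms3:some-bucket?some/key.nc#delimiter=/");

  // Current behavior, roughly: one open/close cycle per coordinate variable.
  for (String coordVar : new String[] {"time", "lat", "lon"}) {
    try (NetcdfDataset ds = NetcdfDatasets.acquireDataset(durl, null)) {  // HEAD + first-block GET each time
      ds.findVariable(coordVar).read();                                   // usually another ranged GET
    }                                                                     // close drops the remote read buffer
  }

  // Cheaper pattern: read all the coordinate variables inside one open, so
  // later reads can hit blocks that were already fetched.
  try (NetcdfDataset ds = NetcdfDatasets.acquireDataset(durl, null)) {
    for (String coordVar : new String[] {"time", "lat", "lon"}) {
      ds.findVariable(coordVar).read();
    }
  }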

tdrwenski commented 11 months ago

I believe we fixed several performance issues and the issue related to the cache not being used for S3, so I am going to close this. It sounds like you still have other ideas for improvements-- feel free to put them in a new issue.