I can confirm that the catalog itself is getting cached and used, although a sub-optimal behavior still exists there: if a WMS client like OpenLayers hits THREDDS with an initial round of GetMap requests, the WMS servlet finds that the catalog doesn't exist in the cache for the first 10 or so requests, and so builds it 10 times before it finally lands in the cache for the rest of the requests to use.
Also, I don't know for sure that this is limited to S3. I haven't seen WMS requests result in much data landing in the cache directories, whether the data was served from S3 or the local filesystem.
Thanks for reporting this. We recently made a change to the WMS cache which may be causing these issues; we will have a look!
How recently? Because this is happening while using a slightly modified v5.4 dockerhub image. Please let me know what else I can do to help debug and fix this as we are taking a huge performance hit trying to serve up high-res NDVI data.
In case it matters, the datasets in question are joinNew aggregations.
I think the change that probably caused this was merged shortly before the STS fix (https://github.com/Unidata/tds/pull/294). I will try to have a look as soon as possible and will let you know if I need more info!
Looks like that was about the catalog cache. According to my logging, that is working (modulo the sub-optimal, mostly benign race condition I noted above). The issue here seems to be that the individual files that make up the aggregation aren't getting properly cached (either the cache isn't being built or it isn't being read from).
I can't find any good guidance on how to turn on logging of the netcdf-java components. In particular, I think it would be useful to know whether or not open_dataset() or acquire_dataset() is being used. As I understand it, the latter utilizes the library's caching mechanism. I tried some naive settings for the log4j2.xml file, but I didn't get any output from netcdf-java.
(sorry, openDataset() vs. acquireDataset(), not quite used to camel-casing...)
You can try to change the log level from "info" to "debug" in log4j2.xml, but it won't necessarily give more info, depending on what we decided to write to the logs. There's some info about logs here: https://docs.unidata.ucar.edu/tds/current/userguide/tds_logs.html
You can also see what files are in the caches in the admin service: https://docs.unidata.ucar.edu/tds/current/userguide/remote_management_ref.html#tds-remote-debugging
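As a concrete (untested) example of the kind of entry that might go into log4j2.xml for netcdf-java output, mirroring the awssdk appender shown further below (the logger name ucar.nc2 covers most netcdf-java classes, though how much actually appears at debug level depends on what the library chooses to log):

```xml
<!-- Sketch only: route netcdf-java (ucar.nc2.*) messages to their own file. -->
<File name="netcdfJava" fileName="${tds.log.dir}/netcdfJava.log" append="false">
  <PatternLayout pattern="%d %c: %m%n"/>
</File>
<logger name="ucar.nc2" level="debug" additivity="false">
  <appender-ref ref="netcdfJava"/>
</logger>
```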
A few questions:
Yes, there is a scan element. To be clear, though, the problem isn't with the indexing; that part is fairly quick since I use a dateFormatMark. I have not tried any other service yet. I'll get things spun back up shortly to try out some more logging.
To confirm what's going on with the catalog cache, I used the following patch and compiled the v5.4 release of the code.
diff --git a/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java b/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
index b6df960409..f610f9c26a 100644
--- a/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
+++ b/tds/src/main/java/thredds/server/wms/ThreddsWmsServlet.java
@@ -40,6 +40,8 @@ import java.util.HashMap;
import java.util.Map;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
import thredds.core.TdsRequestedDataset;
import ucar.nc2.NetcdfFile;
import ucar.nc2.dataset.NetcdfDataset;
@@ -61,6 +63,7 @@ import ucar.nc2.dataset.NetcdfDataset;
public class ThreddsWmsServlet extends WmsServlet {
private static final Map<String, WmsCatalogue> catalogueCache = new HashMap<>();
+ static private final Logger logger = LoggerFactory.getLogger(ThreddsWmsServlet.class);
@Override
@RequestMapping(value = "**", method = {RequestMethod.GET})
@@ -82,8 +85,10 @@ public class ThreddsWmsServlet extends WmsServlet {
String removePrefix = null;
TdsRequestedDataset tdsDataset = new TdsRequestedDataset(httpServletRequest, removePrefix);
if (catalogueCache.containsKey(tdsDataset.getPath())) {
+ logger.info("Found " + tdsDataset.getPath() + " in the cache");
catalogue = catalogueCache.get(tdsDataset.getPath());
} else {
+ logger.info("Did not find " + tdsDataset.getPath() + " in the cache");
NetcdfFile ncf = TdsRequestedDataset.getNetcdfFile(httpServletRequest, httpServletResponse, tdsDataset.getPath());
NetcdfDataset ncd;
if (TdsRequestedDataset.useNetcdfJavaBuilders()) {
@@ -109,6 +114,7 @@ public class ThreddsWmsServlet extends WmsServlet {
if (netcdfFilePath == null) {
throw new EdalLayerNotFoundException("The requested dataset is not available on this server");
}
+ logger.info("Creating WMS Catalog for " + tdsDataset.getPath());
catalogue = new ThreddsWmsCatalogue(ncd, tdsDataset.getPath());
catalogueCache.put(tdsDataset.getPath(), catalogue);
}
And I added appropriate entries to log4j2.xml. To see the extraneous S3 fetches, add the following to the log4j2.xml file (I know you know this already, but including it here for completeness):
<File name="awssdk" fileName="${tds.log.dir}/awssdk.log" append="false">
<PatternLayout pattern="%d %c: %m%n"/>
</File>
<logger name="software.amazon.awssdk" level="debug" additivity="false">
<appender-ref ref="awssdk"/>
</logger>
<logger name="software.amazon.awssdk.request" level="debug" additivity="false">
<appender-ref ref="awssdk"/>
</logger>
Here is one of my entries. Unfortunately, the bucket isn't publicly available:
<dataset name="MODIS Terra NDVI 2.5km" ID="terra_ndvi_2500m" urlPath="ndvi/terra/ndvi_2500m">
  <metadata inherited="true">
    <serviceName>all</serviceName>
    <dataType>Grid</dataType>
  </metadata>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <variable name="ndvi">
      <remove name="_ChunkSizes" type="attribute"/>
    </variable>
    <aggregation dimName="time" type="joinNew" recheckEvery="15 min" >
      <variableAgg name="ndvi"/>
      <scan location="cdms3:aer-h2o-data?test/gimms/ndvi/#delimiter=/"
            dateFormatMark="GMOD09Q1.A#yyyyDDD" subdirs="false"
            regExp="GMOD09Q1.A[0-9]{7}.*\.nc$" />
    </aggregation>
  </netcdf>
</dataset>
When I get the server spun back up, I'll post some of my logging results.
Counting the number of times the same file gets fetched for a set of GetMap calls to display a single moment in time:
root@8bb99281bafb:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | tail -n 1
2022-10-27 21:05:09,566 software.amazon.awssdk.request: Sending Request: DefaultSdkHttpFullRequest(httpMethod=GET, protocol=https, host=aer-h2o-data.s3.amazonaws.com, port=443, encodedPath=/test/gimms/ndvi/GMOD09Q1.A2022249.08d.latlon.global.061.NDVI.subsampled.nc, headers=[amz-sdk-invocation-id, Range, User-Agent, x-amz-te], queryParameters=[])
root@8bb99281bafb:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
8749
(It still isn't done, and it has been several minutes).
Contents of the cache:
root@8bb99281bafb:/usr/local/tomcat# du -hsc content/thredds/cache/*
0 content/thredds/cache/agg
0 content/thredds/cache/aggNew
9.7M content/thredds/cache/catalog
0 content/thredds/cache/cdm
5.9M content/thredds/cache/collection
26M content/thredds/cache/edal-java
0 content/thredds/cache/grib
0 content/thredds/cache/ncss
0 content/thredds/cache/wcs
41M total
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/catalog/
total 9872
-rw-r----- 1 tomcat tomcat 28 Oct 27 20:53 catTracker.dat
-rw-r----- 1 tomcat tomcat 10100736 Oct 27 20:54 chronicle.datasets.dat.1
-rw-r----- 1 tomcat tomcat 0 Oct 27 20:53 datarootTracker.dat
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/collection/
total 6040
-rw-r----- 1 tomcat tomcat 6184960 Oct 27 20:53 GridDatasetInv.dat
root@8bb99281bafb:/usr/local/tomcat# ls -l content/thredds/cache/edal-java/
total 0
drwxr-x--- 3 tomcat tomcat 16 Oct 27 20:53 epsg
And as proof that the catalogue cache is being found and used, here is a snippet for a similar map layer:
root@f1b2af09dbaa:/usr/local/tomcat# grep ThreddsWmsServlet content/thredds/logs/threddsServlet.log
2022-10-27T21:21:55.929 +0000 [ 338180][ 33] INFO - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:55.954 +0000 [ 338205][ 34] INFO - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [ 338265][ 35] INFO - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [ 338265][ 37] INFO - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:56.014 +0000 [ 338265][ 36] INFO - thredds.server.wms.ThreddsWmsServlet - Did not find ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:21:57.579 +0000 [ 339830][ 34] INFO - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [ 339830][ 37] INFO - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [ 339830][ 33] INFO - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [ 339830][ 35] INFO - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:21:57.579 +0000 [ 339830][ 36] INFO - thredds.server.wms.ThreddsWmsServlet - Creating WMS Catalog for ndvi/terra/ndvi_mean_2500m
2022-10-27T21:22:04.371 +0000 [ 346622][ 39] INFO - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.374 +0000 [ 346625][ 40] INFO - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.385 +0000 [ 346636][ 41] INFO - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.387 +0000 [ 346638][ 42] INFO - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
2022-10-27T21:22:04.388 +0000 [ 346639][ 43] INFO - thredds.server.wms.ThreddsWmsServlet - Found ndvi/terra/ndvi_mean_2500m in the cache
Notice the sub-optimal behavior I pointed out before. Several catalogues are created in the initial flurry of requests, but eventually the cached one is found.
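As an aside on that race: the cache shown in the patch above is a plain HashMap guarded only by a containsKey/put pair, so a burst of concurrent first requests for the same path can each miss and each build its own catalogue. A minimal sketch of one way that window could be closed (not the actual TDS code; exception handling around the dataset open is glossed over):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: ConcurrentHashMap.computeIfAbsent runs the expensive build at
// most once per key, so a flurry of simultaneous first requests for the same
// path no longer creates several catalogues.
public class CatalogueCacheSketch {
  private static final Map<String, Object> catalogueCache = new ConcurrentHashMap<>();

  static Object getOrBuild(String tdsPath) {
    return catalogueCache.computeIfAbsent(tdsPath, path -> {
      // In the real servlet this would open the NetcdfDataset and construct the
      // ThreddsWmsCatalogue; checked exceptions would need wrapping here.
      return buildCatalogue(path);
    });
  }

  private static Object buildCatalogue(String path) {
    return new Object(); // placeholder for the catalogue object
  }
}
```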
As for the question about whether this problem is limited to WMS: it looks like it isn't as bad elsewhere, but it might still be happening. In the following greps of my logs, I first grep awssdk.log for the number of times a particular file shows up, and it comes up zero, indicating that I hadn't made a request for it yet. I then did a WCS GetCoverage call for a bounding box of the data on a particular day, and ran the same greps again.
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
12282
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022241 | wc -l
6711
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022233 | wc -l
0
... Do Single WCS request ...
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022249 | wc -l
12323
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022241 | wc -l
6711
root@af9a09352733:/usr/local/tomcat# grep httpMethod=GET content/thredds/logs/awssdk.log | grep GMOD09Q1.A2022233 | wc -l
679
I wonder if I might be misinterpreting the httpMethod=GET entries as fetches of the entire file when they are really fetches of chunks; I don't know the details of this. I will point out that this WCS GetCoverage call was significantly faster (~5 seconds) than OpenLayers building a map of the same area (5+ minutes). Also notice that the count for doy 249 increased after the WCS call. That particular day is the penultimate day in our dataset, so that might be from some sort of indexing action?
Continuing my debugging, I have added some logging to the S3-related classes in netcdf-java. One thing I noticed is that just querying for the Ncml document for the catalog results in a new scan of the directories in S3. I'm still trying to figure out what code actually triggers the S3 object fetch. I would have thought it would be something in the MFileS3 class, but my logs do not indicate that that is used to actually fetch the data.
So, I have found a couple of bugs in netcdf-java. Fixing them doesn't solve our problem here, but I'll submit the fixes shortly anyway.
Thanks for looking into this! We always appreciate help with bug fixes!
I think I see a problem: S3RandomAccessFile is sometimes passed the default_buffersize = 8092 from NetcdfFiles.java, whereas in other cases it uses its own much more reasonable default of defaultRemoteFileBufferSize = 262144 (256 KB). This might be the reason why there are insanely large numbers of S3 requests! I hope to have a fix soon and would definitely appreciate your help testing it.
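To put rough numbers on why that default matters, here is a back-of-the-envelope sketch (assuming purely sequential reads, one ranged GET per buffer fill, and ignoring any read-ahead the S3 layer may do, which is an assumption about the access pattern rather than a measurement):

```java
// Rough arithmetic only: number of ranged GETs needed to stream `bytesRead`
// bytes when each buffer miss fills one buffer of `bufferSize` bytes.
public class BufferSizeSketch {
  public static void main(String[] args) {
    long bytesRead = 100L * 1024 * 1024; // e.g. scanning ~100 MB of a file
    for (int bufferSize : new int[] {8092, 262144}) {
      long requests = (bytesRead + bufferSize - 1) / bufferSize; // ceiling division
      System.out.printf("buffer=%d bytes -> ~%d requests%n", bufferSize, requests);
    }
    // prints roughly 12959 requests for the 8092-byte buffer vs 400 for 256 KB,
    // i.e. the same order of magnitude as the request counts in the greps above.
  }
}
```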
I opened https://github.com/Unidata/netcdf-java/pull/1105 with the fixes for the bugs I noticed. One of them impacts the client cache, which might help us slightly here.
Your theory on the block sizes makes sense and is definitely worth testing out. It doesn't explain everything, but given that each request has to go through authentication for me, there is definitely a lot of overhead with each of those requests. Separately, we still need to figure out why THREDDS keeps wanting to re-download the same file over and over. Hopefully, that will be easier to figure out with cleaner behavior elsewhere in the toolchain.
The fix that you tested from https://github.com/Unidata/netcdf-java/pull/1106 is in the latest snapshot and docker image.
Now that the performance is somewhat improved, are you still having issues with WMS or should we close this issue?
Things are better, but caches are still not being utilized and memory usage still grows as more requests are made. I am currently doing some more testing and logging to determine if the problem is in netcdf-java, tds, or ncWMS.
It is possible that there are two inter-related issues: one a caching issue in netcdf-java, and the other an issue with ncWMS trying (and failing) to manage its own cache (the code isn't quite clear at the moment). So, right now, I'm trying to constrain the problem by exercising the WCS endpoint, since it is relatively simple compared to WMS and therefore makes it easier to track down certain problems.
Turned up the logging in FileCache.java for netcdf-java, and there is definitely something weird going on:
[2022-11-03T17:41:35.568+0000] FileCache NetcdfFileCache acquire cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/
[2022-11-03T17:41:35.570+0000] FileCache NetcdfFileCache add to cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/
[2022-11-03T17:41:35.570+0000] CacheElement add to cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ NetcdfFileCache
[2022-11-03T17:41:35.577+0000] FileCache NetcdfFileCache release cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/; hash= 194111868
[2022-11-03T17:41:35.708+0000] FileCache NetcdfFileCache acquire from cache cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/ failed: cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter= (No such file or directory)
[2022-11-03T17:41:35.708+0000] FileCache NetcdfFileCache remove cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/
Notice that the final / is missing from the IOException's message. I'm still tracing things down, but my suspicion is that there is code naively treating the location as a path and stripping the final slash, thinking it indicates a directory, maybe? Of course, the bigger question is why it thinks this is a filepath and not a URL with query parameters...
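One plausible mechanism for the dropped slash, sketched below: if the location string ever gets funnelled through java.io.File (for example while trying to build a local random access file), File's pathname normalization silently strips a trailing separator, which would produce exactly the truncated string seen in the exception message. Where exactly this happens is an assumption, not a confirmed trace.

```java
import java.io.File;

// Demonstrates java.io.File's pathname normalization dropping a trailing '/'
// (on a Unix-like platform, where the separator character is '/').
public class TrailingSlashSketch {
  public static void main(String[] args) {
    String location = "cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=/";
    System.out.println(new File(location).getPath());
    // prints: cdms3:aer-h2o-data?test/drought/2022/GDMv4_202208.nc#delimiter=
  }
}
```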
Thanks for the extra debugging help! I had a quick look, and indeed it looks like the NetcdfFileCache is not working correctly for S3. It does put the S3 file in the cache but cannot use it again. It looks to me like this is because the cache tries to reacquire by using a RandomAccessFile, which only works for local files, and which is throwing the IOException you mentioned. Instead it probably should use NetcdfFiles::getRaf to create the appropriate type of random access file, which would work for S3. I will have a look at how to test and fix this!
Could the solution be to implement the reacquire() method in S3RandomAccessFile, or even RemoteRandomAccessFile? Sorry if this doesn't make sense; my understanding of polymorphism is rooted in Python.
As for testing, I noticed that the FileCache object keeps miss and hit counts. So one could write up a test that fetches a particular S3 Dataset twice, and checks the number of hits and misses after that.
Oh, I think I'm starting to see the flow. From my logging of methods and getClass()'s:
- FileCache.acquireCacheOnly() calls NetcdfFile.reacquire()
- NetcdfFile.reacquire() calls H5iospNew.reacquire()
- H5iospNew.reacquire() calls super.reacquire() (super refers to AbstractIOServiceProvider, I believe)
- AbstractIOServiceProvider.reacquire() calls RandomAccessFile.reacquire() directly, without knowing what the original object that triggered the reacquire came from.

Without me fully understanding what the IOServiceProvider is supposed to do, perhaps the NetcdfFile object's iosp is where we could place some logic that recognizes which reacquire() (and other related methods) to call?
Oh, wait, slightly incorrect: AbstractIOServiceProvider.reacquire() calls a one-parameter version of reacquire(location), which isn't implemented in RandomAccessFile. I can't find this anywhere...

Gosh, sorry, my eyes are getting crossed... AbstractIOServiceProvider.reacquire() calls RandomAccessFile.acquire(location) (acquire vs. reacquire).
Ok, a possible design limitation, maybe? In RandomAccessFile.java:
public static RandomAccessFile acquire(String location) throws IOException {
  if (cache == null) {
    log.debug("No cache, building RandomAccessFile from {}", location);
    return new RandomAccessFile(location, "r");
  } else {
    log.debug("There is a cache. Acquiring using factory {} for location {} from {}", factory, location, cache);
    return (RandomAccessFile) cache.acquire(factory, DatasetUrl.create(null, location));
  }
}
(logging is mine)
This is a static method utilizing a static factory member. This FileFactory.open() is set to always return a RandomAccessFile. Meanwhile, RemoteRandomAccessFile doesn't override this factory instance, nor does it override acquire(). So even if we were able to get RemoteRandomAccessFile.acquire() to be called, it would still be working with a regular RandomAccessFile instance.
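To illustrate why that matters without reproducing netcdf-java internals, here is a standalone analogue (all class names hypothetical): static methods and static fields are resolved against the class that declares them, so a subclass cannot "override" them, and a hard-coded factory in the base class will always hand back base-class instances.

```java
// Standalone analogue (hypothetical classes, not netcdf-java): a static acquire()
// with a hard-coded construction can only ever produce the base type, no matter
// which subclass exists. Routing creation through a location-aware factory is
// what would let the remote/S3 variant come back.
class BaseRaf {
  static BaseRaf acquire(String location) {
    return new BaseRaf(); // always the base type; subclasses can't change this
  }
}

class RemoteRaf extends BaseRaf {
  // Even if we "called" RemoteRaf.acquire(location), static dispatch means
  // BaseRaf.acquire runs and a plain BaseRaf comes back.
}

class RafFactory {
  // A factory that inspects the location can return the right subclass.
  static BaseRaf open(String location) {
    return location.startsWith("cdms3:") ? new RemoteRaf() : new BaseRaf();
  }
}

public class StaticFactorySketch {
  public static void main(String[] args) {
    System.out.println(BaseRaf.acquire("cdms3:bucket?key").getClass().getSimpleName());  // BaseRaf
    System.out.println(RafFactory.open("cdms3:bucket?key").getClass().getSimpleName());  // RemoteRaf
  }
}
```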
I think I'm at my limit of understanding how the caching mechanism is supposed to work, so I'm going to back off now. Let me know if there is anything you'd like me to try out or suggestions for additional logging, or anything else that would help you in figuring out a fix for this.
One thing that is a nagging question in my mind is why the cache is even doing acquisitions in the first place. The NetcdfFile instance is stored in and retrieved from the cache, so I don't understand what more needs to be done.
Looking at this again with fresh eyes, there might be some sort of semi-recursion or something going on, if I am reading this right. The following is for a NetcdfDatasets.acquireFile() call after the initial one done for inventorying the coordinates, so there is a cache hit at first.
- FileCache.acquireCacheOnly() calls NetcdfFile.reacquire() -- cache hit!
- NetcdfFile.reacquire() calls H5iospNew.reacquire()
- H5iospNew.reacquire() calls super.reacquire() (super refers to AbstractIOServiceProvider, I believe)
- AbstractIOServiceProvider.reacquire() calls RandomAccessFile.acquire(location)
- RandomAccessFile.acquire(location) calls FileCache.acquire(FileFactory(RandomAccessFile), location)
- FileCache.acquire(FileFactory(RandomAccessFile), location) calls the many-arg form of FileCache.acquire(...)
- FileCache.acquire(...) calls FileCache.acquireCacheOnly(), and we are right back at the top again.

The only thing different the second time through (actually, third time, but who's counting?) is that for some reason, it is a cache miss. After the cache miss, it then goes and does a straight-up open, but it uses the RandomAccessFile as the factory class instead of NetcdfDatasets.StandardFileFactory.

So, it seems like there are two problems... maybe even three:
1. Should FileCache.acquireCacheOnly() end up calling itself in a round-about way? Could this get short-circuited?
2. Why is it a cache miss the second time around, while acquireCacheOnly() is recursing?
3. Could the factory be passed along so that when the acquire() after a hit fails, it can properly rebuild the object using the factory?

To answer my question earlier about why one would need to do a re-acquire in the first place even though we have the NetcdfFile object: it looks like it is necessary to re-establish the file object that backs the NetcdfFile object. If I comment out the acquire() that is called for "rehydrating" the NetcdfFile object, I'll eventually get a NullPointerException:
java.lang.NullPointerException: null
at ucar.nc2.iosp.IospHelper.readData(IospHelper.java:116) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.iosp.IospHelper.readDataFill(IospHelper.java:58) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.internal.iosp.hdf5.H5iospNew.readDataPrimitive(H5iospNew.java:576) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:390) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:264) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.internal.iosp.hdf5.H5iospNew.readData(H5iospNew.java:200) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.NetcdfFile.readData(NetcdfFile.java:2124) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.Variable.reallyRead(Variable.java:797) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.Variable._read(Variable.java:736) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.Variable.read(Variable.java:614) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.internal.ncml.AggProxyReader.reallyRead(AggProxyReader.java:37) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.dataset.VariableDS._read(VariableDS.java:432) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.Variable.read(Variable.java:614) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.dataset.CoordinateAxis1D.readValues(CoordinateAxis1D.java:860) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.dataset.CoordinateAxis1D.doRead(CoordinateAxis1D.java:757) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.dataset.CoordinateAxis1D.getCoordValue(CoordinateAxis1D.java:249) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.ft2.coverage.adapter.DtCoverageAdapter.makeCoordAxis(DtCoverageAdapter.java:218) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.ft2.coverage.adapter.DtCoverageAdapter.factory(DtCoverageAdapter.java:64) ~[cdm-core-5.5.3.jar:5.5.3]
at ucar.nc2.ft2.coverage.CoverageDatasetFactory.openNcmlString(CoverageDatasetFactory.java:142) ~[cdm-core-5.5.3.jar:5.5.3]
at thredds.core.DatasetManager.openCoverageDataset(DatasetManager.java:429) ~[classes/:5.4]
The raf variable was null.
This makes me wonder how caching would work for an S3-backed NetcdfFile. If the code is aggressively closing the NetcdfFile objects, thereby dropping their (what I assume is an) S3RandomAccessFile instance, then fixing this reacquire() problem would only get us partway there, because the data itself still isn't cached and would have to be re-fetched from S3 (assuming I am understanding everything correctly, which is a huge assumption). How does this work for a local file? Is reacquire() re-establishing the file object and re-reading data from the disk?
Could the solution to this problem be to have a "cdms3://" version of NetcdfFileProvider, much like the CdmRemoteNetcdfFileProvider class and its supporting classes? I'm not fully understanding what exactly it is used for, but if one existed and was registered for cdms3:// addresses, then it would intercede in the file acquisition process before FileCache.acquireCacheOnly(). Maybe it would be able to provide a proper spiObject instead of null?
Actually, that idea of using NetcdfFileProvider seems to be an abuse of the system. I'm now looking back at what you originally suggested, using NetcdfFiles::getRaf(). I think I figured out the right location for it: in AbstractIOServiceProvider::reacquire().
Ugh, there's a snag. ncfile is of type NetcdfFile, which has a private getRaf() method, which we can't call from AbstractIOServiceProvider::reacquire(). NetcdfFiles (plural) has a public one, but I don't think it would get us the proper RAF anyway. In fact, I'm trying to figure out how the proper RAF gets determined in the first place.
Oh, wait! NetcdfFiles::getRaf() is a static public method! I could use that regardless of what class ncfile is!
That worked! The cache is a bit better behaved now. Memory seems to grow a bit more slowly, and we are getting far fewer cache misses. It is still re-fetching the S3 data at about the same rate, as I suspected a few posts ago, but I consider this another step forward. I'll see if I can design a unit test and submit a PR.
With https://github.com/Unidata/netcdf-java/pull/1111 merged in, we should start seeing better cache utilization in the WMS (and other) components. My next target for better cache utilization is to see if we can get WMS to use the enhanced dataset object that netcdf-java has already put together, rather than building it again for its own version of the gridded dataset class. This would cut down significantly on extraneous reads, especially when the dataset requires reading the coordinate data.
Another possible area of improvement is to see if WMS really needs to keep opening and closing the files while building the dataset. It will open a file (which requires a HEAD operation for S3 and a fetch of the first block every time), read a single coordinate variable (usually another fetch), then close the file. Then it opens the file again (new HEAD and fetch) and reads the next coordinate variable (usually in the same block as the first, but that block is gone, since closing the file drops the remote read cache). Another way to approach this problem is to allow the remote read cache to persist between reacquires.
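For illustration, here is the access-pattern difference in miniature (the location and variable names are made up, and NetcdfFiles.open is used only to keep the example small; the real code path goes through the dataset/acquire layer):

```java
import java.io.IOException;
import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;

// Sketch only: reading both coordinate variables inside one open costs a single
// HEAD plus initial fetch, and the second read can often be served from the
// remote read buffer. Doing open/read/close per variable pays the HEAD + fetch
// each time and throws the buffered bytes away in between.
public class CoordinateReadSketch {
  public static void main(String[] args) throws IOException {
    String location = "cdms3:some-bucket?path/to/file.nc#delimiter=/"; // hypothetical
    try (NetcdfFile ncf = NetcdfFiles.open(location)) {
      // assumes variables named "lat" and "lon" exist in the file
      Array lat = ncf.findVariable("lat").read();
      Array lon = ncf.findVariable("lon").read();
      System.out.println(lat.getSize() + " x " + lon.getSize());
    }
  }
}
```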
I believe we fixed several performance issues and the issue related to the cache not being used for S3, so I am going to close this. It sounds like you still have other ideas for improvements-- feel free to put them in a new issue.
Using v5.4 (with the STS patch).
I have been hunting down performance issues serving WMS layers for certain datasets served from S3 for the past couple of days (both memory use and response time are much too high). I added some logging configs for awssdk and discovered that, to load a single moment in time, the same file on S3 was being fetched 360 times for one of my datasets, and 11,696 times for another, higher-resolution dataset!