cwardgar opened 7 years ago
Our point stack exhibits a structure-oriented data access pattern, meaning that one value from each data variable is needed for each observation that it returns. The problem is that we're making an HTTP request for each value because the data is not truly laid out as a collection of structures (i.e. records). Instead, we're cheating by organizing the variables into a pseudo-structure: a collection of variables which all have the same outer dimension (it's not a real Structure because the values are not stored contiguously). All CF DSG layouts require this pseudo-structure interpretation, which means they'll all have the same poor read performance by our point stack.
To improve performance, we'd like to make fewer requests by grabbing more than one datum each time and caching the unneeded data that we get back for subsequent calls. But how? We can't just naively cache the entire variable. What if it's huge? Incidentally, we already do this for 1D coordinate axes, regardless of their size. I'm surprised we haven't been bitten by that yet.
Perhaps we can cache only some of the data? For example, if the user requests `temperature[0:1:0]`, we actually grab `temperature[0:1:100]` and cache the excess, hoping that the user will read the other values within that range next. Obviously, effectiveness depends on the access pattern, but this might be a reasonable strategy, especially for point data.
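To make the over-read idea concrete, here's a minimal sketch (all names hypothetical, not actual CDM API): a miss on one index fetches a whole fixed-size block in a single remote call and caches it, so later reads within that block are free. A real implementation would also need to clamp the block to the variable's shape and bound the cache's total size.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of over-reading: a miss on index i fetches the whole
// block containing i in one remote call; later reads in that block are hits.
public class ReadAheadCache {
    static final int BLOCK = 100;  // how much to over-read per request

    private final Map<Integer, double[]> blocks = new HashMap<>();
    private final BiFunction<Integer, Integer, double[]> remoteRead; // (first, count) -> values
    int remoteRequests = 0;        // instrumentation for the example

    public ReadAheadCache(BiFunction<Integer, Integer, double[]> remoteRead) {
        this.remoteRead = remoteRead;
    }

    public double get(int index) {
        int blockStart = (index / BLOCK) * BLOCK;
        double[] block = blocks.get(blockStart);
        if (block == null) {       // miss: fetch and cache the whole block
            // real code must clamp the count to the variable's length
            block = remoteRead.apply(blockStart, BLOCK);
            blocks.put(blockStart, block);
            remoteRequests++;
        }
        return block[index - blockStart];
    }
}
```

With sequential access, 200 single-value reads collapse into just 2 remote requests in this sketch; the worst case (strided access wider than the block) degenerates back to one request per read.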
@JohnLCaron Any thoughts on this?
> The problem is that we're making an HTTP request for each value

Through opendap? I've given up on opendap for this reason, and cdmremote was the attempt to solve it.
> To improve performance, we'd like to make fewer requests by grabbing more than one datum each time
The whole point of DSG is to use iterators instead of direct access, allowing us to efficiently cache.
So to fill in the blanks, by having the iterator model for access, we can prefetch however we want when making the request, without the user ever knowing about it.
Side note: it's interesting that this problem seems to correspond exactly to the way overallocation of vectors/lists allows one to avoid O(N^2) behavior when appending. (I realize we're missing the copy that causes the N^2, but still interesting.)
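A throwaway illustration of that analogy (nothing CDM-specific): with capacity doubling, n appends trigger only about log2(n) reallocations, the same way over-fetching in blocks turns n single-value requests into n/blockSize batched ones.

```java
// Capacity doubling: count how many reallocations n appends would cause.
public class Doubling {
    public static int reallocations(int n) {
        int capacity = 1, reallocs = 0;
        for (int size = 1; size <= n; size++) {
            if (size > capacity) {   // grow geometrically instead of by 1
                capacity *= 2;
                reallocs++;
            }
        }
        return reallocs;
    }
}
```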
> The problem is that we're making an HTTP request for each value
>
> Through opendap? I've given up on opendap for this reason, and cdmremote was the attempt to solve it.
CDM Remote exhibits the same behavior:
```
CdmRemote request http://localhost:8080/thredds/cdmremote/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc?req=header
CdmRemote request http://localhost:8080/thredds/cdmremote/local-tds-data/support/2OQ0t7/NERSC_ARC_PHYS_OBS_XBT_2012_v1.nc?req=header took 50 msecs
Obs 1:
CdmRemote data request for variable: 'temperature' section=(0:0)
CdmRemote data request for variable: 'pressure' section=(0:0)
CdmRemote data request for variable: 'svel' section=(0:0)
CdmRemote data request for variable: 'z' section=(0:11364)
Obs 2:
CdmRemote data request for variable: 'temperature' section=(1:1)
CdmRemote data request for variable: 'pressure' section=(1:1)
CdmRemote data request for variable: 'svel' section=(1:1)
Obs 3:
CdmRemote data request for variable: 'temperature' section=(2:2)
CdmRemote data request for variable: 'pressure' section=(2:2)
CdmRemote data request for variable: 'svel' section=(2:2)
Obs 4:
CdmRemote data request for variable: 'temperature' section=(3:3)
CdmRemote data request for variable: 'pressure' section=(3:3)
CdmRemote data request for variable: 'svel' section=(3:3)
Obs 5:
CdmRemote data request for variable: 'temperature' section=(4:4)
CdmRemote data request for variable: 'pressure' section=(4:4)
CdmRemote data request for variable: 'svel' section=(4:4)
```
I've narrowed the performance problem to `StructureDataIteratorLinked.next()`. Here we read a single `StructureData` (record) at a time, which causes the HTTP requests for single values. Instead, we could be prefetching with a call like:

```java
ArrayStructure records = s.readStructure(currRecno, count);
```
So yeah, I definitely like the idea of prefetching in the iterators rather than prefetching in the Variables, since the iterators have a known access pattern. I'll change the title of this issue.
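For illustration, here's a simplified sketch of what a batched `next()` could look like (stand-in types, not the real iterator; the real `StructureDataIteratorLinked` follows next-record indices, so this assumes the consecutive-record case, as in contiguous ragged arrays). The batch call plays the role of `s.readStructure(currRecno, count)`:

```java
import java.util.function.BiFunction;

// Hypothetical sketch: the caller still gets one record per next(), but the
// iterator refills its buffer with a batch of consecutive records per remote
// call, in the shape of Structure.readStructure(currRecno, count).
public class BatchedRecordIterator {
    private final BiFunction<Integer, Integer, Object[]> readStructure; // (recno, count) -> records
    private final int firstRecno, numRecords, bufferSize;
    private Object[] buffer = new Object[0];
    private int currRecno, posInBuffer = 0;
    int remoteRequests = 0;   // instrumentation for the example

    public BatchedRecordIterator(BiFunction<Integer, Integer, Object[]> readStructure,
                                 int firstRecno, int numRecords, int bufferSize) {
        this.readStructure = readStructure;
        this.firstRecno = firstRecno;
        this.numRecords = numRecords;
        this.bufferSize = bufferSize;
        this.currRecno = firstRecno;
    }

    public boolean hasNext() { return currRecno < firstRecno + numRecords; }

    public Object next() {
        if (posInBuffer == buffer.length) {   // buffer exhausted: prefetch a batch
            int count = Math.min(bufferSize, firstRecno + numRecords - currRecno);
            buffer = readStructure.apply(currRecno, count);
            posInBuffer = 0;
            remoteRequests++;
        }
        currRecno++;
        return buffer[posInBuffer++];
    }
}
```

With a buffer of 1000 records, the example file's 11365 observations would take 12 remote requests per variable instead of 11365.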
hmmm, i have forgotten the details, so i will have to review them at some point.
perhaps it's the "cdm feature" API that is supposed to solve the problem by caching in the iterators.
It turns out that we already have this prefetch capability: `StructureDataIterator.setBufferSize(int bytes)`. However, the only subclass that actually implements it is `Structure.IteratorRank1`. A better solution would be to move that functionality to a wrapper, and decorate other `StructureDataIterator`s with it.
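A rough sketch of that decorator shape (all names hypothetical, not the CDM types): the buffering logic lives in one reusable wrapper, and the wrapped source only needs to expose a bulk read, so `setBufferSize()`-style behavior is honored in a single place rather than reimplemented in every `StructureDataIterator` subclass.

```java
// Hypothetical minimal stand-in for a structure that supports batched reads.
interface BatchedSource {
    int size();                           // total number of records
    Object[] read(int start, int count);  // one "remote" call for count records
}

// Decorator that adds read-ahead buffering on top of any BatchedSource,
// while presenting plain one-record-at-a-time iteration to the caller.
public class BufferingDecorator {
    private final BatchedSource source;
    private final int bufferCount;        // plays the role of setBufferSize()
    private Object[] buffer = new Object[0];
    private int nextRecno = 0, pos = 0;

    public BufferingDecorator(BatchedSource source, int bufferCount) {
        this.source = source;
        this.bufferCount = bufferCount;
    }

    public boolean hasNext() { return nextRecno < source.size(); }

    public Object next() {
        if (pos == buffer.length) {       // refill with one batched read
            buffer = source.read(nextRecno, Math.min(bufferCount, source.size() - nextRecno));
            pos = 0;
        }
        nextRecno++;
        return buffer[pos++];
    }
}
```

The decorator preserves the delegate's record order exactly; only the granularity of the underlying reads changes.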
Something else I noticed: 3fec809ad6c638fcfb8c13889534e1ee4f8ec53c removes `bufferSize` from `PointFeatureIterator`s (and many of the underlying `StructureDataIterator`s). Was that a mistake if we want to enable prefetching? Should I revert it?
Also, for many `PointFeatureCollection`s, there wasn't an easy way to change the `bufferSize` of the underlying `PointFeatureIterator`, even before the above commit. That needs to change.
Man, at this point, if the (Jenkins) tests pass on 5.0 with a revert and the additions to the `StructureDataIterator`s, I'd say go for it and we'll deal with the repercussions, if any 👍
Is this needed for 4.6.x?
4.6 could certainly use it; my comments today were the result of an issue I'm working on with Yuan in the IDV relating to slow reading of PointFeatures. However, the point stack has changed so drastically in 5.0 that I'd have to implement 2 completely different fixes. I'm not gonna do that, so this'll be 5.0-only.
TLDR: Reading remote CF DSG datasets via OPeNDAP and CDM Remote (and probably DAP4) is terribly slow.
This issue was originally raised in a netcdf-java mailing list message. Example dataset is a contiguous ragged array representation of profiles:
I popped it on a local THREDDS server and read the first 5 observations using both OPeNDAP and CDM Remote. I used the code:
Results for OPeNDAP, with `ucar.nc2.dods.DODSNetcdfFile.debugServerCall = true`:

Results for CDM Remote, with `ucar.nc2.stream.CdmRemote.showRequest = true`, are similar.

So, to read the entire example file as a `FeatureDatasetPoint`, we'd need to make roughly `11365 * 3 = 34095` HTTP requests! And the file is only 182 KB! As you can imagine, that's very slow. It took ~25 seconds when reading from my local server, but upwards of an hour when reading from an external server.