GeoTIFF / geoblaze

Blazing Fast JavaScript Raster Processing Engine
http://geoblaze.io
MIT License
181 stars 28 forks source link

investigate multi-band performance #226

Open DanielJDufour opened 11 months ago

DanielJDufour commented 11 months ago

Describe the bug Performance seems to be better when you split each band into a separate file; rather than combing them all into one file. This is counter-intuitive, so will need investigation

To Reproduce Steps to reproduce the behavior: I've written a test, which I'll share

Expected behavior Multiband case is faster than separate band case.

Screenshots none

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

rought benchmarks combined multiband with geoblaze.load: 5,885ms combined multiband with geoblaze.parse: 16,683ms

separated bands with geoblaze.load: 3,138ms (1250 + 1121 + 1064) separated bands with geoblaze.parse: 6,921ms (2364 + 2230 + 2245)

DanielJDufour commented 10 months ago

Welp, we can rule out georaster initliazation as the issue. I ran the same performance tests with just the initial parseGeoraster step (excluding the statistical calculations) and the results were as expected.

Singleband raster analysis took 287 ms Multiband raster analysis took 21 ms

Parsing all the single band rasters will naturally take longer because we'll have to issue a GET request for each of the 20 or so files' metadata whereas the multiband scenario will require just one fetch of the metadata (usually first 1024 bytes).

However, and most importantly, this initial metadata parsing contributes very little to the overall length of time it takes to run stats on the multiband raster ( 0.021 seconds vs. 23 seconds).

That being said, the initial metadata parsing does take up more than a full half of the time for the single band raster analysis (~287ms out of ~506ms). Meaning that, in theory, running stats on a multiband raster, could be faster someday once the main issue is fixed.

DanielJDufour commented 10 months ago

Further investigation reveals the bulk of the extra time is spent calling georaster.getValues. It takes on average 4ms for the single-band rasters, but takes 24,143 ms for the multiband raster. In other words, 99.9% of the time (24143ms of the total 24165ms) is spent fetching the data with getValues.

DanielJDufour commented 10 months ago

The plot thickens. I ran a test on geotiff.js readRasters method, which is what georaster uses and it doesn't seem like there's any critical issue with geotiff.js.

Singleband raster analysis took 186 ms Multiband raster analysis took 928 ms

I'm still curious why it takes more times to read a multiband raster than all rasters individually. In theory, reading the multiband should be faster for reasons expressed above. But it's a very small factor in the abnormally large geoblaze time.

DanielJDufour commented 10 months ago

oof, so when I copy the getValues function directly from the georaster source code, it runs very fast. HOWEVER, when I run georaster.getValues from the compiled georaster code it's SUPER slow. My current hypothesis is that the webpack building process is adding in a backwards compatability polyfill that is massively slowing things down. I'll keep digging!

DanielJDufour commented 10 months ago

well I ran the compiled version of the georaster code and it's fast, so I'm at a bit of a loss. It might be an issue with the way the georaster build compiled geotiff.js or another dependency. I'll look into it more.

DanielJDufour commented 10 months ago

Okay. I can confirm that the issue is with the way GeoRaster compiled geotiff.js. GeoRaster has a private property called _geotiff that gives you access to a geotiff.js instance of your raster. It's technically not the same geotiff.js instance as the one initially created to read the georaster-relevant metadata bc that's off the main thread.

    const georaster = await parseGeoRaster(url);
    const tiff = georaster._geotiff;
    await tiff.readRasters({...})
DanielJDufour commented 10 months ago

Fixing the georaster build process or getting the next major release out the door is going to be a large lift. However, here's the quick hacky fix for georasters that are loaded via url:

import { fromUrl } from "geotiff";

async function fix(georaster) {
  if (typeof georaster._data === "string" && georaster._geotiff) {
    georaster._geotiff = await fromUrl(georaster._data);
  }
}

const georaster = await parseGeoRaster(multiUrl);
await fix(georaster);
await geoblaze.sum(georaster, geometry);
avmey commented 7 months ago

I'm having trouble with this workaround, particularly using geotiff and ESM.

Error [ERR_REQUIRE_ESM]: require() of ES Module /workspaces/test-proj/node_modules/quick-lru/index.js from /workspaces/test-proj/node_modules/geotiff/dist-node/source/blockedsource.js not supported.

I found this thread you and Tim posted in, but was curious if these are still the most up to date solutions.

aLemonFox commented 7 months ago

Hi, is there any news on this? I have a cog of around ~8MB and it easily takes 30+ seconds to identify a single point from multiple bands (including fetching).