zakjan opened this issue 3 years ago
@zakjan Thanks for reporting this. I'm on it!
Thanks!
@zakjan
This should be fixed now (6b2683762a46d2f0850833feea4d825e77d9f4c8 and v1.0.4). Please verify.
No, I'm still getting the same error. Does it work for you on the dev page? Note that `AggregateError: Request failed (caused by response undefined)` is thrown first, if that means anything.
@constantinius I will try to get to writing this up more fully on Monday, but I was working on the abort controller and noticed something similar. At least with that feature, I found the `LRUCache` to be problematic. I will try to write up something more formal soon.
@zakjan I think that the test page is broken (introduced by a failed merge of mine 🙄); I will fix it. COG-Explorer works again now, so I'm quite certain the fix is in.
@ilan-gold I see. Yes, please do. This whole blocked source thing has been giving me headaches for too long now...
Okay, I fixed the test page. It's true, I still sometimes get the `Block is not in the block requests` error, but to a far lesser extent than before. I think there is still a race condition somewhere.
@constantinius I was able to get the test page working, except for 404s for some TIFF files I was not able to generate due to a GDAL issue. If you could post some of the non-working TIFFs, that would be great.
I believe that these lines: https://github.com/geotiffjs/geotiff.js/blob/a0153dc0d4956fca78e84bd9ce26d1995c89d32d/src/source/blockedsource.js#L104-L118 contain a race condition. The scenario goes:

1. `fetch-1` is initiated and adds `blockA` to `this.blockIdsToFetch` and `missingBlockIds`.
2. `fetch-2` is initiated and only finds `blockA` in `this.blockIdsToFetch`, but not in `this.blockRequests`.
3. `fetch-2` needs that block for its own request (or fetches it anyway, as is now possible), calling `this.blockIdsToFetch.clear()` and `this.blockRequests.delete(blockId)` at the end of its request.
4. `fetch-1` comes looking for the block in either `this.blockCache` or `this.blockRequests`, but never finds it, because `fetch-2` deleted it after completing its request so quickly, and the block was pushed out of the cache by others.

I think this is very unlikely to happen, but it is theoretically possible without the abort signal (and it becomes very clear if you start using abort controllers, because `fetch-2` can complete its request very quickly if it just aborts).
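A condensed, hypothetical reproduction of that interleaving (not the actual blockedsource.js code; a plain `Map` with manual eviction stands in for the `LRUCache`, and a resolved timeout stands in for the network):

```js
const blockIdsToFetch = new Set();
const blockRequests = new Map();
const blockCache = new Map(); // pretend: an LRUCache with maxSize = 1

function cachePut(id, data) {
  blockCache.set(id, data);
  while (blockCache.size > 1) {
    // evict the oldest entry, as a full LRU would
    blockCache.delete(blockCache.keys().next().value);
  }
}

async function fetch1() {
  // step 1: fetch-1 queues blockA; no request promise is registered yet
  blockIdsToFetch.add('blockA');
  await new Promise((resolve) => setTimeout(resolve, 10)); // fetch-2 runs here
  // step 4: fetch-1 looks for the block, but fetch-2 has torn everything down
  if (!blockCache.has('blockA') && !blockRequests.has('blockA')) {
    throw new Error('Block is not in the block requests');
  }
}

async function fetch2() {
  // steps 2-3: fetch-2 sees blockA queued but not in blockRequests, fetches it
  // itself (instantly, if it just aborts), then clears the shared bookkeeping
  cachePut('blockA', 'data');
  cachePut('blockB', 'data'); // pushes blockA out of the tiny cache
  blockIdsToFetch.clear();
  blockRequests.delete('blockA');
}

Promise.allSettled([fetch1(), fetch2()]).then((results) =>
  results.forEach((r) => console.log(r.status, r.reason?.message ?? '')));
// => "rejected Block is not in the block requests" for fetch-1
```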
Neither `fetch-1` nor `fetch-2` in this scenario ever actually adds `blockA` to `blockRequests`, so simply deleting the error at the end of the `if...else if...else` in the above will not help, since `let results = await Promise.allSettled(blockRequests.values());` will not have that block. Thus its error can neither be checked by the subsequent code nor refetched.
I started a new branch to resolve this using the mechanism in #182, and it seems to work well (likely because it looks very similar to the old code, which seemed to work), except that the LRU cache doesn't work with it. That seems understandable, since the cache is expected to stay consistent throughout the whole `fetch` call. So setting it to have infinite size (or, equivalently, going back to a `Map`) should work. I would be in favor of going back to a `Map`, because I think the `LRUCache` is fairly prone to race conditions if you have requests coming in and out quickly enough (for example, calling `has` doesn't guarantee that `get` will actually return anything, from what I understand; see the sketch below). However, I don't know if people have complained about memory issues, so I would understand keeping it.
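To illustrate the `has`/`get` concern, a hypothetical pattern (not the actual geotiff.js code):

```js
// On a bounded cache, a has() check does not protect a later get(): any await
// in between lets concurrent fetches run and possibly evict the entry.
async function readBlock(cache, blockId, otherWork) {
  if (cache.has(blockId)) {
    await otherWork(); // other fetch() calls may run here and evict blockId
    return cache.get(blockId); // may now be undefined despite the check above
  }
  return undefined;
}
```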
Let me know if this makes sense. I'd love to better understand the use case that motivated changing the code so much, and to make progress on this. Thanks, and sorry for the delay!
Hi @ilan-gold
> I was able to get the test page working, except for 404s for some TIFF files I was not able to generate due to a GDAL issue.
If you are referring to LERC compressed files, this is caused by the recent merging of https://github.com/geotiffjs/geotiff.js/pull/206, which requires either GDAL built with internal libtiff (most GDAL packages are built with system libtiff instead) or GDAL master (to be released as GDAL 3.3). You can safely comment these files out for now.
It did seem that those were the only ones with errors. Could you post a link (ideally), or point to one file, for which the current implementation of geotiff.js runs into this error? It's possible there is none; I just couldn't fully follow the thread here.
@zakjan All of those files worked fine for me. I think the issue only happens with a high velocity of reads. The only file that I couldn't find/download/create was `5ae862e00b093000130affda.tif`.
@constantinius One option that just occurred to me would be to not use the cache at all when an `AbortSignal` is used. I am going to play around with this, but it might make sense to skip the cache whenever a `signal` is passed in, because 1) it is a browser-only technology, from what I understand, 2) browsers have caches anyway, and 3) this aborting mechanism acts as a good throttle against making too many requests for too much data.

This would also allow you to keep the `LRUCache`, which I imagine helps geospatial-oriented/Node applications more than it does these "high velocity" image pyramids.
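Something like the following, with hypothetical method and class names rather than the real `BlockedSource` internals:

```js
// A sketch of the idea, not the actual BlockedSource implementation:
// bypass the shared block cache entirely whenever a signal is passed.
class SketchSource {
  constructor(source) {
    this.source = source;   // the underlying remote source
    this.cache = new Map(); // stands in for the LRUCache
  }

  async fetch(offset, length, signal) {
    if (signal) {
      // No shared bookkeeping: nothing for a competing request to evict,
      // and an abort cannot corrupt state that other fetches depend on.
      return this.source.fetch(offset, length, signal);
    }
    return this.fetchWithCache(offset, length); // hypothetical cached path
  }
}
```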
This started happening after I added an abort controller signal to the `readRGB` function and then aborted it. The error does not propagate to `readRGB`: my `readRGB` call is in a try/catch, but an `AbortError` is never thrown. Also, even though my try/catch catches the `Block is not in the block requests` error, it is still thrown and logged to the console before it reaches my try/catch.
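The pattern in question looks roughly like this (assuming a remote file opened with `fromUrl`; the expectation is that an `AbortError` surfaces in the catch):

```js
import { fromUrl } from 'geotiff';

const tiff = await fromUrl(url); // url: some remote GeoTIFF
const image = await tiff.getImage();

const controller = new AbortController();
const pending = image.readRGB({ signal: controller.signal });
controller.abort();

try {
  await pending;
} catch (err) {
  // Expected: err.name === 'AbortError'
  // Observed (per this report): "Block is not in the block requests",
  // already logged to the console before this catch even runs
  console.error(err);
}
```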
Has there been any progress on this? This is a pretty basic use case that seems to fail most of the time.
I've been looking at this a little recently as I was working on requesting tiles from a COG.
It seems like there is an issue with the blockCache. I think what's happening is that an id is present in the cache at the start of the `fetch` function, but by the end, when it's time to retrieve the block from the blockCache, it has been pushed out of the cache by more recent arrivals, so occasionally a block is no longer available when it comes time to grab all the relevant blocks from the cache.
https://github.com/geotiffjs/geotiff.js/blob/4de2c3deb2e751809afada6546330462b9406ae6/src/source/blockedsource.js#L141
As an easy fix, I bumped the cacheSize up from 100 to 1000 and all my errors went away; however, there is probably a smarter way to manage things than relying on a magic number. A sketch of that workaround is below.
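For reference, the change is just the cache size. This assumes `cacheSize` is forwarded to the `BlockedSource` through the `fromUrl` options; check how your geotiff.js version plumbs it through:

```js
import { fromUrl } from 'geotiff';

// Assumption: cacheSize reaches the BlockedSource via these options.
const tiff = await fromUrl('https://example.com/cog.tif', {
  cacheSize: 1000, // bumped from the default 100 so blocks survive the whole fetch
});
```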
@acthp Have you made any progress on this?
I have code that has been reliable for us in beta. The abort code with reference counting works, and I can limit the cache size, which also works. There are some issues, as follows.
The original code had mechanisms for combining different requests. It would do a setTimeout of zero to collect requests before any are issued, and it would combine consecutive block ranges into larger range requests. I have some code to combine blocks, but I haven't committed it because it adds complexity, and I am unsure of the benefit of doing this. When combining the ranges, some of the requests end up being quite large, which seems to work at cross-purposes to using a block model for the file, e.g. everything stops while some 12M range query is satisfied, rather than using smaller ranges that will update the view more quickly. Also, HTTP connections in the browser are persistent, so I wouldn't expect a large performance difference between sending many small requests vs. a few large ones, so long as it doesn't run over the browser's outstanding-query limit.
Do you know if there is good performance data motivating the block-combining feature, and if it's better than just increasing the block size? Right now I'm using it in beta with a block size of 256k or 512k, with good results.
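For context, the block-combining I mean is merging consecutive block ids into contiguous byte ranges, roughly like this (a hypothetical sketch, not the committed code; `blockSize` and the output shape are illustrative):

```js
// Merge consecutive block ids into contiguous { offset, length } byte ranges.
function combineRanges(blockIds, blockSize) {
  const sorted = [...blockIds].sort((a, b) => a - b);
  const ranges = [];
  for (const id of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && id === last.endBlock + 1) {
      last.endBlock = id; // extend the current run of consecutive blocks
    } else {
      ranges.push({ startBlock: id, endBlock: id });
    }
  }
  return ranges.map(({ startBlock, endBlock }) => ({
    offset: startBlock * blockSize,
    length: (endBlock - startBlock + 1) * blockSize,
  }));
}

// combineRanges([3, 1, 2, 7], 65536)
// => [{ offset: 65536, length: 196608 }, { offset: 458752, length: 65536 }]
```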
I haven't committed the unit tests yet, because the expected test results are different for the block-combining vs. non-block-combining versions.
Also regarding performance, the initial zooms/pans are sometimes slower than we like, and the abort support is not used as much as I would expect. Digging through the code, I suspect this is because it is fetching directory structures for different zoom levels, and those fetches are done without any abort support. Unfortunately, this means that the most critical data (the current view) is the last to load, after the intermediate zoom levels that are no longer in view. So, at the moment I'm reviewing this code to see if it would be easy to add abort support.
My current code is here: https://github.com/ucscXena/geotiff.js/blob/ucsc-beta/src/source/blockedsource.js
It's in a more functional style, because I find it hard to follow large, bespoke state machines. It reads more like what the algorithm does, "do A, do B, do C", rather than the desired behavior being an emergent property of the interaction of the different state variables in the object. It looks really different, though, so I'm not sure it will fit well with the rest of the code. I wrote it with rx, but it might be possible to use a promise library for the high-level operators instead.
Also note there are a couple of off-by-one bugs in the HTTP code:
https://github.com/ucscXena/geotiff.js/commit/715ef8a977a91a390471b3e581870c499a0ec274
Range headers are closed intervals, not half-open ones. My code will fail without this fix.
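Concretely (per RFC 7233, the byte positions in a Range header are inclusive on both ends):

```js
// Requesting `length` bytes starting at `offset`: the end index is inclusive.
const offset = 0;
const length = 1024;
const headers = { Range: `bytes=${offset}-${offset + length - 1}` }; // bytes=0-1023
// The off-by-one bug: `bytes=${offset}-${offset + length}` asks for 1025 bytes.
```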
@acthp I am not familiar with the specifics of why things were done a certain way in this repo, but I imagine that since our use case is the most stressful, whatever you are seeing that works will probably be the best thing to do. My intuition about block size is similar to yours, with the additional observation that HTTP/2 (and HTTP/3) are built for many small requests, so I would expect smaller requests to be better there. We use HTTP/2 for our Avivator demos and found it to be qualitatively faster. I would be happy to review your PR if @constantinius can't, since I have a decent sense of what is going on.
@constantinius Do you have any thoughts here?
I fixed this problem by using GeoTIFFSourceOptions, like this:

```ts
const sourceOptions: GeoTIFFSourceOptions = {
  allowFullFile: true,
};

const source = new GeoTIFF({
  sources: [
    {
      url: url,
    },
  ],
  sourceOptions: sourceOptions,
});
```

It worked for me!
When running `npm run dev` with the latest master, all test files uncommented, and port 8090, some files don't render, and the console contains lots of these errors: