ajnisbet / opentopodata

Open alternative to the Google Elevation API!
https://www.opentopodata.org
MIT License
312 stars 69 forks source link

Use mmap to cache file for subsequent reads. #74

Closed meierbenjamin closed 1 year ago

meierbenjamin commented 1 year ago

By using mmap we can store the elevation data files on a slow network storage and still get good performance for multiple hits on the same file.

ajnisbet commented 1 year ago

Thanks for this PR! Pointing opentopodata at a network filesystem is pretty common, and I'd love to speed that up!

I'm not super familiar with how mmap works, would you mind helping me out with some of the details? Depending on that, it might make sense to put this functionality behind a config flag.

My (apparently incorrect!) understanding of mmap was that it lets you lazily open a file, then only have to read from disk the parts you're actually reading. But rasterio only reads the minimal geotiff blocks that are needed.

meierbenjamin commented 1 year ago

We are using srtm32m which has only small files. The files are geotiffs. In our application we read the elevation profile along a line in 100 points. This internally will read from the same file multiple times. For most cases it is actually 100 reads from the same file.

I didn't dig in too deep but saw that rasterio reads this quite slowly from our CIFS attached network storage. Approximately 12ms per read. For 100 points the API response time was usually about 1.5s. With the mmap in between it went down to 400ms.

I also tried with a cache layer which would just keep the rasterio handle for multiple requests, with no improvement. Then I tried reading the whole file (would obviously not work for lager files) into RAM using io.BytesIO. That worked quite well but the initial read was slow (I think 150ms) while rasterio actually didn't need the whole file.

mmap seems to be limited not by physical memory (RAM size) but rather virtual memory, mainly depending on the OS. For a 64bit OS this should be not an issue. However I do not have large files to test this out.

Anyhow, a configuration flag and/or fallback mechanism would be nice, I agree.

The data is not stored between requests. The mmap should be released after the with block. I think this could further improve subsequent reads, but probably would require a more clever cache to handle multiple (say 10 or so) open files at once.

But I think we are getting in the range where one should probably first consider switching to FastAPI which speeds up every single request significantly. But probably would run into issues with uwsgi.

ajnisbet commented 1 year ago

That makes sense: there must be something rasterio isn't caching that causes a repeated disk read on every f.read call.

I've merged the PR as is, and will move the functionality behind a config flag before merging into master.

Thanks for the PR!

ajnisbet commented 1 year ago

Unfortunately, I've had to remove mmap from the dev branch.

Testing it more thouroughly there are a number of issues:

There are too many issues to make use_mmap a "user-beware" config var, or apply only to geotiffs. It's important to me that opentopodata works with real-life datasets, which are often messy.

I'm disappointed about removing this cause lots of people keep their rasters on networked filesystems, and I have 1st had experience with how painful that can be!

(I also wonder if this is a raster or gdal bug/enhancement: gdal already has an in-memory cache for actual raster block data, it's odd there isn't a cache for geotiff metadata).

ajnisbet commented 1 year ago

This was the closest I got trying to get it to work: f5f9cb8cfff92524eefc08a68739a70b4aa5e392

I would of course accept a PR with all tests passing on both the standard and M1 docker images, plus new tests for each of the virtual file systems linked above! But I suspect it's not worth the effort.