Use mmap to cache file for subsequent reads.

meierbenjamin commented 1 year ago

By using mmap we can store the elevation data files on a slow network storage and still get good performance for multiple hits on the same file.

ajnisbet commented 1 year ago

Thanks for this PR! Pointing opentopodata at a network filesystem is pretty common, and I'd love to speed that up!

I'm not super familiar with how mmap works, would you mind helping me out with some of the details? Depending on that, it might make sense to put this functionality behind a config flag.

My (apparently incorrect!) understanding of mmap was that it lets you lazily open a file, then only have to read from disk the parts you're actually reading. But rasterio only reads the minimal geotiff blocks that are needed.

What sort of speedup did you see on nfs? Are you using geotiffs?
Does mmap load/store the whole file in memory? It's important for opentopodata to be able to read larger-than-memory files.
How is data stored between requests? uwsgi doesn't actively clear memory after a request, but the code doesn't appear to explicity save to global state. Is it stored at the OS level, like the page cache?
uwsgi runs many processes: if subsequent requests for the same file are allocated to different worker processes, will the data still be cached?

meierbenjamin commented 1 year ago

We are using srtm32m which has only small files. The files are geotiffs. In our application we read the elevation profile along a line in 100 points. This internally will read from the same file multiple times. For most cases it is actually 100 reads from the same file.

I didn't dig in too deep but saw that rasterio reads this quite slowly from our CIFS attached network storage. Approximately 12ms per read. For 100 points the API response time was usually about 1.5s. With the mmap in between it went down to 400ms.

I also tried with a cache layer which would just keep the rasterio handle for multiple requests, with no improvement. Then I tried reading the whole file (would obviously not work for lager files) into RAM using io.BytesIO. That worked quite well but the initial read was slow (I think 150ms) while rasterio actually didn't need the whole file.

mmap seems to be limited not by physical memory (RAM size) but rather virtual memory, mainly depending on the OS. For a 64bit OS this should be not an issue. However I do not have large files to test this out.

Anyhow, a configuration flag and/or fallback mechanism would be nice, I agree.

The data is not stored between requests. The mmap should be released after the with block. I think this could further improve subsequent reads, but probably would require a more clever cache to handle multiple (say 10 or so) open files at once.

But I think we are getting in the range where one should probably first consider switching to FastAPI which speeds up every single request significantly. But probably would run into issues with uwsgi.

ajnisbet commented 1 year ago

That makes sense: there must be something rasterio isn't caching that causes a repeated disk read on every f.read call.

I've merged the PR as is, and will move the functionality behind a config flag before merging into master.

Thanks for the PR!

ajnisbet commented 1 year ago

Unfortunately, I've had to remove mmap from the dev branch.

Testing it more thouroughly there are a number of issues:

Rasterio sometimes needs the filename to determine which driver to use to open the file (some geospatial file formats don't include headers / magic numbers).
GDAL sometimes needs the filename to determine the spatial extent of the raster (not all geospatial file formats store this information in the file)
GDAL supports a number of "magic" filename prefixes which instruct GDAL to do things like look inside zip files and request urls (https://gdal.org/user/virtual_file_systems.html)
Random intermittent SEGFAULTs.

There are too many issues to make use_mmap a "user-beware" config var, or apply only to geotiffs. It's important to me that opentopodata works with real-life datasets, which are often messy.

I'm disappointed about removing this cause lots of people keep their rasters on networked filesystems, and I have 1st had experience with how painful that can be!

(I also wonder if this is a raster or gdal bug/enhancement: gdal already has an in-memory cache for actual raster block data, it's odd there isn't a cache for geotiff metadata).

ajnisbet commented 1 year ago

This was the closest I got trying to get it to work: f5f9cb8cfff92524eefc08a68739a70b4aa5e392

I would of course accept a PR with all tests passing on both the standard and M1 docker images, plus new tests for each of the virtual file systems linked above! But I suspect it's not worth the effort.

ajnisbet / opentopodata

Use mmap to cache file for subsequent reads. #74