digitalearthpacific / dep-tools

Processing tools for Digital Earth Pacific
MIT License

Figure out multiband writes #44

Open jessjaco opened 8 months ago

jessjaco commented 8 months ago

Since the popular approach is to write only single-band tiffs, datasets built from the same source data end up running the dask graph once per band. Threaded writes don't resolve this. Loading into memory before calling the write function works but is wasteful (since writing the corresponding multiband tiff to memory compresses the data, while the loaded copy is uncompressed). One option might be to write the multiband tiff to memory, then parse out the bands. Another is to look in the stac guidance for a possible way to write multiband assets.

jessjaco commented 8 months ago

See https://github.com/gjoseph92/stackstac/issues/62

jessjaco commented 8 months ago

From that link: 1) you can write multiband stac items, but 2) neither stackstac nor odc.stac.load can read them (I tested odc.stac.load)

One workaround is to create a vrt for each band, but there is a very good point there that GeoTiffs are pixel-interleaved by default, though they can be interleaved by band. However, while the gdal GeoTiff driver supports band interleaving, the gdal COG driver doesn't. This is even more confusing given that the COG standard itself supports BSQ layout. So ultimately, while the vrt approach may work, there are performance considerations on read (as an aside, this may be why some operations on the multiband tide data are so memory intensive). (Though also consider writing a COG using the GeoTiff driver, as we used to do.)
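The interleaving point can be illustrated without GDAL: in a band-sequential (BSQ / `INTERLEAVE=BAND`) layout one band is a contiguous slab, whereas in the pixel-interleaved default a single band is a strided view spanning the whole buffer, which is roughly why single-band reads from pixel-interleaved files touch so much data. A numpy sketch (shapes are illustrative only):

```python
import numpy as np

bands, h, w = 4, 512, 512

# Band-sequential layout: one band is a contiguous block of memory
bsq = np.zeros((bands, h, w), dtype=np.uint16)
assert bsq[0].flags["C_CONTIGUOUS"]

# Pixel-interleaved layout: the samples for one pixel sit together,
# so a single band is a strided, non-contiguous view
bip = np.zeros((h, w, bands), dtype=np.uint16)
assert not bip[..., 0].flags["C_CONTIGUOUS"]

# the stride between successive samples of one band skips over
# the samples of every other band
assert bip[..., 0].strides[-1] == bands * bip.itemsize
```

A reader pulling one band from a pixel-interleaved file therefore decompresses blocks containing all bands, even when only one is wanted.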

jessjaco commented 8 months ago

I think the simplest (probably) workable solution is to offer the option to load a dataset right before the write, when the values have (in most cases) been scaled to their minimal representation. Not dissimilar to what Alex was doing in the PR I refused last week. These shouldn't be that large for the grid size we're dealing with. The only frustration is that we will then have two versions in memory at once: one uncompressed as an xarray, and one compressed as a blob.

This precludes us from the ultimate goal of never having a whole dataset in memory at once, but that hasn't yet been possible anyway (unless we use the dask writer to s3 from odc, which we haven't done)
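The "load right before write" option might look like the sketch below — dask-only stand-ins, with `write_band` as a hypothetical placeholder for the real single-band write function:

```python
import numpy as np
import dask
import dask.array as da

# stand-ins for lazily computed, already-scaled bands of one dataset
lazy_bands = {
    "red": da.ones((4, 4), chunks=(2, 2)),
    "nir": da.full((4, 4), 2.0, chunks=(2, 2)),
}

# Load right before write: a single compute materialises all bands
# together, so any shared portion of the dask graph runs only once
names = list(lazy_bands)
loaded = dict(zip(names, dask.compute(*(lazy_bands[n] for n in names))))

def write_band(name, array):
    # hypothetical placeholder for the real single-band tiff write;
    # returns the uncompressed size to make the memory cost visible
    return array.nbytes

written = {n: write_band(n, a) for n, a in loaded.items()}
assert all(isinstance(a, np.ndarray) for a in loaded.values())
```

The trade-off described above is visible here: between `dask.compute` and the writes, the uncompressed arrays and any compressed output blobs coexist in memory.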

alexgleith commented 7 months ago

> unless we use the dask writer to s3 from odc

I have some hesitations about that writer. It's not using GDAL at all, and I worry about the maintenance of it. I also had an issue when trying to use it, but that might have just been my environment.

> the simplest (probably) workable solution is to offer the option to load a dataset right before write

I've been having errors when not loading data into memory before writing, possibly only with big dask graphs. Doing the load before writing has proven reliable.

jessjaco commented 7 months ago

> > unless we use the dask writer to s3 from odc
>
> I have some hesitations about that writer. It's not using GDAL at all, and I worry about the maintenance of it. I also had an issue when trying to use it, but that might have just been my environment.
>
> > the simplest (probably) workable solution is to offer the option to load a dataset right before write
>
> I've been having errors when not loading data into memory before writing, possibly only with big dask graphs. Doing the load before writing has proven reliable.

I haven't had errors, but if the bands are written to separate files, it will load common source bands (like qa_pixel) multiple times. My guess is that this is part of the issue you were experiencing that led you to implement multithreaded writes.