OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

ENVI sparse datasets #8895

Open chacha21 opened 11 months ago

chacha21 commented 11 months ago

Currently, GDAL supports a "sparse" feature for GTiff (see here). I am interested in seeing such a feature for ENVI datasets, although I understand that it might not fit in the code base at all.

However, I was looking at the documentation of FSCTL_SET_ZERO_DATA for NTFS filesystems. It seems reasonable to call FSCTL_SET_ZERO_DATA manually for the parts of the dataset that I know to be empty, without expecting GDAL to detect them and do so automatically.

There could be different strategies :

Is this realistic as a feature request ?

rouault commented 11 months ago

have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html ? And at Truncate() and GetRangeStatus() in VSIVirtualHandle? They are implemented for the Win32 API. You might just have to implement IGetDataCoverageStatus() in RawRasterBand.

chacha21 commented 11 months ago

have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html ?

This is the link I quoted, but it does not mention Truncate(). I thought it was more about getting information than about actually "sparsing" data.

rouault commented 11 months ago

ENVIDataset::Close() in the bFillFile case should probably use Truncate() (which can actually enlarge a file) rather than Seek(file_size) + Write(single zero). I don't remember whether a file extended with Seek() + Write() is considered sparse. Probably not.
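
For illustration only, the two ways of extending a file can be compared with a minimal Python sketch. It uses the Python standard library rather than GDAL's VSI API, and the helper names `extend_by_truncate` and `extend_by_seek_write` are made up for this example. Whether either file ends up sparse on disk depends on the filesystem, so the sketch only compares logical size and content.

```python
import os
import tempfile


def extend_by_truncate(path, size):
    """Grow the file to `size` bytes by setting its length, without writing data."""
    with open(path, "r+b") as f:
        f.truncate(size)


def extend_by_seek_write(path, size):
    """Grow the file to `size` bytes by writing a single zero as its last byte."""
    with open(path, "r+b") as f:
        f.seek(size - 1)
        f.write(b"\0")


def make_empty_file():
    """Create an empty temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    return path


size = 1024 * 1024
path_a = make_empty_file()
path_b = make_empty_file()
extend_by_truncate(path_a, size)
extend_by_seek_write(path_b, size)

# Both files now have the same logical size.
size_a = os.path.getsize(path_a)
size_b = os.path.getsize(path_b)

# Read both back to compare their content.
with open(path_a, "rb") as f:
    data_a = f.read()
with open(path_b, "rb") as f:
    data_b = f.read()

os.remove(path_a)
os.remove(path_b)
```

On common platforms both files read back as zeros; the on-disk allocation is the part that may differ and would have to be checked per filesystem.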

chacha21 commented 11 months ago

I think we are not talking about the same use case. Let me describe mine in a few words:

  1. I have an existing dataset backed by an ENVI file.
  2. I update it after some processing. I know that some part P will be entirely zeroed.
  3. If I can compute the file range that P maps to, I can invalidate the cache and call FSCTL_SET_ZERO_DATA on that range.
  4. The filesystem handles the "sparsing" entirely, and GDAL does not even have to know. If it reads from that part of the file afterwards, it will get 0s automatically. No need for IGetDataCoverageStatus().
  5. The storage size is now optimized, saving resources.
  6. Go back to step 2: I don't want to close the datacube yet!
  7. This does not prevent writing new data inside P. The filesystem will automatically "cancel" the sparse blocks. Once again, GDAL does not even have to care.

In this scenario, "sparsing" is a manual action; I don't expect GDAL to perform it automatically. I only need a way to get the file range.
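
As a rough sketch of what "getting the file range" could look like outside of GDAL, the following Python computes the byte ranges covered by one band of a pixel window for the three ENVI interleaves. The function names and parameters are hypothetical and not part of GDAL; `header` stands for the ENVI header offset and `itemsize` for the number of bytes per sample.

```python
def pixel_offset(layout, header, samples, lines, bands, itemsize, band, row, col):
    """Byte offset of one pixel in an ENVI raw file (band, row, col are 0-based)."""
    if layout == "bsq":      # band by band, then row by row
        index = (band * lines + row) * samples + col
    elif layout == "bil":    # row by row, each row stored band after band
        index = (row * bands + band) * samples + col
    elif layout == "bip":    # pixel by pixel, all bands of a pixel together
        index = (row * samples + col) * bands + band
    else:
        raise ValueError("unknown interleave: %r" % (layout,))
    return header + index * itemsize


def window_byte_ranges(layout, header, samples, lines, bands, itemsize,
                       band, xoff, yoff, xsize, ysize):
    """Merged (start, length) byte ranges covering one band of a pixel window."""
    geometry = (layout, header, samples, lines, bands, itemsize)
    ranges = []
    for row in range(yoff, yoff + ysize):
        if layout == "bip":
            # Samples of one band are interleaved with other bands: one run per pixel.
            runs = [(pixel_offset(*geometry, band, row, col), itemsize)
                    for col in range(xoff, xoff + xsize)]
        else:
            # BSQ and BIL store the samples of one band row contiguously.
            runs = [(pixel_offset(*geometry, band, row, xoff), xsize * itemsize)]
        for start, length in runs:
            if ranges and ranges[-1][0] + ranges[-1][1] == start:
                # Adjacent to the previous run: extend it instead of adding a new one.
                ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
            else:
                ranges.append((start, length))
    return ranges


# 4 samples x 3 lines x 2 bands, 2 bytes per sample, no header offset.
bsq = window_byte_ranges("bsq", 0, 4, 3, 2, 2, band=1, xoff=0, yoff=0, xsize=4, ysize=3)
bil = window_byte_ranges("bil", 0, 4, 3, 2, 2, band=1, xoff=0, yoff=0, xsize=4, ysize=3)
bip = window_byte_ranges("bip", 0, 4, 3, 2, 2, band=1, xoff=0, yoff=0, xsize=2, ysize=1)
```

With BSQ, a whole band collapses into one contiguous range, while BIL and BIP split the same band into many short runs.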

[edit] I understand that there is no guarantee that the file will remain sparse. If GDAL attempts to write 0s in the file, it will "unsparse" it. However, I don't want GDAL to be clever here. It delegates to me the responsibility of not updating the datacube (even with 0s) where I know it is "empty". The risk is that some cache blocks overlap "empty" parts of the file and prevent optimal sparsing. But for huge datacubes, the gain would already be substantial!

rouault commented 11 months ago

What about having a special behaviour in IRasterIO(GF_Write, ....) that would detect that the provided buffer is fully zero/nodata value and call FSCTL_SET_ZERO_DATA. Perhaps controlled by an open option to allow that ?

chacha21 commented 11 months ago

What about having a special behaviour in IRasterIO(GF_Write, ....) that would detect that the provided buffer is fully zero/nodata value and call FSCTL_SET_ZERO_DATA. Perhaps controlled by an open option to allow that ?

Using options could be an idea :

[edit] And there would be no obligation for GDAL to honour the flag. If for some reason sparsing is not available, 0s would actually be written.

rouault commented 11 months ago
  • or, in RasterIO, a special flag requesting that incoming data should be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant

I don't think we need a special flag. A simple heuristic is to look, for example, at the 4 corners of the buffer + the center pixel. If they are all zero, then the likelihood that the buffer is entirely zero becomes high, and you can do the full check that it actually contains only zeros. That's actually what the GDALBufferHasOnlyNoData() function used by the GTiff driver does to detect whether a tile can be omitted.
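
The idea can be sketched in a few lines of Python. This is only an illustration of the probe-then-scan heuristic described above, not the code of GDALBufferHasOnlyNoData(); the function name `buffer_is_all_zero` is hypothetical, and the buffer is assumed to be a flat, row-major sequence of exactly `width * height` values.

```python
def buffer_is_all_zero(buf, width, height):
    """Return True if a row-major buffer of width*height values is entirely zero."""
    if width == 0 or height == 0:
        return True
    last_row = (height - 1) * width
    probes = (
        0,                                   # top-left corner
        width - 1,                           # top-right corner
        last_row,                            # bottom-left corner
        last_row + width - 1,                # bottom-right corner
        (height // 2) * width + width // 2,  # center pixel
    )
    # Cheap early exit: many non-zero buffers already fail on one of the probes.
    if any(buf[i] != 0 for i in probes):
        return False
    # All probes are zero: confirm with a full scan.
    return not any(buf)


empty = bytearray(4 * 3)
corner_set = bytearray(4 * 3)
corner_set[3] = 1      # top-right corner, caught by the probes
interior_set = bytearray(4 * 3)
interior_set[1] = 1    # not a probe position, caught by the full scan
```

The full scan is only paid for buffers that pass the five probes, which is where the cost of the check would have to be measured.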

chacha21 commented 11 months ago
  • or, in RasterIO, a special flag requesting that incoming data should be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant

I don't think we need a special flag. A simple heuristic is to look, for example, at the 4 corners of the buffer + the center pixel. If they are all zero, then the likelihood that the buffer is entirely zero becomes high, and you can do the full check that it actually contains only zeros. That's actually what the GDALBufferHasOnlyNoData() function used by the GTiff driver does to detect whether a tile can be omitted.

I might have a strong argument against the overhead of auto-detecting zeros, but I first need to check some behaviour of FSCTL_SET_ZERO_DATA.

rouault commented 11 months ago

I might have a strong argument against the overhead of auto-detecting zeros

It would be worth benchmarking the actual cost of GDALBufferHasOnlyNoData()

chacha21 commented 11 months ago

I made some tests with FSCTL_SET_ZERO_DATA, and I want to highlight some behaviour details that I think are arguments against the overhead of an auto check for 0 data in GDAL.

Let's say that "Z" means "using FSCTL_SET_ZERO_DATA". Z at the end of a file is a special case, so let's talk about Z in the middle of a file, used to "punch a hole".

The main thing is that FSCTL_SET_ZERO_DATA needs a minimal atomic size to be efficient. What I mean is :

As a consequence, it means that sparsing efficiency through GDAL would be relevant only for "big contiguous writes" in RasterIO().

However, in the context of ENVI file and BIL/BIP/BSQ layouts, I feel that it will be a problem.

For sparsing to be efficient, we must be sure that :

I am not sure that delegating this analysis to GDAL would be relevant. Sometimes sparsing would occur, sometimes not. Making the user fully responsible for the efficiency of a Z could be less misleading. The user would then be the one to tell whether the data is all zeros, so there is no need for an automatic check. The user will know better anyway, so letting GDAL check once more might be redundant.

[edit] Another piece of information, related to GDAL's NO_DATA: Microsoft claims that "The default data value of a sparse file is zero; however, it can be set to other values" (https://learn.microsoft.com/en-us/windows/win32/fileio/sparse-files). I did not find how to do that, but it might be relevant to know.

rouault commented 11 months ago

I would be rather opposed to having a new method that would explicitly take file offsets. That would feel really awkward in the GDAL Dataset/RasterBand API, which is totally agnostic of such low-level details. Better to let the user do that outside of GDAL then.

If that were done by GDAL, perhaps a CPLErr GDALDataset::Fill(int nXOff, int nYOff, int nXSize, int nYSize, int nBands, const int* panBands, double dfFillReal, double dfFillImag) virtual method whose default implementation would call IRasterIO()? RawDataset could implement it, compute the file offsets and decide whether it must call FSCTL_SET_ZERO_DATA (a corresponding method SetZero() would have to be added to VSIVirtualHandle).
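
To make the two-level idea concrete, here is a toy Python model of it. Nothing in it is GDAL code: the classes `Dataset`, `RawDataset` and `ZeroableFile`, and the methods `fill()`, `write_window()` and `set_zero()`, are hypothetical stand-ins for the proposed Fill(), IRasterIO() and SetZero(). It assumes a single 8-bit band, and `set_zero()` merely writes zeros into an in-memory file where a real implementation would punch a hole.

```python
import io


class Dataset:
    """Hypothetical stand-in for a dataset with a single 8-bit band."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def write_window(self, xoff, yoff, xsize, ysize, data):
        raise NotImplementedError

    def fill(self, xoff, yoff, xsize, ysize, value):
        # Default: build a constant buffer and go through the normal write path.
        self.write_window(xoff, yoff, xsize, ysize, bytes([value]) * (xsize * ysize))


class ZeroableFile(io.BytesIO):
    """In-memory file with a hypothetical set_zero() that records its calls."""

    def __init__(self, initial):
        super().__init__(initial)
        self.zeroed_ranges = []

    def set_zero(self, offset, length):
        self.zeroed_ranges.append((offset, length))
        self.seek(offset)
        self.write(bytes(length))


class RawDataset(Dataset):
    """Raw, row-major dataset that knows how pixels map to file offsets."""

    def __init__(self, handle, width, height):
        super().__init__(width, height)
        self.handle = handle

    def write_window(self, xoff, yoff, xsize, ysize, data):
        for i in range(ysize):
            self.handle.seek((yoff + i) * self.width + xoff)
            self.handle.write(data[i * xsize:(i + 1) * xsize])

    def fill(self, xoff, yoff, xsize, ysize, value):
        full_rows = xoff == 0 and xsize == self.width
        if value == 0 and full_rows and hasattr(self.handle, "set_zero"):
            # Full-width rows form one contiguous byte range in a raw file.
            self.handle.set_zero(yoff * self.width, ysize * self.width)
        else:
            super().fill(xoff, yoff, xsize, ysize, value)


handle = ZeroableFile(b"\x07" * 16)  # 4x4 raster filled with the value 7
ds = RawDataset(handle, 4, 4)
ds.fill(0, 1, 4, 2, 0)  # full rows of zeros: goes through set_zero()
ds.fill(1, 0, 2, 1, 9)  # partial row, non-zero value: falls back to writing
result = handle.getvalue()
```

The caller only ever talks in pixel windows; the offset computation and the decision to zero a range stay inside the raw implementation.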

chacha21 commented 11 months ago

The two-level solution GDALDataset::Fill() as RasterIO() by default and SetZero() if possible sounds good because:

However once again I feel that care must be taken.

Indeed, using Fill() with zeros to sparse the file requires either filling the cache with zeros as well, or dropping the cache (I think that the filesystem will be pretty fast at returning dummy 0s for those sparsed regions, perhaps as fast as a memset()).

chacha21 commented 10 months ago

Here are a few more details about the sparse feature of NTFS:

https://learn.microsoft.com/en-us/answers/questions/1459700/ntfs-sparse-files-range-boundaries

There is indeed a magical, undocumented 64KB boundary for minimal regions to be effectively sparsed. The "non-zero sparsed data" claim is a documentation error; this is not supported.
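
Taking that 64KB figure at face value, a byte range can be trimmed to the part that would be expected to deallocate. The Python sketch below is only arithmetic on offsets; the function name `sparsable_subrange` is hypothetical, and the granularity is the value reported in this thread, not a documented constant.

```python
# Boundary reported in this thread; an assumption, not a documented NTFS constant.
SPARSE_GRANULARITY = 64 * 1024


def sparsable_subrange(offset, length, granularity=SPARSE_GRANULARITY):
    """Return (offset, length) of the largest aligned sub-range, or None if there is none."""
    start = -(-offset // granularity) * granularity         # round the start up
    end = ((offset + length) // granularity) * granularity  # round the end down
    if end <= start:
        return None
    return start, end - start


aligned = sparsable_subrange(0, 65536)
too_small = sparsable_subrange(100, 65536)
trimmed = sparsable_subrange(100, 200000)
```

A range shorter than two boundaries apart may contain no aligned block at all, in which case zeroing it would not be expected to free any storage.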