ENVI sparse datasets - Githubissues

chacha21 commented 11 months ago

Currently, GDAL supports a "sparse" feature for GTiff (see here). I am interested in seeing such a feature for ENVI datasets. I understand that it is something that might not fit at all in the code base.

However, I was looking at the doc of FSCTL_SET_ZERO_DATA for NTFS filesystems. It seems reasonable to manually call FSCTL_SET_ZERO_DATA for parts of the dataset that I know to be empty, without expecting GDAL to detect and do that automatically.

There could be different strategies:

- a dataset might have some declareEmpty(data_range) API, not specifically related to the "NO_DATA" value. Internally, GDAL could call FSCTL_SET_ZERO_DATA if the range can be mapped to a file range (depending on the dataset and data layout). The cache would be informed automatically.
- an ENVI dataset might have something similar to H5Dget_chunk_info to help the user compute the file range matching a dataset data range.

Is this realistic as a feature request?

rouault commented 11 months ago

Have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html, and at Truncate() and GetRangeStatus() in VSIVirtualHandle? They are implemented for the Win32 API. You might just have to implement IGetDataCoverageStatus() in RawRasterBand.

chacha21 commented 11 months ago

> have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html ?

This is the link I quoted, but it does not mention Truncate(). I thought it was more about getting information than about actually "sparsing" data.

rouault commented 11 months ago

ENVIDataset::Close() in the bFillFile case should probably use Truncate() (which can actually enlarge a file) rather than Seek(file_size) + Write(single zero). I don't remember whether a file extended with Seek()+Write() is treated as sparse. Probably not.
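To make the two ways of growing a file concrete, here is a minimal sketch in plain Python rather than GDAL's VSI layer. The function names and the demo() helper are made up for illustration. It only shows that both approaches reach the same logical size; whether either one leaves the file sparse depends on the filesystem and is not checked here.

```python
import os
import tempfile


def grow_with_truncate(path, size):
    """Extend the file to `size` bytes by setting its length directly."""
    with open(path, "r+b") as f:
        f.truncate(size)


def grow_with_seek_write(path, size):
    """Extend the file to `size` bytes by writing one zero byte at the end."""
    with open(path, "r+b") as f:
        f.seek(size - 1)
        f.write(b"\0")


def demo(size=1 << 20):
    """Grow two empty files, one per method, and return their logical sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.img")
        b = os.path.join(tmp, "b.img")
        for path in (a, b):
            with open(path, "wb"):
                pass  # create an empty file
        grow_with_truncate(a, size)
        grow_with_seek_write(b, size)
        return os.path.getsize(a), os.path.getsize(b)
```

Comparing the allocated size on disk (for example with the file properties dialog on Windows) would be the way to tell whether either file actually ended up sparse.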

chacha21 commented 11 months ago

I think we are not talking about the same use case. Let me describe it in a few words:

1. I have an existing dataset backed by an ENVI file.
2. I update it after some processing. I know that some part P will be entirely zeroed.
3. If I can compute the file range that P maps to, I can invalidate the cache and call FSCTL_SET_ZERO_DATA on that range.
4. The filesystem entirely handles the "sparsing" action, and GDAL does not even have to know. If it reads from that part of the file afterwards, it will get 0s automatically. No need for IGetDataCoverageStatus().
5. The storage size is now optimized, saving resources.
6. Go back to step 2: I don't want to close the datacube yet!
7. This does not prevent writing new data inside P. The filesystem will automatically "cancel" the sparse blocks. Once again, GDAL does not even have to care.

In this scenario, "sparsing" is a manual action, I don't expect GDAL to call it automatically. I only need a way to get the file range.

[edit] I understand that there is no guarantee that the file will remain sparse. If GDAL attempts to write 0s in the file, it will "unsparse" it. However, I don't want GDAL to be clever here. It delegates to me the responsibility of not updating the datacube (even with 0) where I know it is "empty". The risk is that some cache blocks overlap "empty" parts of the file and prevent optimal sparsing. But for huge datacubes, the gain would already be substantial!
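As an illustration of "getting the file range", the sketch below computes the byte ranges that a pixel window occupies in a raw ENVI-style file for the three interleaves. It is plain Python, not a GDAL API; the function name and parameters are hypothetical, chosen to mirror the ENVI header fields (samples, lines, bands, header offset, interleave), and it assumes an uncompressed file where every band has the same item size.

```python
def window_byte_ranges(interleave, samples, lines, bands, item_size,
                       xoff, yoff, xsize, ysize, header_offset=0):
    """Return merged (start, end) byte ranges, end exclusive,
    covering the window in every band."""
    runs = []
    row_bytes = xsize * item_size
    if interleave == "bip":
        # All bands of a pixel are adjacent: one run per window row.
        for y in range(yoff, yoff + ysize):
            start = header_offset + (y * samples + xoff) * bands * item_size
            runs.append((start, start + row_bytes * bands))
    elif interleave == "bil":
        # Each line stores one full row per band, band after band.
        for y in range(yoff, yoff + ysize):
            for b in range(bands):
                start = header_offset + ((y * bands + b) * samples + xoff) * item_size
                runs.append((start, start + row_bytes))
    elif interleave == "bsq":
        # Each band is stored as a complete image.
        for b in range(bands):
            for y in range(yoff, yoff + ysize):
                start = header_offset + ((b * lines + y) * samples + xoff) * item_size
                runs.append((start, start + row_bytes))
    else:
        raise ValueError("unknown interleave: %r" % (interleave,))

    # Merge runs that touch or overlap into contiguous ranges.
    runs.sort()
    merged = []
    for start, end in runs:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For a 4x3 image with 2 bands of 2-byte items, the full-width window covering lines 1 and 2 is a single range in BIL and BIP but two separate ranges in BSQ, which is why the layout matters so much for how large a hole can be punched.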

rouault commented 11 months ago

What about having a special behaviour in IRasterIO(GF_Write, ...) that would detect that the provided buffer is entirely zero/nodata and call FSCTL_SET_ZERO_DATA? Perhaps controlled by an open option to allow that?

chacha21 commented 11 months ago

> What about having a special behaviour in IRasterIO(GF_Write, ....) that would detect that the provided buffer is fully zero/nodata value and call FSCTL_SET_ZERO_DATA. Perhaps controlled by an open option to allow that ?

Using options could be an idea:

- as you mention, on open, to let the filesystem apply the FSCTL_SET_SPARSE flag to the file
- in RasterIO, a special flag claiming that all the data to be written should be 0, whatever the content of the source buffer (which could even be null in this case). This allows GDAL to call FSCTL_SET_ZERO_DATA
- or, in RasterIO, a special flag requesting that incoming data be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant

[edit] And there would be no obligation for GDAL to honour the flag. If for some reason sparsing cannot be done, 0s would actually be written.

rouault commented 11 months ago

> or, in RasterIO, a special flag requesting that incoming data should be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant

I don't think we need a special flag. A simple heuristic is to look, for example, at the 4 corners of the buffer + the center pixel. If they are all zero, then the likelihood that the buffer is entirely zero becomes high, and you can do the full check that it actually contains only zeros. That's what the GDALBufferHasOnlyNoData() function used by the GTiff driver does to detect whether a tile can be omitted.
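The heuristic described above can be sketched as follows. This is an illustrative Python version, not the code of GDALBufferHasOnlyNoData(); the function name and parameters are hypothetical, and it only tests for an all-zero bit pattern (so, for example, a floating-point -0.0 or a non-zero nodata value would not be recognised).

```python
def buffer_is_all_zero(buf, width, height, item_size=1):
    """Cheap probe of 4 corners + center, then a full scan only if they pass."""
    if len(buf) != width * height * item_size:
        raise ValueError("buffer size does not match the declared dimensions")
    if width == 0 or height == 0:
        return True

    def pixel_is_zero(x, y):
        start = (y * width + x) * item_size
        return not any(buf[start:start + item_size])

    probes = [
        (0, 0), (width - 1, 0),                    # top corners
        (0, height - 1), (width - 1, height - 1),  # bottom corners
        (width // 2, height // 2),                 # center
    ]
    # Early exit: most non-empty buffers fail one of the five probes.
    if not all(pixel_is_zero(x, y) for x, y in probes):
        return False
    # The probes only make "all zero" likely; the full scan confirms it.
    return not any(buf)
```

The probes never replace the full scan; they only avoid paying for it on buffers that are obviously not empty.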

chacha21 commented 11 months ago

> > or, in RasterIO, a special flag requesting that incoming data should be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant
>
> I don't think we need a special flag. A simple heuristic is to look, for example, at the 4 corners of the buffer + the center pixel. If they are all zero, then the likelihood that the buffer is entirely zero becomes high, and you can do the full check that it actually contains only zeros. That's what the GDALBufferHasOnlyNoData() function used by the GTiff driver does to detect whether a tile can be omitted.

I might have a strong argument against the overhead of auto-detecting zeros, but I first need to check some behaviour of FSCTL_SET_ZERO_DATA.

rouault commented 11 months ago

> I might have a strong argument against the overhead of auto-detecting zeros

It would be worth benchmarking the actual cost of GDALBufferHasOnlyNoData().

chacha21 commented 11 months ago

I made some tests with FSCTL_SET_ZERO_DATA, and I want to highlight some behaviour details that I think are arguments against the overhead of an auto check for 0 data in GDAL.

Let's say that "Z" means "using FSCTL_SET_ZERO_DATA". Z at the end of a file is a special case, so let's talk about Z in the middle of a file, to "punch a hole".

The main thing is that FSCTL_SET_ZERO_DATA needs a minimal atomic size to be effective. What I mean is:

- if you Z less than 64KB, sparsing does not seem to occur (at least not immediately, even after flushing or closing the file)
- if you Z 32KB and then the next 32KB, sparsing does not seem to occur either (at least not immediately, even after flushing or closing the file)

I am surprised: I expected 4K, which is my default NTFS cluster size. I also expect that the writes must be "aligned", but I did not test for the minimal alignment boundary. Effective sparsing for "incremental small Z" may occur afterwards through a filesystem maintenance "compacting" operation, but that is not relevant here.

Consequently, sparsing through GDAL would be effective only for "big contiguous writes" in RasterIO().
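A caller who wants to punch only holes that have a chance of being effective could shrink each candidate byte range to the whole 64KB units it contains. The sketch below does that in Python. The 64KB granularity comes from the observation above; treating it as an alignment requirement is an assumption that was not tested here, and the function name is hypothetical.

```python
GRANULARITY = 64 * 1024  # observed minimal size; alignment is an untested assumption


def aligned_interior(start, end, granularity=GRANULARITY):
    """Return the largest (start, end) sub-range, end exclusive, made only of
    whole aligned units, or None if the range contains no complete unit."""
    aligned_start = -(-start // granularity) * granularity  # round up
    aligned_end = (end // granularity) * granularity        # round down
    if aligned_end <= aligned_start:
        return None
    return aligned_start, aligned_end
```

With this, a 32KB range yields None, and a range from byte 1000 to byte 200000 is reduced to two whole units, (65536, 196608). The bytes outside the aligned interior would still have to be written as ordinary zeros.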

However, in the context of ENVI file and BIL/BIP/BSQ layouts, I feel that it will be a problem.

For sparsing to be effective, we must be sure that:

- we are writing to the file, not to the cache
- the destination of the write is contiguous enough in the file

I am not sure that delegating this analysis to GDAL would be relevant. Sometimes sparsing would occur, sometimes not. Making the user fully responsible for the effectiveness of a Z could be less misleading. The user would then be responsible for stating that the data is all zeros, so there is no need to auto-check. And the user will know better anyway, so letting GDAL check once more might be redundant.

[edit] Another piece of information related to GDAL's NO_DATA: Microsoft claims that "The default data value of a sparse file is zero; however, it can be set to other values" (https://learn.microsoft.com/en-us/windows/win32/fileio/sparse-files). I did not find how to do that, but it might be relevant to know.

rouault commented 11 months ago

I would be rather opposed to having a new method that would explicitly take file offsets. That would feel really awkward in the GDAL Dataset/RasterBand API, which is totally agnostic of such low-level details. Better to let the user do that outside of GDAL then.

If that were done by GDAL, perhaps a CPLErr GDALDataset::Fill(int nXOff, int nYOff, int nXSize, int nYSize, int nBands, const int* panBands, double dfFillReal, double dfFillImag) virtual method whose default implementation would call IRasterIO()? RawDataset could implement it, compute the file offsets and decide whether it should call FSCTL_SET_ZERO_DATA (a corresponding SetZero() method would have to be added to VSIVirtualHandle).
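The two-level idea (a generic fill that goes through the ordinary write path, and a raw-file override that zeroes a byte range directly when it can) can be modelled with a toy example. Everything below is hypothetical Python, not the GDAL API: the classes, the set_zero() stand-in and the single-band, one-byte-per-pixel in-memory "file" exist only to show the dispatch.

```python
class Dataset:
    """Toy single-band, 1 byte per pixel dataset (not the GDAL API)."""

    def __init__(self, width, height):
        self.width, self.height = width, height
        self.pixels = bytearray(width * height)
        self.log = []  # records which path each call took

    def write(self, xoff, yoff, xsize, ysize, data):
        """Ordinary write path, row by row."""
        self.log.append("write")
        for row in range(ysize):
            start = (yoff + row) * self.width + xoff
            self.pixels[start:start + xsize] = data[row * xsize:(row + 1) * xsize]

    def fill(self, xoff, yoff, xsize, ysize, value):
        """Default: build a constant buffer and use the ordinary write path."""
        self.write(xoff, yoff, xsize, ysize, bytes([value]) * (xsize * ysize))


class RawDataset(Dataset):
    def set_zero(self, start, length):
        """Stand-in for a file-handle call that would punch a hole."""
        self.log.append("set_zero")
        self.pixels[start:start + length] = bytes(length)

    def fill(self, xoff, yoff, xsize, ysize, value):
        # Full-width rows form one contiguous byte range in this toy layout.
        if value == 0 and xoff == 0 and xsize == self.width:
            self.set_zero(yoff * self.width, ysize * self.width)
        else:
            super().fill(xoff, yoff, xsize, ysize, value)
```

Callers use fill() the same way on both classes; only the raw variant takes the shortcut, and only when the window maps to a single contiguous range.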

chacha21 commented 11 months ago

The two-level solution, GDALDataset::Fill() as RasterIO() by default and SetZero() if possible, sounds good because:

- it is harmless and trivial for datasets not supporting the feature
- it fits pretty well in the GDAL API philosophy

However, once again, I feel that care must be taken.

- As a user unaware of sparsing, I would expect such a dedicated function to be designed for efficiency in the case of uniform filling: using memset() when possible, stack-allocated buffers to craft good-sized writes of other multi-byte patterns... (like OpenCV's cv::Mat operator=(Scalar&) and setTo(Scalar&)). So it's a lot of work here.
- As a user, I would not expect this function to behave differently from RasterIO() regarding the cache... but it would!

Indeed, using Fill() with zeros to sparse the file requires either also filling the cache with zeros or dropping the cache (I think that the filesystem will be pretty fast at returning dummy 0s for those sparsed regions, perhaps as fast as a memset()).

- Dropping the cache is a problem if the Fill() does not exactly match cache blocks. This is likely to occur for BIP layouts, or just for a small spatial region of interest. The user then cannot know in advance whether they will pay the cost of invalidating the cache (flush and future reads in regions not zeroed).
- Filling the cache with zeros can be a waste of time if the user's purpose is really to sparse regions, because it is unlikely that they will have to fetch data (known to be zero) in those regions afterwards.
- The cache should not prevent sparsing from occurring when duplicating the dataset to a new file. The filesystem's automatic "sparse-copy" support should be triggered for file copy, while the cache could hide the fact that some 0 regions are indeed sparse.

chacha21 commented 10 months ago

Here are a few more details about the sparse feature of NTFS:

https://learn.microsoft.com/en-us/answers/questions/1459700/ntfs-sparse-files-range-boundaries

- There is indeed a magical undocumented 64KB boundary for minimal regions to be effectively sparsed.
- The "non zero sparsed data" is a documentation error and is not supported.

OSGeo / gdal

ENVI sparse datasets #8895