Open chacha21 opened 11 months ago
Have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html ? And at Truncate() and GetRangeStatus() in VSIVirtualHandle? They are implemented by the Win32 API. You might just have to implement IGetDataCoverageStatus() in RawRasterBand.
have you looked at https://gdal.org/development/rfc/rfc63_sparse_datasets_improvements.html ?
This is the link I quoted, but it does not mention Truncate(). I thought it was more about getting information than about actually "sparsing" data.
ENVIDataset::Close() in the bFillFile case should probably use Truncate() (which can actually enlarge a file) rather than Seek(file_size) + Write(single zero). I don't remember whether a file grown with Seek()+Write() is considered sparse. Probably not.
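As a rough illustration of the two ways of growing a file mentioned above, here is a minimal Python sketch. It uses Python's own file API as a stand-in for the VSI calls, and the function names are hypothetical. Whether either approach yields a sparse file depends on the operating system and filesystem, so the sketch only shows that both reach the same logical size.

```python
def enlarge_with_truncate(path, size):
    """Grow the file to `size` bytes without writing any payload."""
    with open(path, "r+b") as f:
        f.truncate(size)  # truncate() may also extend a file


def enlarge_with_seek_write(path, size):
    """Grow the file by writing a single zero byte at its last position."""
    with open(path, "r+b") as f:
        f.seek(size - 1)
        f.write(b"\0")
```

Both functions expect an existing file and a size of at least one byte.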
I think we are not talking about the same use case. Let me describe it in a few words:
IGetDataCoverageStatus()
In this scenario, "sparsing" is a manual action, I don't expect GDAL to call it automatically. I only need a way to get the file range.
[edit] I understand that there is no guarantee that the file will remain sparse. If GDAL attempts to write 0s in the file, it will "unsparse" it. However, I don't want GDAL to be clever here. It delegates to me the responsibility of not updating the datacube (even with 0) where I know it is "empty". The risk is that some cache blocks overlap "empty" parts of the file and prevent optimal sparsing. But for huge datacubes, the gain would already be substantial!
What about having a special behaviour in IRasterIO(GF_Write, ....) that would detect that the provided buffer is fully zero/nodata value and call FSCTL_SET_ZERO_DATA. Perhaps controlled by an open option to allow that ?
Using options could be an idea :
- open, to let the filesystem apply the FSCTL_SET_SPARSE flag to the file
[edit] And there would be no obligation for GDAL to honour the flag. If for some reason sparsing cannot be applied, 0s would actually be written.
- or, in RasterIO, a special flag requesting that incoming data should be checked against 0 to allow FSCTL_SET_ZERO_DATA if relevant
I don't think we need a special flag. A simple heuristic is to look, for example, at the 4 corners of the buffer + the center pixel. If they are all zero, then the likelihood that the buffer is entirely zero becomes high and you can do the full check that it actually contains only zeros. That's actually what the GDALBufferHasOnlyNoData() function used by the GTiff driver does to detect whether a tile can be omitted.
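The corner-and-centre heuristic described above can be sketched in a few lines of Python. This is not the GDALBufferHasOnlyNoData() implementation, only an illustration of the idea under simplifying assumptions: one byte per pixel, a row-major buffer of exactly width * height bytes, and zero as the only "empty" value. The function names are hypothetical.

```python
def probably_all_zero(buf, width, height):
    """Cheap pre-check: sample the four corners and the centre pixel."""
    last_row = (height - 1) * width
    samples = (
        0,                                   # top-left
        width - 1,                           # top-right
        last_row,                            # bottom-left
        last_row + width - 1,                # bottom-right
        (height // 2) * width + width // 2,  # centre
    )
    return all(buf[i] == 0 for i in samples)


def is_all_zero(buf, width, height):
    """Full scan, only paid for when the cheap pre-check passes."""
    if not probably_all_zero(buf, width, height):
        return False
    return not any(buf)  # visits every byte
```

The pre-check rejects most non-empty buffers after five reads. The worst case is a buffer that really is all zeros, which costs the full scan.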
I might have a strong argument against the overhead of auto-detecting zeros, but I first need to check some behaviour of FSCTL_SET_ZERO_DATA.
I might have a strong argument against the overhead of auto-detecting zeros
It would be worth benchmarking the actual cost of GDALBufferHasOnlyNoData()
I made some tests with FSCTL_SET_ZERO_DATA, and I want to highlight some behaviour details that I think are arguments against the overhead of an auto check for 0 data in GDAL.
Let's say that "Z" means "using FSCTL_SET_ZERO_DATA". Z at the end of a file is special, so let's talk about Z in the middle of a file, to "punch a hole".
The main thing is that FSCTL_SET_ZERO_DATA needs a minimal atomic size to be efficient. What I mean is :
As a consequence, it means that sparsing efficiency through GDAL would be relevant only for "big contiguous writes" in RasterIO().
However, in the context of ENVI files and the BIL/BIP/BSQ layouts, I feel that it will be a problem.
For sparsing to be efficient, we must be sure that :
I am not sure that delegating this analysis to GDAL would be relevant. Sometimes sparsing would occur, sometimes not. Making the user fully responsible for the efficiency of a Z could be less misleading. The user would then be the one to say whether the data is zeros, so there is no need for an automatic check. And the user will know better anyway, so letting GDAL check once more might be redundant.
[edit] Another piece of information, related to GDAL's NO_DATA: Microsoft claims that "The default data value of a sparse file is zero; however, it can be set to other values" (https://learn.microsoft.com/en-us/windows/win32/fileio/sparse-files). I did not find how to do that, but it might be relevant to know.
I would be rather opposed to having a new method that would take explicitly file offsets. That would feel really awkward in the GDAL Dataset/RasterBand API which is totally agnostic of such low level details. Better let the user do that outside of GDAL then
If that would be done by GDAL, perhaps a CPLErr GDALDataset::Fill(int nXOff, int nYOff, int nXSize, int nYSize, int nBands, const int* panBands, double dfFillReal, double dfFillImag) virtual method whose default implementation would call IRasterIO() ? RawDataset could implement that, compute the file offsets and decide if it must FSCTL_SET_ZERO_DATA (a corresponding method SetZero() would have to be added to VSIVirtualHandle).
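To make the two-level idea concrete, here is a toy Python model. It is not GDAL code: the classes, the one-byte-per-pixel single-band layout, the row-based fill() signature and the 64 KB granularity are all assumptions made for illustration, and set_zero() only records the range that a real SetZero() would pass to the filesystem.

```python
GRANULARITY = 64 * 1024  # assumption: minimal hole size, used here for illustration


class Dataset:
    """Hypothetical generic dataset: fill() is just an ordinary write."""

    def __init__(self, cols, rows):
        self.cols, self.rows = cols, rows
        self.pixels = bytearray(b"\x01" * (cols * rows))  # one byte per pixel
        self.holes = []  # (offset, length) ranges that were "punched"

    def write_rows(self, y0, payload):
        start = y0 * self.cols
        self.pixels[start:start + len(payload)] = payload

    def fill(self, y0, nrows, value=0):
        self.write_rows(y0, bytes([value]) * (nrows * self.cols))


class RawDataset(Dataset):
    """Hypothetical raw dataset: maps rows to file offsets and may punch a hole."""

    def set_zero(self, offset, length):
        # Stand-in for the proposed SetZero() on the file handle: it records
        # the range and zeroes the in-memory copy.
        self.holes.append((offset, length))
        self.pixels[offset:offset + length] = bytes(length)

    def fill(self, y0, nrows, value=0):
        offset, length = y0 * self.cols, nrows * self.cols
        aligned = (length > 0 and offset % GRANULARITY == 0
                   and length % GRANULARITY == 0)
        if value == 0 and aligned:
            self.set_zero(offset, length)
        else:
            super().fill(y0, nrows, value)  # fall back to a real write
```

The caller only ever passes pixel coordinates. The decision to punch a hole, and the offset arithmetic behind it, stay inside the raw dataset.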
The two-level solution, GDALDataset::Fill() as RasterIO() by default and SetZero() if possible, sounds good because:
However once again I feel that care must be taken.
cv::Mat operator=(Scalar&) setTo(Scalar&)). So it's a lot of work here.
Indeed, using Fill() with zeros to sparse the file requires either also filling the cache with zeros or dropping the cache (I think that the filesystem will be pretty fast at returning dummy 0s for those sparsed regions, perhaps as fast as a memset()).
Here are a few more details about the sparse feature of NTFS:
https://learn.microsoft.com/en-us/answers/questions/1459700/ntfs-sparse-files-range-boundaries
- There is indeed a magical undocumented 64KB boundary for minimal regions to be effectively sparsed
- The "non zero sparsed data" is a documentation error and is not supported
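To see how the 64KB boundary interacts with the BSQ/BIL/BIP layouts discussed earlier, the following Python sketch estimates how many bytes of a window could actually be sparsed. It assumes the usual offset formulas for the three interleavings, a window made of full rows of a single band, and that only whole 64KB-aligned chunks inside a contiguous run can be deallocated. All function names are hypothetical.

```python
GRANULARITY = 64 * 1024  # assumption: the 64KB boundary reported above


def contiguous_runs(layout, cols, rows, bands, itemsize, band, y0, nrows, header=0):
    """Yield (offset, length) byte runs holding `nrows` full rows of one band."""
    row_bytes = cols * itemsize
    if layout == "BSQ":    # band after band: the rows of one band are contiguous
        yield header + (band * rows + y0) * row_bytes, nrows * row_bytes
    elif layout == "BIL":  # each line stores every band in turn: one run per row
        for y in range(y0, y0 + nrows):
            yield header + (y * bands + band) * row_bytes, row_bytes
    elif layout == "BIP":  # bands interleaved per pixel: one run per pixel
        for y in range(y0, y0 + nrows):
            for x in range(cols):
                yield header + ((y * cols + x) * bands + band) * itemsize, itemsize
    else:
        raise ValueError(layout)


def merge_adjacent(runs):
    """Merge runs that touch each other (e.g. BIL or BIP with a single band)."""
    current = None
    for offset, length in runs:
        if current is not None and current[0] + current[1] == offset:
            current = (current[0], current[1] + length)
        else:
            if current is not None:
                yield current
            current = (offset, length)
    if current is not None:
        yield current


def aligned_bytes(offset, length, granularity=GRANULARITY):
    """Bytes of [offset, offset + length) covered by whole aligned chunks."""
    start = -(-offset // granularity) * granularity       # round up
    end = (offset + length) // granularity * granularity  # round down
    return max(0, end - start)


def sparsable_bytes(layout, **window):
    runs = merge_adjacent(contiguous_runs(layout, **window))
    return sum(aligned_bytes(offset, length) for offset, length in runs)
```

For a 1024-column, 10-band float32 cube, 100 empty rows of one band form a single 409600-byte run in BSQ, most of which is made of whole aligned chunks. In BIL the same rows are split into 4096-byte runs, and in BIP into 4-byte runs, none of which can contain a whole 64KB chunk.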
Currently, GDAL supports a "sparse" feature for GTiff (see here). I am interested in seeing such a feature for ENVI datasets. I understand that it is something that might not fit at all in the code base.
However, I was looking at the doc of FSCTL_SET_ZERO_DATA for NTFS filesystems. Actually, it seems reasonable to manually call FSCTL_SET_ZERO_DATA for parts of the dataset that I know being empty, without expecting GDAL to detect and do that automatically.
There could be different strategies:
- a dataset might have some declareEmpty(data_range) API not especially related to the "NO_DATA" value. Internally, GDAL could call FSCTL_SET_ZERO_DATA if it can be mapped to a file range (depending on the dataset and data layout). The cache would be automatically informed.
- an ENVI dataset might have something similar to H5Dget_chunk_info to help the user compute the file part matching a dataset data range.
Is this realistic as a feature request ?