Open AraHaan opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-io-compression See info in area-owners.md if you want to be subscribed.
Author: | AraHaan |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.IO.Compression`, `untriaged` |
Milestone: | - |
I have modified the proposal somewhat to add a few things I missed to ensure that TryDecompress will never throw or get exceptions that it would need to catch.
@AraHaan thank you so much for the detailed proposal.
This proposal would partially address both https://github.com/dotnet/runtime/issues/42820 and https://github.com/dotnet/runtime/issues/39327 . I don't think this PR depends on the first one, we can consider that one the "uber issue", because it was attempting to address the same you're addressing, but for all compression stream classes.
I see you added your own ZLib-specific operation status enums. Contrary to BrotliDecoder
/BrotliEncoder
, which reused System.Buffers.OperationStatus
: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Private.CoreLib/src/System/Buffers/OperationStatus.cs#L10
Some feedback on the new enum values:
Ok
and Done
seem redundant. We should simply keep Done
, just like OperationStatus.Done
.DestinationTooSmall
seems fine. It's aligned to OperationStatus.DestinationTooSmall
already.OperationNotCompression
, OperationNotDecompression
, Disposed
, InitError
, ErrorNo
. Shouldn't those throw instead?VersionError
mean?DataError
, StreamEnd
, StreamError
, BufError
, MemError
, NeedDictionary
, could be aligned to OperationStatus.InvalidData
.DataError
could be OperationStatus.NeedMoreData
.What's preventing us from just reusing OperationStatus
?
I'm not sure we need OperationNotCompression, OperationNotDecompression, Disposed, InitError, ErrorNo. Shouldn't those throw instead?
~~We would need these for TryDecompress
/ TryCompress
where it would be expansive to throw and then "catch" the exception within the Try
version.
VersionError is an zlib error code for when the expected zlib version and the runtime zlib version does not match (on the major version I think).~~
I guess OperationStatus could add an generic Error
onto it at the end of it for these above (see edited proposal above).
According to the Manual DataError is for when "inflate detects an error in the zlib compressed data format, which means that either the data is not a zlib stream to begin with, or that the data was corrupted somewhere along the way since it was compressed."
Also in that manual StreamEnd is returned when inflate/deflate is at the end of the zlib/deflate/gzip stream.
https://zlib.net/zlib_how.html is where I get some of the details as well too.
As for BufError, that might fall under NeedMoreData
.
Edit: I have also removed ZlibOperationStatus with a minor change to System.Buffers.OperationStatus.
cc: @carlossanlop I think this might just now be ready for review (hopefully).
cc: @dotnet/area-system-io-compression I think this is ready for review.
Anything else stalling this?
I would like to see some way how many bytes of the original source were read to generate the output, to allow the next decompress to continue right there. Currently the implementation only shows the output buffer results, while the input buffer is just as useful.
This is especially so in case some other information (e.g. not compressed) is written right after the zlib data, such as within Git package files. The current implementation is nice if you have some outer frame handling that handles/hides overreading input for you, but the zlib streaming was designed to do the framing for you.
I think that can be accessed with the LastBytesWritten
property (I think it gets changed on every pass).
However I do agree with another thing as well, perhaps there should be an TotalBytesWritten
as well for the total amount written in the instance of the Encoder/Decoder.
And possibly an Reset
function that can act as an way to reset those 2 get only properties for those who want to be able to reuse it to compress/decompress multiple different sets of input data.
@AraHaan I also need the LastBytesRead for this case. (The NextInIndex/ next_in_index as it is called in most zlib wrappers), to allow continuing right after the input that was processed. Not just NextOut*.
Using streams, we need to scavenge the input buffer for raw processing as GIT packs delimit compressed data with uncompressed headers without specifying the compressed size. See #61405 for a working hack.
Decompress()
in the proposed API would need to return the number of bytes read from the source and that written to the destination, along with the operation status.
public sealed struct OperationResult
{
public readonly OperationStatus Status;
public readonly int Read, Written;
}
Also we have to consider if Read/Written should be int
, uint
, long
or, ulong
for large data sets (since large files can also be compressed that would normally overflow an Int32 (some I think can even overflow an int64).
However the native implementation I believe uses uint's for the sizes (size_t
or something along the lines of that).
So it is something to consider to support files over 2 gigabytes in length as well, however such support is tricky to get correct.
Also the implementation will also require that adler32
be exposed (currently only crc32
is exposed) in System.IO.Compression.Native to work (see https://github.com/AraHaan/runtime/commit/bebf43f63dbc7d9febf363c797b6428a99dd0269 for all the code changes that would need to be made in System.IO.Compression.Native and System.IO.Compression).
I would love to move ZlibEncoder and ZlibDecoder to System.IO.Compression.ZlibEncoder but it would need InternalsVisibleTo
on it to see types in System.IO.Compression
. A con to this would be that it would be another library to maintain in dotnet/runtime.
Both Stream and Span use Int as transfer format. ZLib uses an 32 bit value, which overflows in some well defined cases where third party code depends on.
Crc32 is now also exposed via System.IO.Hashing, so maybe Adler32 belongs there too? (I see some minor sorting issues in the .def files in your current development version. Not sure if sorting adler before crc would be necessary, but it breaks the alphabetic ordering in that region)
crc32 is exported in System.IO.Compression.Native as well, but with my changes I also added adler32 which is also a part of the native zlib impl as well.
I guess an option could be to replace the proposed LastBytesRead/LastBytesWritten
properties on the encoder / decoder with an ZlibResult
struct that is as follows:
public sealed struct ZlibResult
{
public readonly OperationStatus Status { get; } // internal setter.
public readonly int Read { get; } // internal setter.
public readonly int Written { get; } // internal setter.
}
And then the operation function could return an instance of the struct that contains the information they need.
That's what I said 8 days ago.
I am going to work on this a bit more and will consider moving the properties to an ZlibResult
type that is Disposable instead so that way the Encoder and decoder types could be non-disposables..
Alright just got done with modifying the API locally.
@carlossanlop
* What does `VersionError` mean?
if the zlib library version is incompatible with the version assumed by the caller
can also happen whenThis can also occur if the size of the z_stream structure made in the C program using the description in zlib.h does not match the size of the compiled z_stream structure libz.so.1.
from here.
Background and motivation
Currently there is no non-stream based apis to zlib compression. As such I feel like an encoder / decoder implementation is needed similar to the Brotli implementation.
The brotli implementation also uses the encoder / decoder internally in the streams, which can help make the implementations of the zlib based streams (GZipStream, DeflateStream, and ZLibStream) better.
A single ZlibEncoder and ZlibDecoder that takes a class of values (ZlibOptions), where ZlibOptions also has subclasses named DeflateOptions, and GZipOptions where only the window bits are different.
This issue partially addresses:
Implementing this issue should also resolve this one as well:
I currently have a baseline implementation locally of this
(except for the stream changes that would need to be done),and it should be ready by the time .NET 7 goes into an api freeze (until .NET 8's development starts).API Proposal
API Usage
Alternative Designs
I wanted to use the System.Buffers.OperationStatus enum, but I felt it lacked enough members to denote the zlib specific status codes (where they would be used for zlib, deflate, and gzip compression / decompression).Likewise adler32's and the total_out member from zlib are represented as unsigned, I do not know if I should have the encoder / decoder output signed, or leave them unsigned and keep the CLS-Compliancy issues suppressed from it.Risks
Minimal
Changelog
CalculateChecksum
method.bytesWritten
unsigned output parameter from (Try)Compress and (Try)Decompress and moved it to a property that stores the last amount of bytes written by (Try)Compress and (Try)Decompress.uint
toint
.