Velocidex / c-aff4

An AFF4 C++ implementation.
http://docs.aff4.org
Apache License 2.0

Image stream contains a sub-chunk size chunk #98

Open · blschatz opened this issue 5 years ago

blschatz commented 5 years ago

All chunks in the image stream should have an uncompressed size of chunkSize. In some instances libaff4 writes the last chunk without padding the uncompressed data to chunkSize; I suspect the problem is in flush().

This matters because decompressors (e.g. LZ4) are faster when they know the size of the decompressed buffer in advance.
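For illustration, here is roughly what a reader gains when every chunk decompresses to exactly chunkSize (a sketch using the stock LZ4 C API, not libaff4 code; names are made up):

```cpp
#include <lz4.h>

#include <stdexcept>
#include <vector>

// Sketch: when every chunk is known to decompress to exactly chunk_size,
// the output buffer can be allocated up front and validated in one check.
std::vector<char> DecompressChunk(const char* src, int compressed_len,
                                  int chunk_size) {
  std::vector<char> out(chunk_size);
  int n = LZ4_decompress_safe(src, out.data(), compressed_len, chunk_size);
  if (n != chunk_size)
    throw std::runtime_error("short or corrupt chunk");
  return out;
}
```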

scudette commented 5 years ago

This does not make any sense - why would the last chunk need to be padded? It is obviously shorter than chunk size.

blschatz commented 5 years ago

v1 standard @ 3.2 "Storage of uncompressed chunks is supported by the simple principle that if len(chunk) == aff4:chunk_size then it is a stored chunk."

An implication of the above is that storing uncompressed chunks of less than chunkSize is not allowed. But what if I have a last chunk of, say, chunkSize - 512 bytes whose compressed form happens to be exactly chunkSize bytes long?

By the spec, I'm not allowed to store it, so I need to store the compressed version; but a reader, on seeing that length == chunkSize, will think it is stored, per the spec. We have actually struck an edge case along these lines before. It can be worked around by considering the aff4:size of the ImageStream, but that makes for a more complex implementation.
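To make that workaround concrete, a reader ends up doing something like this (a sketch; the helper and its signature are hypothetical):

```cpp
#include <cstdint>

enum class ChunkKind { Stored, Compressed };

// Sketch of the aff4:size workaround: derive the chunk's expected
// uncompressed length from the stream size, then apply the spec's
// "len(chunk) == chunk_size means stored" rule only when a full-size
// chunk is actually expected at this index.
ChunkKind ClassifyChunk(uint64_t chunk_index, uint64_t stream_size,
                        uint64_t chunk_size, uint64_t stored_len) {
  uint64_t remaining = stream_size - chunk_index * chunk_size;
  uint64_t expected = remaining < chunk_size ? remaining : chunk_size;
  if (stored_len == chunk_size && expected == chunk_size)
    return ChunkKind::Stored;
  // A chunk_size-length blob for a short last chunk must be compressed.
  return ChunkKind::Compressed;
}
```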

Obviously the ImageStream abstraction isn't nailed down tightly enough in the spec, given our different interpretations here. I'd argue that it is conceptually simpler, simpler to implement, and more performant to treat the Image Stream strictly as a "compressed chunk stream", rather than as a "compressed chunk stream, except for the last chunk, which isn't a full chunk".

scudette commented 5 years ago

I think we have considered this case during the standardization process. If the compressed size is larger than chunk size, then the code will just store the uncompressed chunk. If the last chunk is smaller than chunk size it will be stored compressed - there is no problem there.
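For concreteness, that writer behaviour amounts to something like this (a sketch with LZ4 for illustration, not the actual libaff4 code):

```cpp
#include <lz4.h>

#include <string>

// Sketch: compress each chunk; if the result is not smaller than
// chunk_size, store the raw bytes instead, so that a stored chunk is
// recognisable by its length alone.
std::string EncodeChunk(const char* data, int chunk_size) {
  std::string out(LZ4_compressBound(chunk_size), '\0');
  int n = LZ4_compress_default(data, &out[0], chunk_size,
                               static_cast<int>(out.size()));
  if (n <= 0 || n >= chunk_size)
    return std::string(data, chunk_size);  // fall back to a stored chunk
  out.resize(n);
  return out;
}
```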

I was never too comfortable with explicitly allowing for this edge case (interleaving compressed and uncompressed chunks in the same image). The benefit is saving a couple of bytes of header at best, but it opens up a whole array of edge cases like this one. Why don't we always store the same type of chunk (always compressed or always uncompressed)?

blschatz commented 5 years ago

I propose updating the spec to the following: "Chunks written to the Image Stream must be of size chunkSize. When writing sub-chunkSize chunks, the remaining space must be padded with 0x00."
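Under that wording, the writer-side fix is tiny (a sketch; the helper is hypothetical):

```cpp
#include <algorithm>
#include <vector>

// Sketch: zero-pad a short final chunk to chunk_size before compression,
// so every chunk has an uncompressed size of exactly chunk_size.
std::vector<char> PadChunk(const char* data, size_t len, size_t chunk_size) {
  std::vector<char> chunk(chunk_size, 0x00);  // pre-filled with 0x00
  std::copy(data, data + len, chunk.begin()); // assumes len <= chunk_size
  return chunk;
}
```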

jtsylve commented 5 years ago

I think this is a good optimization, but I fear it's overly specific and too restrictive to be a requirement. I think the spec could benefit from a section on implementation notes and this might fit well there.

blschatz commented 5 years ago

It isn't an optimisation. The restriction has always been there: "Image streams specify the chunk_size attribute, as the number of image bytes each chunk contains (chunk size defaults to 32 kb)."[1]

Also, at p1 of the spec v1.0: "the Image Stream (a seekable, contiguous sequence of fixed sized data blocks, optionally compressed)." [2]

The above proposed spec change only serves to reinforce this.

I'd reverse the question. Why are we considering relaxing this restriction now?

[1] AFF4 (2019) paper.
[2] AFF4 Standard v1.0.

jtsylve commented 5 years ago

I guess it just seems overly restrictive for little gain. You can easily know the size of the last chunk via modular arithmetic and the aff4:size property. It's not a hill I'm willing to die on, though. Just my $0.02
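For the record, the arithmetic is a one-liner (names illustrative):

```cpp
#include <cstdint>

// Recover the logical size of the last chunk from aff4:size alone.
uint64_t LastChunkSize(uint64_t stream_size, uint64_t chunk_size) {
  uint64_t r = stream_size % chunk_size;
  return r == 0 ? chunk_size : r;  // an exact multiple means a full chunk
}
```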