You can use snappy_unittest for this; it's crude, but it works.
What's the intended use case? For disk-to-disk compression, usually you have
CPU time for something like gzip -1 instead.
Original comment by se...@google.com
on 18 Apr 2011 at 7:12
It would just be useful in the same way that lzop is useful, as a general
pipeline tool. E.g. disk-to-different-disk, process-to-ssh, etc.
Original comment by yaa...@gmail.com
on 22 Apr 2011 at 2:40
Original comment by se...@google.com
on 26 Apr 2011 at 12:56
Attached patch:
1) Adds streaming support, at least for streams created with the current
compressor
2) Creates command line tools =snzip= and =snunzip=. Both work solely with
standard input and output, making them most useful for pipes.
Resulting tool passes basic sanity checks (compress/decompress) and seems to
have acceptable performance. Has the limitation that for files larger than
64K, the reported file size will differ from the actual file size (since the
header must be output before the entire stream is received).
My C++ is rusty to nonexistent, so style/culture fixes are welcome.
Original comment by PavPanch...@gmail.com
on 17 Jun 2011 at 9:24
Attachments:
I made another patch, snzip.dif, which builds snzip.
Its options are similar to those of gzip and bzip2, as follows.
To compress file.tar:
snzip file.tar
The compressed file is named 'file.tar.snz' and the original file is deleted.
Timestamp, mode and permissions are preserved as far as possible.
To compress file.tar and write to standard output:
snzip -c file.tar > file.tar.snz
or
cat file.tar | snzip > file.tar.snz
To uncompress file.tar.snz:
snzip -d file.tar.snz
or
snunzip file.tar.snz
The uncompressed file is named 'file.tar' and the original file is deleted.
Timestamp, mode and permissions are preserved as far as possible.
If the program name includes 'un', as in snunzip, it acts as if '-d' is set.
To uncompress file.tar.snz and write to standard output:
snzip -dc file.tar.snz > file.tar
snunzip -c file.tar.snz > file.tar
snzcat file.tar.snz > file.tar
cat file.tar.snz | snzcat > file.tar
If the program name includes 'cat', as in snzcat, it acts as if '-dc' is set.
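For illustration, the program-name rule can be sketched as below. This is not the code from the patch; the `Mode` struct and `mode_from_program_name` are hypothetical names, and the choice to look only at the basename of argv[0] is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical mode flags; the names are illustrative, not taken from snzip.
struct Mode {
    bool decompress;  // behave as if -d was given
    bool to_stdout;   // behave as if -c was given
};

// Derive the default mode from argv[0]: a name containing "cat" implies -dc,
// a name containing "un" implies -d, anything else means plain compression.
Mode mode_from_program_name(const std::string& argv0) {
    assert(!argv0.empty());
    // Assumption: only the basename counts, so directory names are ignored.
    const std::string::size_type slash = argv0.find_last_of('/');
    const std::string name =
        (slash == std::string::npos) ? argv0 : argv0.substr(slash + 1);
    if (name.find("cat") != std::string::npos) return {true, true};
    if (name.find("un") != std::string::npos) return {true, false};
    return {false, false};
}
```

Explicit -d or -c options would then be applied on top of this default.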
It has been tested on Linux and will work on other Unix-like OSs.
As for Windows, it needs a getopt(3) compatible function, which is found in
many places as a public domain function.
Original comment by kubo.tak...@gmail.com
on 31 Jul 2011 at 12:12
Attachments:
Sorry, I failed to attach a correct file.
I attached a new one.
Original comment by kubo.tak...@gmail.com
on 31 Jul 2011 at 12:16
Attachments:
kubo your patch seems to work well; i did have to make one change for missing
'PACKAGE_STRING' and it was not being compiled correctly by default when i do
'make snzip', but the utility is exactly what i was looking for. I've also
added a -v to print out the version 1.0.3
Original comment by jehiah
on 12 Aug 2011 at 7:52
I made a new patch to support mingw32 and cygwin.
> kubo your patch seems to work well; i did have to make one change for missing
'PACKAGE_STRING' and it was not being compiled correctly by default when i do
'make snzip', but the utility is exactly what i was looking for. I've also
added a -v to print out the version 1.0.3
The missing macro 'PACKAGE_STRING' is defined in config.h by autoconf.
What version of autoconf do you use? I'm using autoconf 2.65.
I also prefer the '-v' option to print out the version. But gzip and bzip2
use it for verbose output option. So I didn't add it.
Original comment by kubo.tak...@gmail.com
on 21 Aug 2011 at 9:37
Attachments:
>> i did have to make one change for missing 'PACKAGE_STRING' and it was not
being compiled correctly by default when i do 'make snzip'
Could you provide your changes?
Original comment by and...@inffinity.com
on 22 Sep 2011 at 2:31
guys im a little confused here, shouldnt the download of snappy.h allow you to
simply run this command:
snappy::Compress('/tmp/testfileinput', '/tmp/testfileoutput');
from within your c++ code? just two simple string inputs?
Original comment by mina.mou...@hotmail.com
on 24 Sep 2011 at 5:00
@mina.moussa :
snappy::Compress reads the full input, performs compression, writes the full
output.
imagine you have 5 TB of data to compress ... what do you do? well, you can buy
lots of ram and harddisks to swap to while the compression happens.
or better yet you can write a loop that reads in chunks of the file, runs them
through snappy::Compress and writes each chunk to an output file with a
container format that can later be decompressed by reading in discrete chunks
and decompressing them.
though i haven't played with these command line tools, if they behave properly
they should allow you to avoid having to come up with a container file format
and avoid writing loops for working on small chunks of the input at a time by
enabling streaming of input to the tool which would stream
compressed/decompressed output.
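A rough sketch of the chunked loop described above, under stated assumptions: the container layout (a 4-byte little-endian length before each chunk) is made up for this example, and the codec is passed in as a callback so the sketch does not depend on any library. `Codec` and `encode_stream` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Codec callback: takes one raw chunk and returns its encoded form.
using Codec = std::function<std::string(const std::string&)>;

// Read `in` in chunks of at most `chunk_size` bytes, encode each chunk, and
// write it as [4-byte little-endian length][payload]. Returns the chunk count.
// The length-prefixed layout is an assumption for illustration only.
std::size_t encode_stream(std::istream& in, std::ostream& out,
                          std::size_t chunk_size, const Codec& encode) {
    assert(chunk_size > 0);
    std::vector<char> buf(chunk_size);
    std::size_t chunks = 0;
    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;  // nothing left to read
        const std::string payload =
            encode(std::string(buf.data(), static_cast<std::size_t>(got)));
        const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
        const char len[4] = {static_cast<char>(n & 0xff),
                             static_cast<char>((n >> 8) & 0xff),
                             static_cast<char>((n >> 16) & 0xff),
                             static_cast<char>((n >> 24) & 0xff)};
        out.write(len, 4);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
        ++chunks;
    }
    return chunks;
}
```

Only one chunk is held in memory at a time, which is the point of the loop. In real use the callback would wrap the library's compress call instead of an identity function.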
Original comment by dwil...@builderadius.com
on 26 Sep 2011 at 2:49
Yes, this is probably what you'd want for a command-line tool supporting pipes:
A simple framing format. For each block, probably the compressed length (the
uncompressed length is already in the format), perhaps some flags (EOF?), and
the CRC32c of the uncompressed data in that block.
Original comment by se...@google.com
on 26 Sep 2011 at 2:54
We have a simple framing format for streaming in the Java port of Snappy:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/SnappyOutputStream.java
Each 32k block is preceded by a 3-byte header, which is a 1-byte flag
indicating if the block is compressed or not, and a 2-byte length of the block.
Our main requirements were speed and the ability to concatenate compressed
files. The gzip format allows concatenation, but the common Java libraries
don't support this. We avoided writing a checksum for simplicity and speed.
The format doesn't currently have a header (magic number), but using a whole
byte for the compressed flag allows adding one later.
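To make the 3-byte header concrete, here is a small sketch. Assumptions: the flag value 1 means compressed and 0 means uncompressed (the thread does not give the values), and the length is big-endian, as later comments say of this format. `BlockHeader`, `encode_header` and `decode_header` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct BlockHeader {
    bool compressed;       // assumption: stored as 1 (compressed) or 0 (raw)
    std::uint16_t length;  // length of the block that follows
};

// Serialize as [flag][length high byte][length low byte].
std::array<std::uint8_t, 3> encode_header(const BlockHeader& h) {
    return {{static_cast<std::uint8_t>(h.compressed ? 1 : 0),
             static_cast<std::uint8_t>(h.length >> 8),
             static_cast<std::uint8_t>(h.length & 0xff)}};
}

// Parse the 3 bytes back; other flag values are left for future use.
BlockHeader decode_header(const std::array<std::uint8_t, 3>& b) {
    assert(b[0] <= 1);
    return {b[0] == 1, static_cast<std::uint16_t>((b[1] << 8) | b[2])};
}
```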
It would be nice to have a standard streaming format and tools. We're going to
try to get the Hadoop project to use this format too (which is our primary use
case).
Original comment by electrum
on 26 Sep 2011 at 5:24
The ability to concatenate is an interesting feature. Something that would
combine this with the ability to detect file format would be the best, though,
so you won't need yet another container format for that.
Not doing checksumming sounds a bit suboptimal; you can do it really cheaply on
modern CPUs (gigabytes per second per core), especially since the data is
already going to be in the L1 cache. Especially with multiple implementations
starting to float around (Java vs. C++ vs. Go), it's easy to get something
subtle going wrong.
Original comment by se...@google.com
on 28 Sep 2011 at 10:28
Steinar, you have a good point about checksums.
We updated the stream format to contain the masked CRC32C of the input data,
providing protection against corruption or a buggy implementation. We also
added a file header "snappy\0", which happens to be the same size (7 bytes) as
the block header. The file header may precede any block header one or more
times, thus supporting concatenation including "empty" files (that contain only
the file header).
See the SnappyOutputStream link above for the formal description. Does this
format sound reasonable to standardize?
Original comment by electrum
on 30 Sep 2011 at 8:32
OK, this starts to sound pretty good to me -- I should probably get somebody
else in here to look at it as well, but it starts to become reasonable.
Some questions (mostly nits):
- What do you need the uncompressed/compressed flag for? In what situations would you want to store the data uncompressed?
- Is the length 16-bit signed or unsigned? Why is it 32768 and not 32767 or 65535?
- Should the lengths really be stored big-endian, when all other numbers in Snappy are stored little-endian?
- Can you verify that the CRC32c polynomial you're using is compatible with what the SSE4 CRC32 instruction computes? It sounds reasonable that if we're defining a new format, an implementation in native code should be able to make use of that instruction.
Thanks!
Original comment by se...@google.com
on 3 Oct 2011 at 9:50
Some drive-by comments:
For the uncompressed/compressed flag, leveldb's tables uses snappy, but if the
compression doesn't save more than 12.5% of the bytes, then the block is left
uncompressed on disk:
http://code.google.com/p/leveldb/source/browse/table/table_builder.cc#147
For checksums, it looks like github.com/dain is using the same CRC32c-based
checksum as leveldb:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
http://code.google.com/p/leveldb/source/browse/util/crc32c.h#28
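The 12.5% rule mentioned above amounts to a one-line test. This is a sketch of the rule as described in this comment, not a copy of the leveldb source; `worth_compressing` is a hypothetical name.

```cpp
#include <cstddef>

// True when the compressed form saves more than 12.5% (one eighth) of the
// raw size, i.e. when the block is worth storing compressed.
constexpr bool worth_compressing(std::size_t raw_size,
                                 std::size_t compressed_size) {
    return compressed_size < raw_size - raw_size / 8;
}

// For a 32768-byte block the threshold is 32768 - 4096 = 28672 bytes.
static_assert(worth_compressing(32768, 28671), "saves just over 12.5%");
static_assert(!worth_compressing(32768, 28672), "saves exactly 12.5%");
```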
Original comment by nigel.ta...@gmail.com
on 3 Oct 2011 at 10:32
Here is a concrete proposal. It is possibly too complicated, but it does let a
.snappy file start with a 7-byte magic header, and also allows concatenating
multiple .snappy files together.
The byte stream is a series of frames, and each frame has a header and a body.
The header is always 7 bytes. The body has variable length, in the range [0,
65535].
The first header byte is flags:
- bit 0 is comment,
- bit 1 is compressed,
- bit 2 is meta,
- bits 3-7 are unused.
The comment bit means that the rest of the header is ignored (including any
other flag bits), and the body has zero length. Thus, "sNaPpY\x00" is a valid
comment header, since 's' is 0x73.
For non-comment headers, the remaining 6 bytes form a uint16 followed by a
uint32, both little-endian. The uint16 is the body length. The uint32 is a
CRC32c checksum, the same as used by leveldb. This differs from the Java code
linked to above in that it's little-endian (like the rest of Snappy), and the
maximum body length is 65535, not 32768.
The compressed bit means that the body is Snappy-compressed, and that the body
length and checksum refer to the compressed bytes. If the bit is off, the body
is uncompressed, and the body length and checksum refer to the uncompressed
bytes. Each frame's compression is independent of any other frame.
The meta bit means that the body is metadata, and not part of the data stream.
This is a file format extension mechanism, but there are no recognized
extensions at this time.
A conforming decoder can simply skip every frame with the comment or meta bits
set.
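A decoder's view of this 7-byte header could look roughly like the sketch below. It follows the bit assignments and little-endian fields given above; `FrameHeader`, `parse_header` and `is_stream_data` are hypothetical names, not taken from any implementation.

```cpp
#include <cassert>
#include <cstdint>

struct FrameHeader {
    bool comment;
    bool compressed;
    bool meta;
    std::uint16_t body_len;  // meaningful only when !comment
    std::uint32_t checksum;  // meaningful only when !comment
};

// Parse a 7-byte header: flags, uint16 length, uint32 checksum (little-endian).
FrameHeader parse_header(const unsigned char* h) {
    assert(h != nullptr);
    FrameHeader f{};
    f.comment = (h[0] & 0x01) != 0;
    if (f.comment) return f;  // rest of the header is ignored, body is empty
    f.compressed = (h[0] & 0x02) != 0;
    f.meta = (h[0] & 0x04) != 0;
    f.body_len = static_cast<std::uint16_t>(h[1] | (h[2] << 8));
    f.checksum = static_cast<std::uint32_t>(h[3]) |
                 (static_cast<std::uint32_t>(h[4]) << 8) |
                 (static_cast<std::uint32_t>(h[5]) << 16) |
                 (static_cast<std::uint32_t>(h[6]) << 24);
    return f;
}

// A conforming decoder skips frames with the comment or meta bit set.
bool is_stream_data(const FrameHeader& f) { return !f.comment && !f.meta; }
```

Since 's' is 0x73 and bit 0 of 0x73 is set, the magic string parses as a comment header and is skipped like any other comment.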
Original comment by nigel.ta...@gmail.com
on 4 Oct 2011 at 11:03
I've written a Go implementation of that proposal at
http://codereview.appspot.com/5167058. It could probably do with a few more
comments, but as it is, it's about 250 lines of code.
I added an additional restriction that both the compressed and uncompressed
lengths of a frame body have to be < 65536, not just the compressed length.
This restriction means that I can allocate all my buffers up front. Thus, once
I've started decoding, I don't need to do any extra mallocs regardless of how
long the stream is, or whether the uncompressed stream data looks like
"AAAAAAAA...".
Original comment by nigel.ta...@gmail.com
on 4 Oct 2011 at 1:15
Answers to Steinar's questions:
Why the uncompressed/compressed flag? As mentioned above by Nigel, for the
same reason that leveldb does it. Because Snappy's goal is speed, and doesn't
compress well compared to slower algorithms like zlib, it makes sense to
sacrifice a little more space for speed. (We chose the same cutoff as leveldb,
12.5%, but the cutoff is independent of the format.)
The 16-bit length is unsigned. Why 32768 and not 65535? Two reasons. First,
it matches Snappy's internal block size. Because Snappy will split larger
blocks, the only potential gain is fewer chunk headers. Second, it is a power
of two. If you use 65535 and compress 64k (65536) bytes of data, then you end
up with two chunks, with the second chunk being only 1 byte.
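The arithmetic behind the second reason can be checked with a small helper (`chunk_sizes` is a hypothetical name, written only for this illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total` bytes into chunks of at most `max_chunk` bytes each and
// return the size of every chunk in order.
std::vector<std::size_t> chunk_sizes(std::size_t total, std::size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<std::size_t> sizes;
    while (total > 0) {
        const std::size_t n = total < max_chunk ? total : max_chunk;
        sizes.push_back(n);
        total -= n;
    }
    return sizes;
}
```

With a limit of 65535, 64k of input gives chunks of 65535 and 1 bytes; with a limit of 32768 it gives two even chunks of 32768.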
Should the length be big endian or little endian? We chose big endian because
that's common for file formats and network protocols. Given that Snappy uses
little endian, I have no objections to changing it.
The CRC32C was chosen specifically to be compatible with the SSE4 instruction.
It's a bug if it's not. The Java implementation uses the CRC32C code from
Hadoop, which we haven't verified extensively, but it matched in cursory checks
against the Python leveldb reader.
Original comment by electrum
on 5 Oct 2011 at 6:03
Nigel, I'm curious why you have selected a bit-flag encoding for the header
byte instead of an enumeration of values. I like bit-flags when most of the
combinations are valid, but in the three flags identified, comment and meta
would not combine well with each other or with compressed. Alternatively, I
propose we use the following explicit enumeration for the header byte:
0x00: uncompressed data
0x01: snappy compressed data
0x73: stream header
If the code is 0x73 (ascii 's') then the frame header block must be exactly
"snappy\0". All other codes are reserved. This large reserved space allows
for easy extension of the file format in the future.
I also suggest we require the stream header at the beginning of the file
instead of making it optional.
Original comment by d...@iq80.com
on 5 Oct 2011 at 6:21
Regarding the checksum, I thought Steinar had a very good point about the
checksum protecting against bad encoders/decoders. Thus, the checksum should
always be of the original data, providing end-to-end protection for the user's
data.
Original comment by electrum
on 5 Oct 2011 at 6:51
I'm happy to go with 32768 instead of 65535, given that kBlockSize == 32768 in
C++.
Reserving all codes other than 0x00, 0x01 and 0x73 could work, but
extension/metadata frames can have bodies, and bodies can also be compressed or
uncompressed. I think it's just as easy to make meta-ness a bitfield bit,
orthogonal to compression being a bitfield bit. An earlier (unpublished) design
also had a meta-continue bit, in case the metadata's body was longer than 65535
bytes, but I decided to leave that out until we actually have metadata to
specify. Since I had three bits, then I figured that comment might as well be a
bit too. Sure, not every bit combination is valid, but a lot of the bits are
orthogonal.
Regarding the magic string, the weird capitalization of "sNaPpY\x00" is
deliberate, to lessen the chance for a false positive. Also, I'm still leaning
towards optional instead of mandatory, but I could be convinced otherwise.
Regarding the checksum, leveldb computes the checksum of the compressed bytes,
not the uncompressed bytes. I don't know the reason for that, but I'm guessing
that it was a deliberate decision. I'll ask.
Original comment by nigel.ta...@gmail.com
on 6 Oct 2011 at 12:51
Actually, it's pretty superficial, but what really bugs me is how the 0x73
sticks out from everything else. What if we made the magic header "\x00sNaPpY",
with the nul byte at the front? Thus:
0x00 stream header - the remaining six bytes must be "sNaPpY".
0x01 uncompressed body
0x02 compressed body
Anything else is reserved.
For any header not starting with 0x00, the remaining six bytes are a
2-byte body length (up to 32768) and a 4-byte checksum (of the body's bytes on
the wire, i.e. compressed).
Original comment by nigel.ta...@gmail.com
on 6 Oct 2011 at 1:57
I forgot to say that I chose to checksum the compressed bytes because the body
for an extension frame may or may not be compressed, but it's unspecified how
that is indicated by the opening header byte, and so a version 1.0 decoder
won't know when to decompress the body, if it needed to checksum the
uncompressed bytes.
Original comment by nigel.ta...@gmail.com
on 6 Oct 2011 at 3:04
What do you mean by version 1.0 decoder? Is there already an established format
that we need to be backwards compatible with?
The two primary reasons I can see for making the checksum on the uncompressed
data are:
1. Wider protection; you'll guard against not only implementation errors, but also bit-flip errors during compression.
2. You can run it in parallel with the compression if you need to.
I'm not sure if a maximum _compressed_ size of 32768 bytes makes sense; that
essentially _forces_ the compressor to add logic to copy the uncompressed data
if it didn't compress, which is an extra copy you don't want. If you really
can't live with 65535 as the limit, I'd suggest MaxCompressedLength(32768) =
38261.
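The 38261 figure can be reproduced as below, assuming the worst-case bound has the form 32 + n + n/6 (my understanding of what the library's MaxCompressedLength computes; treat the formula as an assumption). `max_compressed_length` is a stand-in name, not the library function.

```cpp
#include <cstddef>

// Assumed worst-case bound on the compressed size of n input bytes.
constexpr std::size_t max_compressed_length(std::size_t n) {
    return 32 + n + n / 6;
}

// 32 + 32768 + 5461 = 38261, the figure quoted above.
static_assert(max_compressed_length(32768) == 38261, "matches the quoted value");
```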
Original comment by sgunder...@bigfoot.com
on 6 Oct 2011 at 9:34
There is no established format that we need to be backwards compatible with. By
"version 1.0 decoder", I mean whatever we decide to do here.
What we have been discussing in the last dozen or so comments on this bug is a
format that allows for extensions, but does not define any. If we decide to add
an extension in the future (e.g. speeding up random seeks into a .snappy file),
I'd simply like to ensure that any such extension won't break the decoding
algorithm we decide upon here.
As for 'forcing' the compressor to add copy logic, I don't think it's
problematic. The compression code can't compress directly to a sink because it
needs to precede the data with a header saying how many data bytes to expect,
and you don't know the length until after you've done the compression. Thus, it
needs to compress to a buffer in memory.
The source (uncompressed) bytes are also already buffered in memory.
Compression does not modify the source bytes. Thus, being able to write an
uncompressed frame simply requires being able to choose which buffer to copy to
the sink. I don't think that there's any unnecessary copying.
Original comment by nigel.ta...@gmail.com
on 6 Oct 2011 at 12:04
> You can run it in parallel with the compression if you need to.
Symmetrically, if the checksum is over the _compressed_ bytes, you can run it
in parallel with the decompression.
But my not-based-on-any-experiments expectation is that I/O bandwidth would be
the bottleneck in practice, and the time spent on checksumming would be
relatively insignificant.
Original comment by nigel.ta...@gmail.com
on 6 Oct 2011 at 12:11
Here's another thinking-out-loud comment.
If you don't see the need to send uncompressed frames, and you limit the
compressed body to 40960 bytes (0xA000), you can shorten the header to six
bytes. The first two bytes form a little-endian uint16.
If that uint16 is 65535, the remaining four bytes must be "sNpY", and the body
has zero length. The stream must start with at least one of these frames.
If that uint16 is in [0, 40960], that uint16 is the length of the compressed
body, and the remaining four bytes is the little-endian checksum.
If that uint16 is in [40961, 65534], then this is an extension. The remaining
four bytes (as a little-endian uint32) is the body length: the number of bytes
to skip if the extension is unrecognized. If an extension uses a checksum, that
checksum is given in the frame body instead of the frame header.
Original comment by nigel.ta...@gmail.com
on 7 Oct 2011 at 1:23
> limit the compressed body to 40960 bytes
I forgot to also say that the uncompressed body is at most 32768 bytes.
Original comment by nigel.ta...@gmail.com
on 7 Oct 2011 at 1:34
FYI.
> I forgot to also say that the uncompressed body is at most 32768 bytes.
I tested the uncompression speeds for various uncompressed body sizes (such as
8k, 16k, 32k, 64k, ...) by using snzip.
The best size was 64k on my Linux box. The test data was uncompressed linux
kernel tarball.
The best size will depend on the hardware spec. I guess it depends on the
ratio of I/O bandwidth and CPU speed.
Original comment by kubo.tak...@gmail.com
on 7 Oct 2011 at 3:24
proposal based on yours:
* frame header is six bytes
* uint16 fh0 is first two bytes (little endian)
* uint32 fh1 is next four bytes (little endian)
if PREDICT_FALSE(fh0 & 0xc000) { /* special frame */
    if PREDICT_FALSE(fh0 == 65535) {
        /* stream header; fh1 must contain "sNpY" */
    }
    else {
        /* fh1 is length of frame body [0..2^32)
           fh0 == 65534: frame body is uncompressed data
           fh0 == 65533: frame body offset 0 .. (fh1 - 4) is uncompressed data.
                         frame body offset (fh1 - 4) .. fh1 is crc32c of the data.
           40960 <= fh0 <= 65532: extension */
    }
}
else { /* not a special frame; is a compressed data frame */
    /* compressed_length = fh0
       fh1 contains crc32c of uncompressed version of the data */
}
Original comment by daniel.l...@gmail.com
on 8 Oct 2011 at 12:33
correction:
if PREDICT_FALSE(fh0 >= 0xc000) { /* special frame */
Original comment by daniel.l...@gmail.com
on 8 Oct 2011 at 12:46
arggh, second correction; just can't get that line right:
if PREDICT_FALSE(fh0 >= 0xa000) { /* special frame */
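With the corrected threshold, the dispatch on fh0 could be sketched as follows. `FrameKind` and `classify` are hypothetical names; the ranges are taken from the proposal above.

```cpp
#include <cstdint>

enum class FrameKind {
    Compressed,           // fh0 is the compressed length
    StreamHeader,         // fh0 == 65535
    Uncompressed,         // fh0 == 65534
    UncompressedWithCrc,  // fh0 == 65533
    Extension             // 40960 <= fh0 <= 65532
};

// Classify a frame by the first two header bytes (little-endian uint16).
constexpr FrameKind classify(std::uint16_t fh0) {
    return fh0 < 0xa000   ? FrameKind::Compressed
           : fh0 == 65535 ? FrameKind::StreamHeader
           : fh0 == 65534 ? FrameKind::Uncompressed
           : fh0 == 65533 ? FrameKind::UncompressedWithCrc
                          : FrameKind::Extension;
}

static_assert(classify(0x9fff) == FrameKind::Compressed, "just below 0xa000");
static_assert(classify(0xa000) == FrameKind::Extension, "40960 is special");
```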
Original comment by daniel.l...@gmail.com
on 8 Oct 2011 at 12:56
I have released snzip.
https://github.com/kubo/snzip
This is basically the same as the snzip posted in comment 8.
The difference is that it is written in C, not C++.
I'm not attached to the current snzip format.
I released it in order to test various formats.
Original comment by kubo.tak...@gmail.com
on 8 Oct 2011 at 8:24
snzip was updated to support file formats of snappy-java and Dain's snappy in
java.
https://github.com/kubo/snzip
I may add the stream format proposed at comment 32 after a week.
Original comment by kubo.tak...@gmail.com
on 9 Oct 2011 at 9:09
I still prefer the proposal in comment #24 over the ones in #29 or #32. I still
don't see the problem in allowing uncompressed frame bodies, and in #32, I
don't like how the checksum for uncompressed frames is in different places (if
present at all) compared to the checksum for compressed frames.
For me, #24 is still the most regular: 1 byte flags, 2 byte length, 4 byte
checksum.
Original comment by nigel.ta...@gmail.com
on 10 Oct 2011 at 6:39
I have stopped adding the format proposed in comment #32 to snzip.
I'll make another proposal. It is simple and extensible.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.
= Overview
A stream is a sequence of frames. A frame consists of frame type (1
byte), data length (2 bytes, little endian) and data. Frame types are
defined as follows:
frame type:
0x00 stream header. The data MUST be "snappy".
0x01 compressed data frame
0x02 uncompressed data frame
0x03 end of stream
0x04 - 0x3F reserved
0x40 - 0x7F implementation-specific types
0x80 implementation name, such as "snzip 0.0.3."
0x81 comment
0x82 checksum (CRC-32)
0x83 checksum (CRC-32C)
0x84 - 0xBF reserved
0xC0 - 0xFF implementation-specific types
An implementation MUST support 0x00 - 0x03, SHOULD support 0x82 and MAY
support other types. If it finds an unsupported type, it SHOULD stop
reading the stream when "(type & 0x80) == 0" and SHOULD ignore it when
"(type & 0x80) != 0".
The data length before compression MUST be less than or equal to 32k.
= Simple case
A stream MUST start with a stream header frame "\x00\x06\x00snappy" and
end with an end-of-stream frame "\x03\x00\x00". A simple stream without
checksum is:
Example 1: simple stream without checksum
Frame 1: "\x00\x06\x00snappy" (stream header)
Frame 2: compressed or uncompressed data frame
....
Frame 100: compressed or uncompressed data frame
Frame 101: "\x03\x00\x00" (end-of-stream)
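As a sketch of how the frames above could be produced, and of the rule for unsupported types, under the layout given in the overview (1-byte type, 2-byte little-endian length, data). `make_frame` and `may_ignore_unknown` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build one frame: type byte, 2-byte little-endian data length, then data.
std::string make_frame(std::uint8_t type, const std::string& data) {
    assert(data.size() <= 0xffff);  // the length field is only 16 bits wide
    std::string f;
    f.push_back(static_cast<char>(type));
    f.push_back(static_cast<char>(data.size() & 0xff));
    f.push_back(static_cast<char>((data.size() >> 8) & 0xff));
    f += data;
    return f;
}

// Unsupported types with the high bit set are ignorable; for the others
// the reader should stop reading the stream.
bool may_ignore_unknown(std::uint8_t type) { return (type & 0x80) != 0; }
```

Here `make_frame(0x00, "snappy")` yields the 9-byte stream header and `make_frame(0x03, "")` the 3-byte end-of-stream frame.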
= Checksum
I chose CRC-32 for the checksum because CRC-32 is widely used and Java
supports it as a standard class, java.util.zip.CRC32. You can use
CRC-32C as an option if you want to use the newer CRC32 instruction in
SSE4.2 for speed. Note that the checksum data may be ignored if a
reader doesn't support it.
To use checksum (CRC-32), a checksum start frame "\x82\x00\x00" SHOULD
be just after the stream header frame and a checksum data frame
"\x82\x04\x00" + (little endian 4-byte data) SHOULD be just before the
end-of-stream frame.
The checksum data MUST be calculated from all bytes in frames between
the previous nearest checksum start frame and a frame just before the
checksum data frame.
Example 2: A checksum for a stream
Frame 1: "\x00\x06\x00snappy" (stream header)
Frame 2: "\x82\x00\x00" (checksum start)
Frame 3: compressed or uncompressed data frame
....
Frame 101: compressed or uncompressed data frame
Frame 102: "\x82\x04\x00" + checksum of Frame 2 to Frame 101
Frame 103: "\x03\x00\x00" (end-of-stream)
If a reader implementation supports the checksum type, it MUST start
checksum from the first checksum start frame, update the checksum
value for each frame and compare it with the value in a checksum data
frame.
Any number of checksum data frames MAY be inserted after a checksum
start frame. Any number of checksum start frames MAY be inserted
anywhere.
If a second or succeeding checksum start frame is found, the checksum
value MUST be reset.
Example 3: checksum for each frame
Frame 1: "\x00\x06\x00snappy" (stream header)
Frame 2: "\x82\x00\x00" (checksum start)
Frame 3: compressed or uncompressed data frame
Frame 4: "\x82\x04\x00" + checksum of Frame 2 and Frame 3
Frame 5: compressed or uncompressed data frame
Frame 6: "\x82\x04\x00" + checksum of Frame 2 to Frame 5
....
Frame 101: compressed or uncompressed data frame
Frame 102: "\x82\x04\x00" + checksum of Frame 2 to Frame 101
Frame 103: "\x03\x00\x00" (end-of-stream)
Example 4: Reset checksum value after each checksum data
Frame 1: "\x00\x06\x00snappy" (stream header)
Frame 2: "\x82\x00\x00" (checksum start)
Frame 3: compressed or uncompressed data frame
Frame 4: "\x82\x04\x00" + checksum of Frame 2 and Frame 3
Frame 5: "\x82\x00\x00" (checksum start)
Frame 6: compressed or uncompressed data frame
Frame 7: "\x82\x04\x00" + checksum of Frame 5 and Frame 6
....
Frame 100: "\x82\x00\x00" (checksum start)
Frame 101: compressed or uncompressed data frame
Frame 102: "\x82\x04\x00" + checksum of Frame 100 and Frame 101
Frame 103: "\x03\x00\x00" (end-of-stream)
This checksum scheme is optimized for "Example 2."
= Implementation-specific frame types
0x40 - 0x7F and 0xC0 - 0xFF are freely used by implementations.
If one of them is included in a stream, an implementation name frame
(0x80) SHOULD be in the stream.
The type number SHOULD be between 0x40 and 0x7F if the frame data is
necessary to decode the stream, such as an encryption key.
The type number SHOULD be between 0xC0 and 0xFF if the frame data is
dispensable to decode the stream, such as a timestamp.
Original comment by kubo.tak...@gmail.com
on 18 Oct 2011 at 1:06
I'm not sure if I can agree this would be "simple". For instance, the support
for two different checksums to please a given compressor implementation seems
awfully complex to me.
I'm also not sure if we need to standardize comments or creators or multi-block
checksums separate from the blocks themselves; what's the real-world use case for
this? The two useful real-world use cases I know of currently that need a
framing format like this (outside of Google, where we already have other
solutions in place) are "pipe through SSH" and Hadoop's usage. If we can
make something simple that covers these reasonably efficiently, and keep some
extensibility, that would probably be the best.
I agree, however, that 0x00 for the stream header is the most elegant. So
here's my proposal:
0x00 - header (as in your proposal; must be "\x00\x06\x00snappy")
0x01 - compressed block (max 32768 bytes uncompressed data, max 65531 bytes
compressed data)
0x02 - uncompressed block (max 32768 bytes data)
0x03-0x7f - reserved, fatal errors for 1.0 decoders
0x80-0xff - reserved, skippable by 1.0 decoders
All blocks have a little-endian two-byte length. Compressed and uncompressed
blocks both begin with the CRC32c of the uncompressed data (this is why the
0x01 block is max 65531 and not 65535).
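To show where 65531 comes from, here is a sketch of writing a 0x02 (uncompressed) block under this proposal. Assumptions: the CRC32c is stored little-endian like the length (the proposal does not say), and the checksum is computed elsewhere and passed in. `make_uncompressed_block` and `kMaxCompressedPayload` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// The 16-bit length covers the 4 checksum bytes plus the payload.
constexpr std::size_t kMaxCompressedPayload = 65535 - 4;
static_assert(kMaxCompressedPayload == 65531, "limit for the 0x01 block");

// Layout: type 0x02, 2-byte little-endian length, CRC32c, raw data.
// Assumption: the CRC32c is stored little-endian; it is supplied by the caller.
std::string make_uncompressed_block(const std::string& data,
                                    std::uint32_t crc32c_of_data) {
    assert(data.size() <= 32768);
    const std::size_t len = data.size() + 4;  // checksum + data
    std::string b;
    b.push_back('\x02');
    b.push_back(static_cast<char>(len & 0xff));
    b.push_back(static_cast<char>((len >> 8) & 0xff));
    for (int shift = 0; shift < 32; shift += 8)
        b.push_back(static_cast<char>((crc32c_of_data >> shift) & 0xff));
    b += data;
    return b;
}
```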
There is explicitly no EOF marker, to make concatenation simple.
I think this should cover all the use cases I've seen presented so far, with
the minimal amount of complexity (and it should be very close to what Hadoop
already has implemented, as far as I understand). If snzip wants a block for
its own metadata use (comments, creator, etc.) I'd be happy to allocate 0x80 to
them for further sub-specification, which they can use for whatever they want.
Original comment by se...@google.com
on 18 Oct 2011 at 1:26
I still think sNaPpY is better because it better facilitates something like
Boyer-Moore for efficiently locking onto those envelopes if we were to use this
for high availability streaming projects.
Also, you can peek 2 bytes from a stream (via get, peek, unget) to get a 2-
byte magic number. How distinguishing is \x00\x06 relative to other file
formats? What does file/libmagic say?
Original comment by scholars...@gmail.com
on 19 Oct 2011 at 3:31
The proposal in comment #39 sounds good to me. My one complaint is that I would
change "snappy" to "sNaPpY".
As for a 9-byte magic header, I think it's just as good as PNG's 8-byte magic
header.
Original comment by nigel.ta...@gmail.com
on 19 Oct 2011 at 8:38
I can change to sNaPpY if people want; I don't see the big win, but it's not a
big loss either.
The classical magic number is four bytes long; two is not going to be unique
almost no matter what you do. Unfortunately 0x00 0x06 is reserved as "TTComp
archive data" in magic(5). How about taking 0xff instead of 0x00? That
doesn't seem to match anything, and fits nicely in with "everything 0x80-0xff
is skippable". (0x80 is taken for "8086 relocatable (Microsoft)".)
So:
0x00 - compressed block
0x01 - uncompressed block
0x02-0x7f - reserved, unskippable
0x80-0xfe - reserved, skippable
0xff - header
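In code, a 1.0 decoder's dispatch on the type byte under this numbering might look like the sketch below (`Action` and `action_for` are hypothetical names):

```cpp
#include <cstdint>

enum class Action { DecodeCompressed, CopyUncompressed, Fail, Skip, StreamHeader };

// Map a block type byte to what a 1.0 decoder does with the block.
constexpr Action action_for(std::uint8_t type) {
    return type == 0x00   ? Action::DecodeCompressed
           : type == 0x01 ? Action::CopyUncompressed
           : type == 0xff ? Action::StreamHeader
           : type >= 0x80 ? Action::Skip   // 0x80-0xfe: reserved, skippable
                          : Action::Fail;  // 0x02-0x7f: reserved, unskippable
}

static_assert(action_for(0x7f) == Action::Fail, "last unskippable type");
static_assert(action_for(0x80) == Action::Skip, "first skippable type");
```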
I can write up a semi-formal spec for this and stick it in the archive if
people want.
Original comment by se...@google.com
on 19 Oct 2011 at 10:02
One suggestion to the proposal in comment #42.
We need an EOF marker block.
If a compressed file is accidentally truncated exactly at the end of a
block, we cannot detect the truncation without the EOF marker block.
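A small sketch of the problem, assuming blocks of the form [1-byte type][2-byte little-endian length][body] as in the proposals above. `Scan` and `scan_frames` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Scan { EndsOnBoundary, TruncatedMidFrame };

// Walk the blocks in `s` and report whether the data ends cleanly on a
// block boundary or stops in the middle of a block.
Scan scan_frames(const std::string& s) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        if (s.size() - pos < 3) return Scan::TruncatedMidFrame;
        const std::size_t len = static_cast<std::size_t>(
            static_cast<unsigned char>(s[pos + 1]) |
            (static_cast<unsigned char>(s[pos + 2]) << 8));
        if (s.size() - pos - 3 < len) return Scan::TruncatedMidFrame;
        pos += 3 + len;
        assert(pos <= s.size());  // invariant: never step past the end
    }
    return Scan::EndsOnBoundary;
}
```

A file that has lost its last block entirely scans exactly like a complete one; only a cut in the middle of a block is noticed. That is the case an EOF marker would catch.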
Original comment by kubo.tak...@gmail.com
on 19 Oct 2011 at 1:00
Hi,
We've resolved the EOF issues in a separate mail thread. I've attached my current
draft of the tentative spec.
There may or may not be an official stream compressor in the future, but it
will not be part of the first commit.
Original comment by se...@google.com
on 25 Oct 2011 at 10:51
Attachments:
Though I did say that I would agree with you if the format were designed as a
network protocol, I don't agree with it as a file format.
But anyway, I'll let the issue go. My requirements and yours are
different.
I just want to make sure of one thing.
Does the spec use the CRC-32C checksum defined by RFC 3720 section B.4?
Or does it use masked values, as "Snappy written in pure java"(*1) does?
*1
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
I guess the former because it just says CRC-32C.
Well, one more thing.
What is the standard file extension name?
gzip -> .gz
bzip2 -> .bz2
snappy -> .snappy???
Original comment by kubo.tak...@gmail.com
on 25 Oct 2011 at 12:21
We should find a standard reference for CRC-32C, yes. The iSCSI RFC you linked
to might be the authoritative reference?
We should use masked values, as you say. I'll update.
If people are happy with using a longer-than-three-character extension, .snappy
would be fine by me.
Original comment by se...@google.com
on 25 Oct 2011 at 12:32
Updated with CRC-32C reference and masking. (It is okay to use the same masking
constants as others, right?)
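For reference, a plain software sketch of CRC-32C plus masking. Assumptions: the masking meant here is the leveldb one mentioned earlier in the thread (rotate right by 15 bits, then add 0xa282ead8), and the bit-by-bit loop is chosen for clarity, not speed. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
std::uint32_t crc32c(const unsigned char* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::uint32_t crc = 0xffffffffu;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xffffffffu;
}

// Assumed leveldb-style masking: rotate right by 15 bits, add a constant.
std::uint32_t mask_crc(std::uint32_t crc) {
    return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

// Inverse of mask_crc.
std::uint32_t unmask_crc(std::uint32_t masked) {
    const std::uint32_t rot = masked - 0xa282ead8u;
    return (rot >> 17) | (rot << 15);
}
```

The standard CRC-32C check value for the ASCII string "123456789" is 0xE3069283, which is a quick way to confirm an implementation agrees with the hardware instruction.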
Original comment by se...@google.com
on 25 Oct 2011 at 1:00
Attachments:
Looks good to me.
I also think an EOF frame would be useful for detecting truncated streams (a
problem we are having right now). In the case of a concatenated file, the only
legal frame after an EOF frame would be the stream identifier frame, and the
other way around. This would make the decoder a bit more stateful, but I think
the benefit of detecting truncated streams outweighs this annoyance.
One final thing, I think we should formally agree on the value of the http
Accept-Encoding header. I'd go with just "snappy" here, but don't have a
strong preference.
Original comment by d...@iq80.com
on 26 Oct 2011 at 12:08
I suggest using 0xfe for the EOF marker since it's next to 0xff.
Original comment by electrum
on 26 Oct 2011 at 12:13
Which http Accept-Encoding header? Are people really proposing to snappy-encode
HTTP requests? (Why?)
Original comment by se...@google.com
on 26 Oct 2011 at 1:36
Original issue reported on code.google.com by
nathan.o...@gmail.com
on 18 Apr 2011 at 3:41