blueprintmrk / snappy

Automatically exported from code.google.com/p/snappy

Command line tool #34

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This library would likely be directly useful to a lot more people if a simple 
command line program to compress/decompress from stdin to stdout was included.

Original issue reported on code.google.com by nathan.o...@gmail.com on 18 Apr 2011 at 3:41

GoogleCodeExporter commented 9 years ago
You can use snappy_unittest for this; it's crude, but it works.

What's the intended use case? For disk-to-disk compression, usually you have 
CPU time for something like gzip -1 instead.

Original comment by se...@google.com on 18 Apr 2011 at 7:12

GoogleCodeExporter commented 9 years ago
It would just be useful in the same way that lzop is useful, as a general 
pipeline tool.  E.g. disk-to-different-disk, process-to-ssh, etc.

Original comment by yaa...@gmail.com on 22 Apr 2011 at 2:40

GoogleCodeExporter commented 9 years ago

Original comment by se...@google.com on 26 Apr 2011 at 12:56

GoogleCodeExporter commented 9 years ago
Attached patch:

1) Adds streaming support, at least for streams created with the current 
compressor
2) Creates command line tools =snzip= and =snunzip=.  Both work solely with 
standard input and output, making them most useful for pipes.

Resulting tool passes basic sanity checks (compress/decompress) and seems to 
have acceptable performance.  Has the limitation that for files larger than 
64K, the reported file size will differ from the actual file size (since the 
header must be output before the entire stream is received).

My C++ is rusty to nonexistent, so style/culture fixes are welcome.

Original comment by PavPanch...@gmail.com on 17 Jun 2011 at 9:24

Attachments:

GoogleCodeExporter commented 9 years ago
I made another patch, snzip.dif, which adds snzip.
Its options are similar to those of gzip and bzip2, as follows.

To compress file.tar:
 snzip file.tar

  The compressed file is named 'file.tar.snz' and the original file is deleted.
  The timestamp, mode and permissions are preserved as far as possible.

To compress file.tar and write the result to standard output:
 snzip -c file.tar > file.tar.snz
or
 cat file.tar | snzip > file.tar.snz

To uncompress file.tar.snz:

 snzip -d file.tar.snz
or
 snunzip file.tar.snz

  The uncompressed file is named 'file.tar' and the original file is deleted.
  The timestamp, mode and permissions are preserved as far as possible.

  If the program name includes 'un', as in snunzip, it acts as if '-d' is set.

To uncompress file.tar.snz and write the result to standard output:

 snzip -dc file.tar.snz > file.tar
 snunzip -c file.tar.snz > file.tar
 snzcat file.tar.snz > file.tar
 cat file.tar.snz | snzcat > file.tar

  If the program name includes 'cat', as in snzcat, it acts as if '-dc' is set.

It has been tested on Linux and should work on other Unix-like OSs.
As for Windows, it needs a getopt(3) compatible function, which is found in 
many places as a public domain function.

Original comment by kubo.tak...@gmail.com on 31 Jul 2011 at 12:12

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry, I failed to attach a correct file.
I attached a new one.

Original comment by kubo.tak...@gmail.com on 31 Jul 2011 at 12:16

Attachments:

GoogleCodeExporter commented 9 years ago
kubo your patch seems to work well; i did have to make one change for missing 
'PACKAGE_STRING' and it was not being compiled correctly by default when i do 
'make snzip', but the utility is exactly what i was looking for. I've also 
added a -v to print out the version 1.0.3

Original comment by jehiah on 12 Aug 2011 at 7:52

GoogleCodeExporter commented 9 years ago
I made a new patch to support mingw32 and cygwin.

> kubo your patch seems to work well; i did have to make one change for missing 
'PACKAGE_STRING' and it was not being compiled correctly by default when i do 
'make snzip', but the utility is exactly what i was looking for. I've also 
added a -v to print out the version 1.0.3

The missing macro 'PACKAGE_STRING' is defined in config.h by autoconf.
What version of autoconf do you use? I'm using autoconf 2.65.

I also prefer the '-v' option to print out the version. But gzip and bzip2
use it for verbose output option. So I didn't add it.

Original comment by kubo.tak...@gmail.com on 21 Aug 2011 at 9:37

Attachments:

GoogleCodeExporter commented 9 years ago
>> i did have to make one change for missing 'PACKAGE_STRING' and it was not 
being compiled correctly by default when i do 'make snzip'

Could you provide your changes?

Original comment by and...@inffinity.com on 22 Sep 2011 at 2:31

GoogleCodeExporter commented 9 years ago
Guys, I'm a little confused here. Shouldn't the download of snappy.h allow you to 
simply run this command:

snappy::Compress('/tmp/testfileinput', '/tmp/testfileoutput');

from within your c++ code? just two simple string inputs?

Original comment by mina.mou...@hotmail.com on 24 Sep 2011 at 5:00

GoogleCodeExporter commented 9 years ago
@mina.moussa : 

snappy::Compress reads the full input, performs compression, writes the full 
output.

imagine you have 5 TB of data to compress ... what do you do? well, you can buy 
lots of ram and harddisks to swap to while the compression happens. 

or better yet you can write a loop that reads in chunks of the file, runs them 
through snappy::Compress and writes each chunk to an output file with a 
container format that can later be decompressed by reading in discrete chunks 
and decompressing them. 

though i haven't played with these command line tools, if they behave properly 
they should allow you to avoid having to come up with a container file format 
and avoid writing loops for working on small chunks of the input at a time by 
enabling streaming of input to the tool which would stream 
compressed/decompressed output.
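
The chunk-and-container loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `zlib` from the Python standard library stands in for Snappy (no Snappy binding is assumed to be installed), and the 4-byte little-endian length prefix, the 32 KiB chunk size and the function names are hypothetical, not a format proposed in this thread.

```python
import io
import struct
import zlib  # stand-in compressor; a real tool would call Snappy here

CHUNK_SIZE = 32 * 1024  # assumed chunk size, chosen only for illustration


def compress_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Read src in fixed-size chunks and write length-prefixed compressed chunks."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        body = zlib.compress(chunk)
        # Hypothetical container: 4-byte little-endian body length, then the body.
        dst.write(struct.pack("<I", len(body)))
        dst.write(body)


def decompress_stream(src, dst):
    """Reverse compress_stream, holding only one chunk in memory at a time."""
    while True:
        prefix = src.read(4)
        if not prefix:
            break
        if len(prefix) != 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<I", prefix)
        body = src.read(length)
        if len(body) != length:
            raise ValueError("truncated chunk body")
        dst.write(zlib.decompress(body))


# Round trip through in-memory buffers.
packed = io.BytesIO()
compress_stream(io.BytesIO(b"snappy " * 20000), packed)
packed.seek(0)
restored = io.BytesIO()
decompress_stream(packed, restored)
```

In a pipeline, `sys.stdin.buffer` and `sys.stdout.buffer` could be passed as `src` and `dst`, so memory use stays bounded by the chunk size regardless of the input length.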

Original comment by dwil...@builderadius.com on 26 Sep 2011 at 2:49

GoogleCodeExporter commented 9 years ago
Yes, this is probably what you'd want for a command-line tool supporting pipes: 
A simple framing format. For each block, probably the compressed length (the 
uncompressed length is already in the format), perhaps some flags (EOF?), and 
the CRC32c of the uncompressed data in that block.

Original comment by se...@google.com on 26 Sep 2011 at 2:54

GoogleCodeExporter commented 9 years ago
We have a simple framing format for streaming in the Java port of Snappy:

https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/SnappyOutputStream.java

Each 32k block is preceded by a 3-byte header, which is a 1-byte flag 
indicating if the block is compressed or not, and a 2-byte length of the block.
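
A rough sketch of that 3-byte block header, written in Python rather than Java. The flag values (0 and 1) and the function names are assumptions for illustration; the big-endian length matches what is said later in this thread about the Java format. Only uncompressed blocks are written here so that the sketch needs no Snappy binding.

```python
import struct

BLOCK_SIZE = 32 * 1024
FLAG_UNCOMPRESSED = 0  # assumed values; the real ones are defined by the Java code
FLAG_COMPRESSED = 1


def frame_blocks(data):
    """Split data into 32k blocks, each preceded by flag (1 byte) + length (2 bytes)."""
    out = bytearray()
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        # ">BH" = big-endian: unsigned byte flag, unsigned 16-bit length.
        out += struct.pack(">BH", FLAG_UNCOMPRESSED, len(block))
        out += block
    return bytes(out)


def read_blocks(stream):
    """Yield (flag, block) pairs from a framed byte string."""
    pos = 0
    while pos < len(stream):
        flag, length = struct.unpack_from(">BH", stream, pos)
        pos += 3
        yield flag, stream[pos:pos + length]
        pos += length
```

Because no frame refers to any other frame, `frame_blocks(a) + frame_blocks(b)` reads back as the blocks of `a` followed by the blocks of `b`, which is the concatenation property mentioned below.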

Our main requirements were speed and the ability to concatenate compressed 
files.  The gzip format allows concatenation, but the common Java libraries 
don't support this.  We avoided writing a checksum for simplicity and speed.  
The format doesn't currently have a header (magic number), but using a whole 
byte for the compressed flag allows adding one later.

It would be nice to have a standard streaming format and tools.  We're going to 
try to get the Hadoop project to use this format too (which is our primary use 
case).

Original comment by electrum on 26 Sep 2011 at 5:24

GoogleCodeExporter commented 9 years ago
The ability to concatenate is an interesting feature. Something that would 
combine this with the ability to detect file format would be the best, though, 
so you won't need yet another container format for that.

Not doing checksumming sounds a bit suboptimal; you can do it really cheaply on 
modern CPUs (gigabytes per second per core), especially since the data is 
already going to be in the L1 cache. Especially with multiple implementations 
starting to float around (Java vs. C++ vs. Go), it's easy to get something 
subtle going wrong.

Original comment by se...@google.com on 28 Sep 2011 at 10:28

GoogleCodeExporter commented 9 years ago
Steinar, you have a good point about checksums.

We updated the stream format to contain the masked CRC32C of the input data, 
providing protection against corruption or a buggy implementation.  We also 
added a  file header "snappy\0", which happens to be the same size (7 bytes) as 
the block header.  The file header may precede any block header one or more 
times, thus supporting concatenation including "empty" files (that contain only 
the file header).

See the SnappyOutputStream link above for the formal description.  Does this 
format sound reasonable to standardize?

Original comment by electrum on 30 Sep 2011 at 8:32

GoogleCodeExporter commented 9 years ago
OK, this starts to sound pretty good to me -- I should probably get somebody 
else in here to look at it as well, but it starts to become reasonable.

Some questions (mostly nits):

 - What do you need the uncompressed/compressed flag for? In what situations would you want to store the data uncompressed?
 - Is the length 16-bit signed or unsigned? Why is it 32768 and not 32767 or 65535?
 - Should the lengths really be stored big-endian, when all other numbers in Snappy are stored little-endian?
 - Can you verify that the CRC32c polynomial you're using is compatible with what the SSE4 CRC32 instruction computes? It sounds reasonable that if we're defining a new format, an implementation in native code should be able to make use of that instruction.

Thanks!

Original comment by se...@google.com on 3 Oct 2011 at 9:50

GoogleCodeExporter commented 9 years ago
Some drive-by comments:

For the uncompressed/compressed flag, leveldb's tables use snappy, but if the 
compression doesn't save more than 12.5% of the bytes, then the block is left 
uncompressed on disk:
http://code.google.com/p/leveldb/source/browse/table/table_builder.cc#147

For checksums, it looks like github.com/dain is using the same CRC32c-based 
checksum as leveldb:
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
http://code.google.com/p/leveldb/source/browse/util/crc32c.h#28
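
For readers unfamiliar with the leveldb-style checksum, here is a minimal pure-Python sketch of CRC-32C (Castagnoli polynomial, reflected form) together with the rotate-and-add masking that leveldb's `crc32c.h` applies. The function names are illustrative, and the table-driven loop is a plain reference version, not an optimized or hardware-accelerated one.

```python
CASTAGNOLI_POLY_REFLECTED = 0x82F63B78
MASK_DELTA = 0xA282EAD8  # constant used by leveldb's crc32c masking


def _make_table():
    """Build the 256-entry lookup table for the reflected CRC-32C polynomial."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CASTAGNOLI_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """CRC-32C of a bytes object, with the usual initial and final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def mask(crc):
    """Rotate right by 15 bits and add a constant, modulo 2**32."""
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + MASK_DELTA) & 0xFFFFFFFF


def unmask(masked):
    """Inverse of mask()."""
    rotated = (masked - MASK_DELTA) & 0xFFFFFFFF
    return ((rotated >> 17) | (rotated << 15)) & 0xFFFFFFFF
```

The masking exists so that a checksum stored alongside data that itself embeds CRCs does not degenerate; `unmask(mask(x)) == x` for any 32-bit `x`.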

Original comment by nigel.ta...@gmail.com on 3 Oct 2011 at 10:32

GoogleCodeExporter commented 9 years ago
Here is a concrete proposal. It is possibly too complicated, but it does let a 
.snappy file start with a 7-byte magic header, and also allows concatenating 
multiple .snappy files together.

The byte stream is a series of frames, and each frame has a header and a body. 
The header is always 7 bytes. The body has variable length, in the range [0, 
65535].

The first header byte is flags:
  - bit 0 is comment,
  - bit 1 is compressed,
  - bit 2 is meta,
  - bits 3-7 are unused.

The comment bit means that the rest of the header is ignored (including any 
other flag bits), and the body has zero length. Thus, "sNaPpY\x00" is a valid 
comment header, since 's' is 0x73.

For non-comment headers, the remaining 6 bytes form a uint16 followed by a 
uint32, both little-endian. The uint16 is the body length. The uint32 is a 
CRC32c checksum, the same as used by leveldb. This differs from the Java code 
linked to above in that it's little-endian (like the rest of Snappy), and the 
maximum body length is 65535, not 32768.

The compressed bit means that the body is Snappy-compressed, and that the body 
length and checksum refer to the compressed bytes. If the bit is off, the body 
is uncompressed, and the body length and checksum refer to the uncompressed 
bytes. Each frame's compression is independent of any other frame.

The meta bit means that the body is metadata, and not part of the data stream. 
This is a file format extension mechanism, but there are no recognized 
extensions at this time.

A conforming decoder can simply skip every frame with the comment or meta bits 
set.
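
The 7-byte header from this proposal could be parsed along the following lines. This is a sketch, not the Go implementation mentioned in the next comment; it assumes "bit 0" means the least significant bit (which is consistent with 's' = 0x73 counting as a comment), and the names `Header` and `parse_header` are made up for illustration.

```python
import struct
from collections import namedtuple

HEADER_SIZE = 7
FLAG_COMMENT = 0x01     # bit 0
FLAG_COMPRESSED = 0x02  # bit 1
FLAG_META = 0x04        # bit 2

Header = namedtuple("Header", "comment compressed meta body_length checksum")


def parse_header(raw):
    """Decode one 7-byte frame header into a Header tuple."""
    if len(raw) != HEADER_SIZE:
        raise ValueError("a frame header is exactly 7 bytes")
    flags = raw[0]
    if flags & FLAG_COMMENT:
        # Comment: the other flag bits and the remaining 6 bytes are ignored,
        # and the body has zero length.
        return Header(True, False, False, 0, None)
    # "<HI" = little-endian uint16 body length followed by uint32 checksum.
    body_length, checksum = struct.unpack_from("<HI", raw, 1)
    return Header(False, bool(flags & FLAG_COMPRESSED),
                  bool(flags & FLAG_META), body_length, checksum)


magic = parse_header(b"sNaPpY\x00")  # treated as a comment frame
```

A decoder following the last paragraph of the proposal would skip any frame whose header has `comment` or `meta` set and pass the rest to the decompressor.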

Original comment by nigel.ta...@gmail.com on 4 Oct 2011 at 11:03

GoogleCodeExporter commented 9 years ago
I've written a Go implementation of that proposal at 
http://codereview.appspot.com/5167058. It could probably do with a few more 
comments, but as it is, it's about 250 lines of code.

I added an additional restriction that both the compressed and uncompressed 
lengths of a frame body have to be < 65536, not just the compressed length. 
This restriction means that I can allocate all my buffers up front. Thus, once 
I've started decoding, I don't need to do any extra mallocs regardless of how 
long the stream is, or whether the uncompressed stream data looks like 
"AAAAAAAA...".

Original comment by nigel.ta...@gmail.com on 4 Oct 2011 at 1:15

GoogleCodeExporter commented 9 years ago
Answers to Steinar's questions:

Why the uncompressed/compressed flag?  As mentioned above by Nigel, for the 
same reason that leveldb does it.  Because Snappy's goal is speed, and it doesn't 
compress as well as slower algorithms like zlib, it makes sense to 
sacrifice a little more space for speed.  (We chose the same cutoff as leveldb, 
12.5%, but the cutoff is independent of the format.)

The 16-bit length is unsigned.  Why 32768 and not 65535?  Two reasons.  First, 
it matches Snappy's internal block size.  Because Snappy will split larger 
blocks, the only potential gain is fewer chunk headers.  Second, it is a power 
of two.  If you use 65535 and compress 64k (65536) bytes of data, then you end 
up with two chunks, with the second chunk being only 1 byte.
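
The chunk-count arithmetic in that paragraph can be checked with a few lines; the helper name is illustrative.

```python
def chunk_lengths(total, max_chunk):
    """Lengths of the chunks produced when total bytes are split greedily."""
    full, rest = divmod(total, max_chunk)
    return [max_chunk] * full + ([rest] if rest else [])


with_65535 = chunk_lengths(65536, 65535)  # one full chunk plus a 1-byte tail
with_32768 = chunk_lengths(65536, 32768)  # two equal chunks
```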

Should the length be big endian or little endian?  We chose big endian because 
that's common for file formats and network protocols.  Given that Snappy uses 
little endian, I have no objections to changing it.

The CRC32C was chosen specifically to be compatible with the SSE4 instruction.  
It's a bug if it's not.  The Java implementation uses the CRC32C code from 
Hadoop, which we haven't verified extensively, but it matched in cursory checks 
against the Python leveldb reader.

Original comment by electrum on 5 Oct 2011 at 6:03

GoogleCodeExporter commented 9 years ago
Nigel, I'm curious why you have selected a bit-flag encoding for the header 
byte instead of an enumeration of values.  I like bit-flags when most of the 
combinations are valid, but of the three flags identified, comment and meta 
would not combine well with each other or with compressed.  Alternatively, I 
propose we use the following explicit enumeration for the header byte:

  0x00: uncompressed data
  0x01: snappy compressed data
  0x73: stream header 

If the code is 0x73 (ascii 's') then the frame header block must be exactly 
"snappy\0".  All other codes are reserved.  This large reserved space allows 
for easy extension of the file format in the future.

I also suggest we require the stream header at the beginning of the file 
instead of making it optional.

Original comment by d...@iq80.com on 5 Oct 2011 at 6:21

GoogleCodeExporter commented 9 years ago
Regarding the checksum, I thought Steinar had a very good point about the 
checksum protecting against bad encoders/decoders.  Thus, the checksum should 
always be of the original data, providing end-to-end protection for the user's 
data.

Original comment by electrum on 5 Oct 2011 at 6:51

GoogleCodeExporter commented 9 years ago
I'm happy to go with 32768 instead of 65535, given that kBlockSize == 32768 in 
C++.

Reserving all codes other than 0x00, 0x01 and 0x73 could work, but 
extension/metadata frames can have bodies, and bodies can also be compressed or 
uncompressed. I think it's just as easy to make meta-ness a bitfield bit, 
orthogonal to compression being a bitfield bit. An earlier (unpublished) design 
also had a meta-continue bit, in case the metadata's body was longer than 65535 
bytes, but I decided to leave that out until we actually have metadata to 
specify. Since I had three bits, then I figured that comment might as well be a 
bit too. Sure, not every bit combination is valid, but a lot of the bits are 
orthogonal.

Regarding the magic string, the weird capitalization of "sNaPpY\x00" is 
deliberate, to lessen the chance for a false positive. Also, I'm still leaning 
towards optional instead of mandatory, but I could be convinced otherwise.

Regarding the checksum, leveldb computes the checksum of the compressed bytes, 
not the uncompressed bytes. I don't know the reason for that, but I'm guessing 
that it was a deliberate decision. I'll ask.

Original comment by nigel.ta...@gmail.com on 6 Oct 2011 at 12:51

GoogleCodeExporter commented 9 years ago
Actually, it's pretty superficial, but what really bugs me is how the 0x73 
sticks out from everything else. What if we made the magic header "\x00sNaPpY", 
with the nul byte at the front? Thus:

0x00 stream header - the remaining six bytes must be "sNaPpY".
0x01 uncompressed body
0x02 compressed body

Anything else is reserved.

For any header not starting with 0x00, the remaining six bytes are a 
2-byte body length (up to 32768) and a 4-byte checksum (of the body's bytes on 
the wire, i.e. compressed).

Original comment by nigel.ta...@gmail.com on 6 Oct 2011 at 1:57

GoogleCodeExporter commented 9 years ago
I forgot to say that I chose to checksum the compressed bytes because the body 
for an extension frame may or may not be compressed, but it's unspecified how 
that is indicated by the opening header byte, and so a version 1.0 decoder 
won't know when to decompress the body, if it needed to checksum the 
uncompressed bytes.

Original comment by nigel.ta...@gmail.com on 6 Oct 2011 at 3:04

GoogleCodeExporter commented 9 years ago
What do you mean by version 1.0 decoder? Is there already an established format 
that we need to be backwards compatible with?

The two primary reasons I can see for making the checksum on the uncompressed 
data are:

 1. Wider protection; you'll guard against not only implementation errors, but also bit-flip errors during compression.
 2. You can run it in parallel with the compression if you need to.

I'm not sure if a maximum _compressed_ size of 32768 bytes makes sense; that 
essentially _forces_ the compressor to add logic to copy the uncompressed data 
if it didn't compress, which is an extra copy you don't want. If you really 
can't live with 65535 as the limit, I'd suggest MaxCompressedLength(32768) = 
38261.
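
The 38261 figure follows from the worst-case bound that the C++ library's `snappy::MaxCompressedLength` is understood to use, namely 32 + n + n/6 with integer division. A small check, with an illustrative function name:

```python
def max_compressed_length(source_len):
    """Worst-case Snappy output size for source_len input bytes (assumed bound)."""
    return 32 + source_len + source_len // 6


bound_for_32k = max_compressed_length(32768)  # 32 + 32768 + 5461
```

The result still fits comfortably in an unsigned 16-bit length field.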

Original comment by sgunder...@bigfoot.com on 6 Oct 2011 at 9:34

GoogleCodeExporter commented 9 years ago
There is no established format that we need to be backwards compatible with. By 
"version 1.0 decoder", I mean whatever we decide to do here.

What we have been discussing in the last dozen or so comments on this bug is a 
format that allows for extensions, but does not define any. If we decide to add 
an extension in the future (e.g. speeding up random seeks into a .snappy file), 
I'd simply like to ensure that any such extension won't break the decoding 
algorithm we decide upon here.

As for 'forcing' the compressor to add copy logic, I don't think it's 
problematic. The compression code can't compress directly to a sink because it 
needs to precede the data with a header saying how many data bytes to expect, 
and you don't know the length until after you've done the compression. Thus, it 
needs to compress to a buffer in memory.

The source (uncompressed) bytes are also already buffered in memory. 
Compression does not modify the source bytes. Thus, being able to write an 
uncompressed frame simply requires being able to choose which buffer to copy to 
the sink. I don't think that there's any unnecessary copying.

Original comment by nigel.ta...@gmail.com on 6 Oct 2011 at 12:04

GoogleCodeExporter commented 9 years ago
> You can run it in parallel with the compression if you need to.

Symmetrically, if the checksum is over the _compressed_ bytes, you can run it 
in parallel with the decompression.

But my not-based-on-any-experiments expectation is that I/O bandwidth would be 
the bottleneck in practice, and the time spent on checksumming would be 
relatively insignificant.

Original comment by nigel.ta...@gmail.com on 6 Oct 2011 at 12:11

GoogleCodeExporter commented 9 years ago
Here's another thinking-out-loud comment.

If you don't see the need to send uncompressed frames, and you limit the 
compressed body to 40960 bytes (0xA000), you can shorten the header to six 
bytes. The first two bytes form a little-endian uint16.

If that uint16 is 65535, the remaining four bytes must be "sNpY", and the body 
has zero length. The stream must start with at least one of these frames.

If that uint16 is in [0, 40960], that uint16 is the length of the compressed 
body, and the remaining four bytes are the little-endian checksum.

If that uint16 is in [40961, 65534], then this is an extension. The remaining 
four bytes (as a little-endian uint32) is the body length: the number of bytes 
to skip if the extension is unrecognized. If an extension uses a checksum, that 
checksum is given in the frame body instead of the frame header.
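
The three cases of this six-byte header can be told apart as in the sketch below. The function name and the returned tuples are hypothetical; only the ranges and the "sNpY" magic come from the comment above.

```python
import struct

STREAM_MAGIC = b"sNpY"
MAX_COMPRESSED = 40960  # 0xA000


def classify(header):
    """Return (kind, details) for a 6-byte frame header."""
    if len(header) != 6:
        raise ValueError("a frame header is exactly 6 bytes")
    (first,) = struct.unpack_from("<H", header, 0)
    rest = header[2:]
    if first == 0xFFFF:
        if rest != STREAM_MAGIC:
            raise ValueError("bad stream header")
        return ("stream_header", None)
    (value,) = struct.unpack("<I", rest)
    if first <= MAX_COMPRESSED:
        # first is the compressed body length, value is the checksum.
        return ("compressed", (first, value))
    # Extension: first is the extension code, value is the number of
    # body bytes to skip if the extension is not recognized.
    return ("extension", (first, value))
```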

Original comment by nigel.ta...@gmail.com on 7 Oct 2011 at 1:23

GoogleCodeExporter commented 9 years ago
> limit the compressed body to 40960 bytes

I forgot to also say that the uncompressed body is at most 32768 bytes.

Original comment by nigel.ta...@gmail.com on 7 Oct 2011 at 1:34

GoogleCodeExporter commented 9 years ago
FYI.

> I forgot to also say that the uncompressed body is at most 32768 bytes.

I tested the uncompression speeds for various uncompressed body sizes (such as 
8k, 16k, 32k, 64k, ...) by using snzip.
The best size was 64k on my Linux box. The test data was an uncompressed Linux 
kernel tarball.

The best size will depend on the hardware spec. I guess it depends on the
ratio of I/O bandwidth to CPU speed.

Original comment by kubo.tak...@gmail.com on 7 Oct 2011 at 3:24

GoogleCodeExporter commented 9 years ago
proposal based on yours:
 * frame header is six bytes
 * uint16 fh0 is first two bytes (little endian)
 * uint32 fh1 is next four bytes (little endian)

if PREDICT_FALSE(fh0 & 0xc000) { /* special frame */
  if PREDICT_FALSE(fh0 == 65535) {
    /* stream header; fh1 must contain "sNpY" */
  }
  else {
    /* fh1 is length of frame body [0..2^32)
       fh0 == 65534: frame body is uncompressed data
       fh0 == 65533: frame body offset 0 .. (fh1 - 4) is uncompressed data.
                     frame body offset (fh1 - 4) .. fh1 is crc32c of the data.
       40960 <= fh0 <= 65532: extension */
  }
}
else { /* not a special frame; is a compressed data frame */
  /* compressed_length = fh0
     fh1 contains crc32c of uncompressed version of the data */
}

Original comment by daniel.l...@gmail.com on 8 Oct 2011 at 12:33

GoogleCodeExporter commented 9 years ago
correction:
if PREDICT_FALSE(fh0 >= 0xc000) { /* special frame */

Original comment by daniel.l...@gmail.com on 8 Oct 2011 at 12:46

GoogleCodeExporter commented 9 years ago
arggh, second correction; just can't get that line right:
if PREDICT_FALSE(fh0 >= 0xa000) { /* special frame */

Original comment by daniel.l...@gmail.com on 8 Oct 2011 at 12:56

GoogleCodeExporter commented 9 years ago
I have released snzip.
https://github.com/kubo/snzip

This is basically the same as the snzip posted in comment 8.
The difference is that this one is written in C, not C++.

I'm not attached to the current snzip format.
I released it to test various formats.

Original comment by kubo.tak...@gmail.com on 8 Oct 2011 at 8:24

GoogleCodeExporter commented 9 years ago
snzip was updated to support file formats of snappy-java and Dain's snappy in 
java.
https://github.com/kubo/snzip

I may add the stream format proposed in comment 32 in a week or so.

Original comment by kubo.tak...@gmail.com on 9 Oct 2011 at 9:09

GoogleCodeExporter commented 9 years ago
I still prefer the proposal in comment #24 over the ones in #29 or #32. I still 
don't see the problem in allowing uncompressed frame bodies, and in #32, I 
don't like how the checksum for uncompressed frames is in different places (if 
present at all) compared to the checksum for compressed frames.

For me, #24 is still the most regular: 1 byte flags, 2 byte length, 4 byte 
checksum.

Original comment by nigel.ta...@gmail.com on 10 Oct 2011 at 6:39

GoogleCodeExporter commented 9 years ago
I have stopped adding the format proposed in comment #32 to snzip.
Here is another proposal. It is simple and extensible.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

= Overview

A stream is a sequence of frames. A frame consists of frame type (1
byte), data length (2 bytes, little endian) and data. Frame types are
defined as follows:

frame type:

  0x00         stream header. The data MUST be "snappy".
  0x01         compressed data frame
  0x02         uncompressed data frame
  0x03         end of stream
  0x04 - 0x3F  reserved
  0x40 - 0x7F  implementation-specific types

  0x80         implementation name, such as "snzip 0.0.3."
  0x81         comment
  0x82         checksum (CRC-32)
  0x83         checksum (CRC-32C)
  0x84 - 0xBF  reserved
  0xC0 - 0xFF  implementation-specific types

An implementation MUST support 0x00 - 0x03, SHOULD support 0x82 and MAY
support other types. If it finds an unsupported type, it SHOULD stop
reading the stream when "(type & 0x80) == 0" and SHOULD ignore it when
"(type & 0x80) != 0".

The data length before compression MUST be less than or equal to 32k.
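
The frame layout and the unsupported-type rule above can be sketched as a small reader. This is an illustration only: the function name is made up, the reader treats just 0x00 - 0x03 as supported, and it does not decompress anything or enforce the 32k limit.

```python
import struct

SUPPORTED = {0x00, 0x01, 0x02, 0x03}


def iter_frames(stream):
    """Yield (frame_type, data) for supported frames in a byte string."""
    pos = 0
    while pos < len(stream):
        if len(stream) - pos < 3:
            raise ValueError("truncated frame header")
        # 1-byte frame type, then 2-byte little-endian data length.
        frame_type, length = struct.unpack_from("<BH", stream, pos)
        pos += 3
        data = stream[pos:pos + length]
        if len(data) != length:
            raise ValueError("truncated frame data")
        pos += length
        if frame_type not in SUPPORTED:
            if frame_type & 0x80:
                continue  # dispensable frame: ignore it
            # Unsupported frame needed for decoding: stop reading.
            raise ValueError("unsupported frame type 0x%02x" % frame_type)
        yield frame_type, data
```

For example, a stream consisting of the header frame, a comment frame (0x81), an uncompressed data frame and the end-of-stream frame yields three frames, with the comment silently dropped.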

= Simple case

A stream MUST start with a stream header frame "\x00\x06\x00snappy" and
end with an end-of-stream frame "\x03\x00\x00". A simple stream without
checksum is:

Example 1: simple stream without checksum

  Frame 1:  "\x00\x06\x00snappy" (stream header)
  Frame 2:  compressed or uncompressed data frame
          ....
  Frame 100:  compressed or uncompressed data frame
  Frame 101:  "\x03\x00\x00" (end-of-stream)

= Checksum

I choose CRC-32 for checksum because CRC-32 is widely used and Java
supports it as a standard class java.util.zip.CRC32. You can use
CRC-32C as an option if you want to use the new CRC32 instruction in
SSE4.2 to speed things up. Note that the checksum data may be ignored if a
reader doesn't support it.

To use checksum (CRC-32), a checksum start frame "\x82\x00\x00" SHOULD
be just after the stream header frame and a checksum data frame
"\x82\x04\x00" + (little endian 4-byte data) SHOULD be just before the
end-of-stream frame.

The checksum data MUST be calculated from all bytes in frames between
the previous nearest checksum start frame and a frame just before the
checksum data frame.

Example 2: A checksum for a stream

  Frame 1:  "\x00\x06\x00snappy" (stream header)
  Frame 2:  "\x82\x00\x00" (checksum start)
  Frame 3:  compressed or uncompressed data frame
          ....
  Frame 101:  compressed or uncompressed data frame
  Frame 102:  "\x82\x04\x00" + checksum of Frame 2 to Frame 101
  Frame 103:  "\x03\x00\x00" (end-of-stream)

If a reader implementation supports the checksum type, it MUST start
checksum from the first checksum start frame, update the checksum
value for each frame and compare it with the value in a checksum data
frame.

Any number of checksum data frames MAY be inserted after a checksum
start frame. Any number of checksum start frames MAY be inserted at any
place.

If a second or succeeding checksum start frame is found, the checksum
value MUST be reset.

Example 3: checksum for each frame

  Frame 1:  "\x00\x06\x00snappy" (stream header)
  Frame 2:  "\x82\x00\x00" (checksum start)
  Frame 3:  compressed or uncompressed data frame
  Frame 4:  "\x82\x04\x00" + checksum of Frame 2 and Frame 3
  Frame 5:  compressed or uncompressed data frame
  Frame 6:  "\x82\x04\x00" + checksum of Frame 2 to Frame 5
          ....
  Frame 101:  compressed or uncompressed data frame
  Frame 102:  "\x82\x04\x00" + checksum of Frame 2 to Frame 101
  Frame 103:  "\x03\x00\x00" (end-of-stream)

Example 4: Reset checksum value after each checksum data

  Frame 1:  "\x00\x06\x00snappy" (stream header)
  Frame 2:  "\x82\x00\x00" (checksum start)
  Frame 3:  compressed or uncompressed data frame
  Frame 4:  "\x82\x04\x00" + checksum of Frame 2 and Frame 3
  Frame 5:  "\x82\x00\x00" (checksum start)
  Frame 6:  compressed or uncompressed data frame
  Frame 7:  "\x82\x04\x00" + checksum of Frame 5 and Frame 6
          ....
  Frame 100:  "\x82\x00\x00" (checksum start)
  Frame 101:  compressed or uncompressed data frame
  Frame 102:  "\x82\x04\x00" + checksum of Frame 100 and Frame 101
  Frame 103:  "\x03\x00\x00" (end-of-stream)

This checksum scheme is optimized for "Example 2."
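
One possible reading of the running-checksum rules is sketched below. It assumes that "all bytes in frames" means each frame's header plus data, that the checksum start frame itself is included (as the examples suggest), and that earlier checksum data frames are folded into the running value too. `zlib.crc32` provides plain CRC-32 (type 0x82); the helper names are illustrative.

```python
import struct
import zlib

CHECKSUM = 0x82


def frame(frame_type, data=b""):
    """Build one raw frame: type (1 byte), length (2 bytes LE), data."""
    return struct.pack("<BH", frame_type, len(data)) + data


def verify(frames):
    """Check every checksum data frame; return how many were verified."""
    running = None  # None until the first checksum start frame is seen
    verified = 0
    for raw in frames:
        frame_type, length = struct.unpack_from("<BH", raw, 0)
        if frame_type == CHECKSUM and length == 0:
            running = 0  # checksum start frame: reset the running value
        elif frame_type == CHECKSUM and length == 4 and running is not None:
            (expected,) = struct.unpack_from("<I", raw, 3)
            if expected != running:
                raise ValueError("checksum mismatch")
            verified += 1
        if running is not None:
            # Fold the whole frame (header + data) into the running CRC-32.
            running = zlib.crc32(raw, running)
    return verified
```

A writer would mirror this: keep the same running CRC-32 while emitting frames and write its current value whenever it emits a checksum data frame.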

= Implementation-specific frame types

0x40 - 0x7F and 0xC0 - 0xFF are freely used by implementations.
If one of them is included in a stream, an implementation name frame
(0x80) SHOULD be in the stream.

The type number SHOULD be between 0x40 and 0x7F if the frame data is
necessary to decode the stream, such as an encryption key.
The type number SHOULD be between 0xC0 and 0xFF if the frame data is
dispensable to decode the stream, such as a timestamp.

Original comment by kubo.tak...@gmail.com on 18 Oct 2011 at 1:06

GoogleCodeExporter commented 9 years ago
I'm not sure if I can agree this would be "simple". For instance, the support 
for two different checksums to please a given compressor implementation seems 
awfully complex to me.

I'm also not sure if we need to standardize comments or creators or multi-block 
checksums separate from the blocks themselves; what's the real-world use case for 
this? The two useful real-world use cases I know of currently that need a 
framing format like this (outside of Google, where we already have other 
solutions in place) are “pipe through SSH” and Hadoop's usage. If we can 
make something simple that covers these reasonably efficiently, and keeps some 
extensibility, that would probably be the best.

I agree, however, that 0x00 for the stream header is the most elegant. So 
here's my proposal:

0x00 - header (as in your proposal; must be "\x00\x06\x00snappy")
0x01 - compressed block (max 32768 bytes uncompressed data, max 65531 bytes 
compressed data)
0x02 - uncompressed block (max 32768 bytes data)
0x03-0x7f - reserved, fatal errors for 1.0 decoders
0x80-0xff - reserved, skippable by 1.0 decoders

All blocks have a little-endian two-byte length. Compressed and uncompressed 
blocks both begin with the CRC32c of the uncompressed data (this is why the 
0x01 block is max 65531 and not 65535).

There is explicitly no EOF marker, to make concatenation simple.

I think this should cover all the use cases I've seen presented so far, with 
the minimal amount of complexity (and it should be very close to what Hadoop 
already has implemented, as far as I understand). If snzip wants a block for 
its own metadata use (comments, creator, etc.) I'd be happy to allocate 0x80 to 
them for further sub-specification, which they can use for whatever they want.

Original comment by se...@google.com on 18 Oct 2011 at 1:26

GoogleCodeExporter commented 9 years ago
I still think sNaPpY is better because it better facilitates something like 
Boyer-Moore for efficiently locking onto those envelopes if we were to use this 
for high availability streaming projects.

Also, you can peek 2 bytes from a stream (via get, peek, unget) to get a 2- 
byte magic number.  How distinguishing is \x00\x06 relative to other file 
formats?  What does file/libmagic say?

Original comment by scholars...@gmail.com on 19 Oct 2011 at 3:31

GoogleCodeExporter commented 9 years ago
The proposal in comment #39 sounds good to me. My one complaint is that I would 
change "snappy" to "sNaPpY".

As for a 9-byte magic header, I think it's just as good as PNG's 8-byte magic 
header.

Original comment by nigel.ta...@gmail.com on 19 Oct 2011 at 8:38

GoogleCodeExporter commented 9 years ago
I can change to sNaPpY if people want; I don't see the big win, but it's not a 
big loss either.

The classical magic number is four bytes long; two is not going to be unique 
almost no matter what you do. Unfortunately 0x00 0x06 is reserved as “TTComp 
archive data” in magic(5). How about taking 0xff instead of 0x00? That 
doesn't seem to match anything, and fits nicely in with “everything 0x80-0xff 
is skippable”. (0x80 is taken for “8086 relocatable (Microsoft)”.)

So:

0x00 - compressed block
0x01 - uncompressed block
0x02-0x7f - reserved, unskippable
0x80-0xfe - reserved, skippable
0xff - header

I can write up a semi-formal spec for this and stick it in the archive if 
people want.

Original comment by se...@google.com on 19 Oct 2011 at 10:02

GoogleCodeExporter commented 9 years ago
One suggestion to the proposal in comment #42.
We need an EOF marker block.
If a compressed file is accidentally truncated exactly at the end of a
block, we cannot detect the truncation without the EOF marker block.
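
The truncation problem can be demonstrated with a toy reader. Everything here is hypothetical: the frames carry raw bytes with no checksum, and 0xFE is used as an EOF type value purely for the demonstration, since no value has been agreed on at this point in the thread.

```python
import struct

DATA_TYPE = 0x01
EOF_TYPE = 0xFE  # hypothetical value, for illustration only


def frame(frame_type, data=b""):
    """Build one raw frame: type (1 byte), length (2 bytes LE), data."""
    return struct.pack("<BH", frame_type, len(data)) + data


def read_all(stream, require_eof):
    """Concatenate data frames; optionally insist on a trailing EOF frame."""
    pos, out, saw_eof = 0, bytearray(), False
    while pos < len(stream):
        if len(stream) - pos < 3:
            raise ValueError("truncated inside a frame header")
        frame_type, length = struct.unpack_from("<BH", stream, pos)
        body = stream[pos + 3:pos + 3 + length]
        if len(body) != length:
            raise ValueError("truncated inside a frame body")
        pos += 3 + length
        saw_eof = frame_type == EOF_TYPE
        if frame_type == DATA_TYPE:
            out += body
    if require_eof and not saw_eof:
        raise ValueError("stream ended without an EOF frame")
    return bytes(out)


full = frame(DATA_TYPE, b"abc") + frame(DATA_TYPE, b"def") + frame(EOF_TYPE)
cut = full[:6]  # truncated exactly at the end of the first frame
```

Without the EOF requirement, `read_all(cut, require_eof=False)` returns `b"abc"` with no error, so the loss of the second frame goes unnoticed; with `require_eof=True` the same input raises.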

Original comment by kubo.tak...@gmail.com on 19 Oct 2011 at 1:00

GoogleCodeExporter commented 9 years ago
Hi,

We've resolved the EOF issues in a separate mail thread. I've attached my current 
draft of the tentative spec.

There may or may not be an official stream compressor in the future, but it 
will not be part of the first commit.

Original comment by se...@google.com on 25 Oct 2011 at 10:51

Attachments:

GoogleCodeExporter commented 9 years ago
Although I did say that I would agree with you if the format were designed as a 
network protocol, I don't agree with it as a file format.
But I'll let the issue rest anyway. My requirements and yours are 
different.

I just want to make sure of one thing.
Does the spec use the CRC-32C checksum defined by RFC 3720 section B.4?
Or does it use masked values, as "Snappy written in pure java"(*1) does?
*1 
https://github.com/dain/snappy/blob/master/src/main/java/org/iq80/snappy/Crc32C.java
I guess the former because it just says CRC-32C.

Well, one more thing.
What is the standard file extension name?
 gzip -> .gz
 bzip2 -> .bz2
 snappy -> .snappy???

Original comment by kubo.tak...@gmail.com on 25 Oct 2011 at 12:21

GoogleCodeExporter commented 9 years ago
We should find a standard reference for CRC-32C, yes. The iSCSI RFC you linked 
to might be the authoritative reference?

We should use masked values, as you say. I'll update.

If people are happy with using a longer-than-three-character extension, .snappy 
would be fine by me.

Original comment by se...@google.com on 25 Oct 2011 at 12:32

GoogleCodeExporter commented 9 years ago
Updated with CRC-32C reference and masking. (It is okay to use the same masking 
constants as others, right?)

Original comment by se...@google.com on 25 Oct 2011 at 1:00

Attachments:

GoogleCodeExporter commented 9 years ago
Looks good to me.  

I also think an EOF frame would be useful for detecting truncated streams (a 
problem we are having right now).  In the case of a concatenated file, the only 
legal frame after an EOF frame would be the stream identifier frame, and the 
other way around.  This would make the decoder a bit more stateful, but I think 
the benefit of detecting truncated streams outweighs this annoyance.

One final thing, I think we should formally agree on the value of the http 
Accept-Encoding header.  I'd go with just "snappy" here, but don't have a 
strong preference.

Original comment by d...@iq80.com on 26 Oct 2011 at 12:08

GoogleCodeExporter commented 9 years ago
I suggest using 0xfe for the EOF marker since it's next to 0xff.

Original comment by electrum on 26 Oct 2011 at 12:13

GoogleCodeExporter commented 9 years ago
Which http Accept-Encoding header? Are people really proposing to snappy-encode 
HTTP requests? (Why?)

Original comment by se...@google.com on 26 Oct 2011 at 1:36