ericmckean / snappy

Automatically exported from code.google.com/p/snappy

Command line tool #34

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This library would likely be directly useful to a lot more people if a simple 
command line program to compress/decompress from stdin to stdout was included.

Original issue reported on code.google.com by nathan.o...@gmail.com on 18 Apr 2011 at 3:41

GoogleCodeExporter commented 9 years ago
I am considering it.  When I already have snappy compressed data in my server, 
I'd like to send it directly to clients who can handle the encoding.  

More generally, I think the same analysis that leads someone to choose snappy 
over gzip or uncompressed data could lead to a decision to use it over HTTP.

Original comment by d...@iq80.com on 26 Oct 2011 at 4:25

GoogleCodeExporter commented 9 years ago
One suggestion: would it make sense to allow a stand-alone header marker to 
serve as EOF as well (possibly with a modification to make sure it can be 
detected as EOF)?
While longer than a single byte, it would not require reserving more bytes, and 
handling is likely to be simple as in-stream markers need to be supported 
anyway.

I also agree with Dain in that one definitely would want to allow use of Snappy 
similar to gzip in all use cases, including compressing HTTP request payload 
(POSTs).

Original comment by tsaloranta@gmail.com on 26 Oct 2011 at 5:30

GoogleCodeExporter commented 9 years ago
I'm unaware if there's an RFC procedure to follow for this, but OK, I agree it 
could be useful, although I do not believe it would be supported in any major 
browser. In any case, you'd probably want to use e.g. “snappy-framed” to 
clearly distinguish it from raw Snappy data. Maybe x-snappy-framed?

Original comment by sgunder...@bigfoot.com on 2 Nov 2011 at 11:41

GoogleCodeExporter commented 9 years ago
bump; there has been a lot of good discussion about streaming formats. Where 
does this stand with getting a command line tool merged in? what's left to do?

I think decisions/discussions about which encoding headers to use for HTTP 
requests with a snappy payload are out of scope for this request.

Original comment by jehiah on 7 Dec 2011 at 4:08

GoogleCodeExporter commented 9 years ago
The streaming format itself is in internal review, and will hopefully enter the 
repository soon.

With regard to an actual command-line compressor, there are currently no plans 
to have one in the standard tree, but having a streaming format in place should 
make it easier to develop one out-of-tree for those who would wish so.

Original comment by se...@google.com on 7 Dec 2011 at 9:16

GoogleCodeExporter commented 9 years ago
thanks for the clarification/update. I understand the streaming format helps 
support having a command line utility, but it wasn't clear to me that the goal 
for this issue had changed from having that utility in this repo to only having 
the streaming support.

Original comment by jehiah on 7 Dec 2011 at 3:54

GoogleCodeExporter commented 9 years ago
Hi,

r54 contains the framing format spec. It's largely what I posted here earlier, 
but with some minor clarifications etc. that showed up in internal review.

Original comment by se...@google.com on 4 Jan 2012 at 10:47

GoogleCodeExporter commented 9 years ago
jehiah: I guess the goal for the issue remains the same for the reporter, so 
it's not going to be closed just because we have a streaming format. It would, 
however, probably be closed if an appropriate out-of-tree compressor appeared. 
(It could also be closed as WontFix if we permanently decide for some reason 
that we don't want to do this.)

Original comment by se...@google.com on 4 Jan 2012 at 4:55

GoogleCodeExporter commented 9 years ago
Can the standard be updated to include an EOF chunk (type 0xfe), per comment 
#48?

Original comment by electrum on 9 Feb 2012 at 7:31

GoogleCodeExporter commented 9 years ago
This has already been extensively discussed. The answer is that we've decided 
not to make an EOF chunk.

Original comment by sgunder...@bigfoot.com on 9 Feb 2012 at 7:37

GoogleCodeExporter commented 9 years ago
Implementation planned in C++ for the streaming format?

Original comment by k...@skomski.com on 1 Mar 2012 at 6:36

GoogleCodeExporter commented 9 years ago
Currently none, sorry.

Original comment by se...@google.com on 1 Mar 2012 at 6:43

GoogleCodeExporter commented 9 years ago
The currently defined framing format has two major inefficiencies.  First, the 
4-byte checksum is stored for each block, rather than for a larger stream as in 
other compressed formats.  Second, the checksum is stored before the data, 
requiring the compressor to hold back data or rewind the output stream to store 
it.  To correct these I propose the following new chunk types:

4.4 Compressed data without checksum (Chunk type 0x02)

Like 0x00 but without the checksum

4.5 Uncompressed data without checksum (Chunk type 0x03)

Like 0x01 but without the checksum

4.6 Checksum so far (Chunk type 0x80)

Masked CRC-32C of the decompressed data of all chunks (not including headers) 
since (but not including) the last chunk that stored a CRC-32C checksum 
(currently types 0x00, 0x01 and 0x80), going back no further than the latest 
type 0xFF chunk (inclusive).  Thus in no case will an implementation need more 
than one running CRC-32C state per stream.

4.7 Cryptographic hash begin (Chunk type 0x81)

This stores the DER encoded OID-based algorithm identifier of a cryptographic 
hash algorithm to be applied to the decompressed data of this and all 
subsequent chunks in addition to CRC-32C.  If present, this SHOULD be right 
after a type 0xFF or 0x82 chunk, but may not be if a hashed stream is 
concatenated to a non-hashed stream.

4.8 Digital signature so far (Chunk type 0x82)

A DER encoded detached PKCS#7 signature of decompressed data of all chunks (not 
including headers) since (but not including) the last chunk to store such a 
signature (currently only type 0x82) or the last type 0x81 chunk (inclusive), 
whichever is later.  Certificate trust requirements are up to the recipient.  
The use of counter-signature "unauthenticated attributes" is allowed.  The data 
hash signed by the signature must be the one specified in the most recent 
preceding type 0x81 chunk.  Chunk type 0x82 MUST NOT occur without a preceding 
chunk type 0x81.  These cryptographic concepts are all specified elsewhere.  
This chunk SHOULD be placed after any type 0x80 chunk if both are present.

Example stream 1:

0xFF stream identifier    Magic string is fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
0x03 uncompressed chunk   Data is fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
0x80 CRC-32C chunk covering the decompressed data
                          CRC-32C is then reset

Example stream 2:
0xFF stream identifier    Magic string is fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
0x03 uncompressed chunk   Data is fed to CRC-32C
0x02 compressed chunk     Decompressed data fed to CRC-32C
(CRC-32C not used or checked)

Example stream 3:
0xFF stream identifier    Magic string is fed to CRC-32C
0x81 hash identifier      OID is fed to CRC-32C and hash
0x02 compressed chunk     Decompressed data fed to CRC-32C and hash
0x02 compressed chunk     Decompressed data fed to CRC-32C and hash
0x03 uncompressed chunk   Data is fed to CRC-32C and hash
0x02 compressed chunk     Decompressed data fed to CRC-32C and hash
0x80 CRC-32C chunk covering the decompressed data
                          CRC-32C is then reset
0x82 digital signature covering all but the stream identifier

This has the following properties:

1. It is backwards compatible with the old stream format
2. Old stream readers will see the unsupported 0x02 or 0x03 chunks and stop
3. Streams can be trivially concatenated regardless of version
4. Both CPU and size overhead are smaller because checksum masking and reinit 
are done only once for a typical stream
5. In a typical compressor, the input to CRC-32c will be the magic string 
followed by the input data, allowing a completely parallel calculation 
independent of the snappy blocking and framing.
6. In a typical decompressor, the CRC-32c can be run in parallel to outputting 
the data on the fly, objecting after the fact.
7. Except for the 32K buffering for the old type 0x00 and 0x01 chunks, there is 
no need to buffer data just for the benefit of checksumming.  And a compressor 
only needs to do this if it can be configured to produce the old format.
8. A pure hardware CRC-32c (these undoubtedly exist as stock IC design blocks) 
can be easily used if extreme hardware acceleration beyond CRC-offload 
instructions is needed.
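
To make the "one running CRC-32C state per stream" idea in 4.6 concrete, here is a rough Python sketch. It is an illustration under assumptions, not part of any existing implementation: `RunningChecksum` and its method names are hypothetical, the pure-Python `crc32c` is written out only because the standard library has none, and the masking step (rotate right by 15 bits, then add 0xa282ead8) is assumed to be the same one the current framing format uses.

```python
# Sketch only: one running CRC-32C per stream, emitted as a masked
# "checksum so far" value and then reset (proposed chunk type 0x80).

def _make_table():
    """Build the byte-wise lookup table for reflected CRC-32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data, crc=0):
    """Return the CRC-32C of data, continuing from a previous value."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def mask(crc):
    """Rotate right by 15 bits and add a constant, modulo 2**32."""
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


class RunningChecksum:
    """Hypothetical helper: accumulates decompressed bytes across chunks."""

    def __init__(self):
        self.crc = 0

    def feed(self, data):
        # Called once per chunk with that chunk's decompressed data.
        self.crc = crc32c(data, self.crc)

    def emit_and_reset(self):
        # Value that a type 0x80 chunk would carry; state restarts after it.
        masked = mask(self.crc)
        self.crc = 0
        return masked
```

Because the state is carried across `feed` calls, the result does not depend on how the data was split into chunks, which is what allows the checksum to be computed independently of the blocking.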

Original comment by jb-goo...@wisemo.com on 7 May 2012 at 11:50

GoogleCodeExporter commented 9 years ago
I'm not sure if I agree with four checksum bytes per 32 kB block being a 
“major inefficiency”; that's 0.01% overhead. If you care about that sort of 
thing, you probably should not use Snappy, or at least not a framing format.
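
Spelled out, the arithmetic behind that figure is roughly the following. It measures the checksum against the uncompressed block size only and ignores the chunk header, so treat it as a back-of-the-envelope estimate.

```python
CHECKSUM_BYTES = 4
BLOCK_BYTES = 32 * 1024  # 32 kB of uncompressed data per block

# Checksum size as a percentage of the uncompressed block size.
overhead_percent = 100 * CHECKSUM_BYTES / BLOCK_BYTES
print(f"{overhead_percent:.4f}%")  # prints 0.0122%
```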

As for your other additions, I don't think the ability to digitally sign a 
Snappy file is in-scope for this bug, and I'm highly reluctant to create yet 
another way of signing files in the world. If you really have a use case for 
this, please open a separate bug, but be aware that it's quite likely to be 
closed with “won't fix”, especially as it does not look like anyone will 
write a command-line tool at all in the short term, let alone one with 
cryptographic capabilities.

Original comment by se...@google.com on 7 May 2012 at 11:58

GoogleCodeExporter commented 9 years ago
I've taken a shot at implementing the framing protocol for python-snappy.

My implementation is close to a drop-in replacement for python-zlib's 
compressobj/decompressobj interface.

Whether or not this commit gets merged into mainline python-snappy, perhaps 
someone may be interested in what I have here?
https://github.com/jtolds/python-snappy/commit/7f304a6fc96f6936fc0192932ea025aeb2b4b9c6

Original comment by jtolds on 8 Nov 2012 at 6:01

GoogleCodeExporter commented 9 years ago
Heh, rereading the comments here I realized we probably want 
https://github.com/jtolds/python-snappy/commit/5a8660198cffc5230b2ee99e1102e8128cc61f71 too

Original comment by jtolds on 8 Nov 2012 at 6:31

GoogleCodeExporter commented 9 years ago
Nice; do you know if there's a command-line client that uses 
compressobj/decompressobj? That would take this bug a long way towards 
completion.

Original comment by se...@google.com on 9 Nov 2012 at 1:14

GoogleCodeExporter commented 9 years ago
no but it would be super easy to whip up. i haven't heard from the 
python-snappy maintainer at all about getting my changes merged in though.

Original comment by jtolds on 10 Dec 2012 at 9:54

GoogleCodeExporter commented 9 years ago
whipped up: 
https://github.com/jtolds/python-snappy/commit/66211460734475f2076efff45e79ab3ecdfadb84

Original comment by jtolds on 10 Dec 2012 at 11:12

GoogleCodeExporter commented 9 years ago
also, i guess i need to star issues to get emailed followup comments. sorry for 
the turnaround time on that comment, but starred now

Original comment by jtolds on 11 Dec 2012 at 12:01

GoogleCodeExporter commented 9 years ago
I haven't tested it, just a small comment; you probably want 32 kB block size, 
not 16 kB. Snappy works by default in 32 kB blocks.
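
As a rough illustration of what I mean (a sketch, not taken from the commit; `read_blocks` is a made-up name and the block size is just a parameter), the input loop would read fixed-size blocks like this:

```python
import io


def read_blocks(stream, block_size=32 * 1024):
    """Yield successive blocks of at most block_size bytes from stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield block


# Example: 70000 bytes split into two full blocks and a remainder.
sizes = [len(b) for b in read_blocks(io.BytesIO(b"x" * 70000))]
```

Each block would then be compressed and framed on its own.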

Original comment by se...@google.com on 23 Dec 2012 at 11:17

GoogleCodeExporter commented 9 years ago
oh you're totally right, rookie mistake. i even briefly thought about it. 
"should it be 32kb? no, the length includes the checksum" which ends up 
actually kind of being a non-sequitur reason

https://github.com/jtolds/python-snappy/commit/f14a2187bf48dd34001ccc74588c8ec8116f548a

Original comment by jtolds on 23 Dec 2012 at 7:07

GoogleCodeExporter commented 9 years ago
Hi guys,

First the good news: Snappy now compresses about 3% denser! Then the bad news: 
That change necessitated a change to the framing format (3-byte offsets instead 
of 2-byte).

jtolds: I'm afraid you'll need to change your implementation :-) Note that the 
stream identifier has changed as a side effect, so you won't need to worry 
about old streams being confused with the new format.
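
For anyone writing a reader, here is a small Python sketch of walking the chunks of a framed stream. It assumes the layout in the published framing_format.txt as I understand it: a one-byte chunk type, a three-byte little-endian length, then the payload, with the stream starting with an identifier chunk whose payload is `sNaPpY`. The function name `iter_chunks` is hypothetical, and the sketch neither decompresses data nor verifies checksums.

```python
# Assumed stream identifier chunk: type 0xff, length 6, payload "sNaPpY".
STREAM_IDENTIFIER = b"\xff\x06\x00\x00sNaPpY"


def iter_chunks(buf):
    """Yield (chunk_type, payload) pairs from a framed byte string."""
    pos = 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            raise ValueError("truncated chunk header")
        chunk_type = buf[pos]
        # The length field is three bytes, little-endian.
        length = int.from_bytes(buf[pos + 1:pos + 4], "little")
        start, end = pos + 4, pos + 4 + length
        if end > len(buf):
            raise ValueError("truncated chunk payload")
        yield chunk_type, buf[start:end]
        pos = end


# Example: identifier chunk followed by one uncompressed chunk (type 0x01)
# holding a 4-byte placeholder checksum and the data b"hello".
payload = b"\x00\x00\x00\x00" + b"hello"
stream = STREAM_IDENTIFIER + b"\x01" + len(payload).to_bytes(3, "little") + payload
chunks = list(iter_chunks(stream))
```

A real reader would check that the first chunk is the stream identifier and reject the stream otherwise.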

Original comment by se...@google.com on 18 Jan 2013 at 12:18

GoogleCodeExporter commented 9 years ago
nice, will update shortly

Original comment by jtolds on 23 Jan 2013 at 11:07

GoogleCodeExporter commented 9 years ago
https://github.com/jtolds/python-snappy/commit/50ea5ab816f3830a1194271bfec406d3518eefe9

Original comment by jtolds on 9 Feb 2013 at 12:37

GoogleCodeExporter commented 9 years ago
python-snappy 0.5 now implements the latest framing format 
(http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt?spec=svn68&r=71)

Original comment by jtolds on 20 Feb 2013 at 12:11

GoogleCodeExporter commented 9 years ago
i'm not sure if anyone else has implemented the framing format, but it would be 
sweet if you could test out the python-snappy implementation to make sure it 
looks right. i'm a little concerned the only implementation i know of is just 
in the python library.

would it be worth trying to submit a c version to the mainline snappy library 
as well?

Original comment by jtolds on 20 Feb 2013 at 12:12

GoogleCodeExporter commented 9 years ago
Hi,

I tested “python snappy.py -c < README.rst” and eyeballed the output in a 
hex editor. While this obviously won't catch all off-by-ones or things like 
wrong checksum calculations, the output does look fine to me.

I don't think there's enough interest right now to warrant a C++ implementation 
from our side, and it should probably live in a spinoff repository anyway.

Original comment by se...@google.com on 20 Feb 2013 at 3:23

GoogleCodeExporter commented 9 years ago
OK, given that there are now three independent implementations of the framing 
format, and at least one of them can be invoked from the command line, I'd say 
this is fixed. I'm updating the front page to reflect that python-snappy can do 
this. Thanks to everybody for participating :-)

Original comment by se...@google.com on 14 Jun 2013 at 11:35

GoogleCodeExporter commented 9 years ago
Issue 80 has been merged into this issue.

Original comment by se...@google.com on 22 Nov 2013 at 6:15

GoogleCodeExporter commented 9 years ago
Issue 80 has been merged into this issue.

Original comment by se...@google.com on 22 Nov 2013 at 6:37