Compressing individual documents with a dictionary produced from a sampled batch to get better compression ratio

facebook / zstd

Zstandard - Fast real-time compression algorithm

http://www.zstd.net

Other

23.71k stars 2.11k forks source link

Compressing individual documents with a dictionary produced from a sampled batch to get better compression ratio #97

Closed KrzysFR closed 8 years ago

KrzysFR commented 8 years ago

I'm very excited to see work being done on dictionary support in the API, because this is something that could greatly help me solve a pressing problem.

Context

In the context of a Document Store, where we are storing a set of JSON-like documents that are sharing the same schema. Each document can be created, read or updated individually in a random fashion. We would like to compress the documents on disk, but there is very little redundancy within each document, which yields very poor compression ratio (maybe 10-20%). When compressing batchs of 10s or 100s documents, the compression ratio gets really good (10x, 50x or sometimes even more), because there is a lot of redundancy between documents, from:

the structure of the JSON itself which has a lot of ": ", or ": true, or {[[...]],[[...]]} symbols.
the names of the JSON fields: "Id", "Name", "Label", "SomeVeryLongFieldNameThatIsPresentOnlyOncePerDocument", etc..
frequent values like constants (true, "Red", "Administrator", ...), keywords, dates that start with 2015-12-14T.... for the next 24h, and even well-known or frequently used GUIDs that are shared by documents (Product Category, Tag Id, hugely popular nodes in graph databases, ...)

In the past, I used femtozip (https://github.com/gtoubassi/femtozip) which is intended precisely for this use case. It includes a dictionary training step (by building a sample batch of documents), that is then used to compress and decompress single documents, with the same compression ratio as if it was a batch. Using real life data, compressing 1000 documents individually would give the same compression ratio as compressing all 1000 documents in a batch with gzip -5.

The dictionary training part of femtozip can be very long: the more samples, the better the compression ratio would be in the end but you need tons of RAM to train it.

Also, I realized that femtozip would sometimes offset the differences in size between different formats like JSON/BSON/JSONB/ProtoBuf and other binary formats, because it would pick up the "grammar" of the format (text or binary) in the dictionary, and only deal with the "meat" of the documents (guids, integers, doubles, natural text) when compressing. This means I can use a format like JSONB (used by Postgres) which is less compact, but is faster to decode at runtime than JSON text.

Goal

I would like to be able to do something similar with Zstandard. I don't really care about building the most efficient dictionary (though it could be nice), but at least being able to exploit the fact that FSE builds a list of tokens sorted by frequency. Extracting this list of tokens may help in building a dictionary that will have the most common tokens in the training batch.

The goal would be:

For each new or modified document D, compress it AS IF we were compressing SAMPLES[cur_gen] + D.json, and only storing the bits produced by the D.json part.
When reading document D, decompress it AS IF we had the complete compressed version of SAMPLES[D.gen] + D.compressed, and only keeping the last decoded bits that make up D.

Since it would be impractical to change the compression code to be able to know which compressed bits are from D, and which from the batch, we could aproximate this by computing a DICTIONARY[gen] that would be used to initialize the compressor and decompressor.

Idea

Start by serializing an empty object into JSON (we would get the json structure and all the field names, but no values)
Use this as the the initial "gen 0" dictionary for the first batch of documents (when starting with an empty database)
After N documents, sample k random documents and compress them to produce a "generation 1" dictionary.
Compress each new or updated document (individually) with this new dictionary
After another N documents, or if some heuristic shows that compression ratio starts declining, then start a new generation of dictionary.

The Document Store would durably store each generations of dictionaries, and use them to decompress older entries. Periodically, it could recycle the entire store by recompressing everything with the most recent dictionary.

Concrete example:

Training set:

{ "id": 123, "label": "Hello", "enabled": true, "uuid": "9ad51b87-d627-4e04-85c2-d6cb77415981" }
{ "id": 126, "label": "Hell", "enabled": false, "uuid": "0c8e13a5-cdc8-4e1f-8e80-4fee025ee59c" }
{ "id": 129, "label": "Help", "enabled": true, "uuid": "fe6db321-cddd-4e7f-b3d6-6b38365b3e2a" }

Looking at it, we can extract the following repeating segments: { "id": 12.., "label": "Hel... ", "enabled": ... e, "uuid": " ... " }, which could be condensed into:

{ "id": 12, "label": "Hel", "enabled": e, "uuid":"" } (53 bytes shared by all docs)

The unique part of each documents would be:

...3...lo...tru...9ad51b87-d627-4e04-85c2-d6cb77415981 (42 bytes)
...6...l...fals...0c8e13a5-cdc8-4e1f-8e80-4fee025ee59c (42 bytes)
...9......tru...fe6db321-cddd-4e7f-b3d6-6b38365b3e2a (40 bytes)

Zstd would only have to work on 42 bytes per doc, instead of 85 bytes. More realistic documents will have a lot more stuff in common than this example.

What I've tested so far

create "gen0" dictionary with "hollow" JSON: { "id": , "foo": "", "bar": "", ....} produced by removing all values from the JSON document.
using ZSTD_compress_insertDictionary, compressing { "id": 123, "foo": "Hello", "bar": "World", ...} is indeed smaller than without dictionary.
looking at cctx->litStart, I can see a buffer with 123HelloWorld which is exactly the content specific to the document itself that got removed when producing the gen0 dict.

Maybe one way to construct a better dictionary would be:

compress the batch of random and complete document (with values)
take K first symbols ordered by frequency descending
create dictionary by outputing symbol K-1, then K-2, up to 0 (I guess that if most frequent symbol is to the end of the dictionary, offsets to it would be smaller?)
maybe one could ask for a target dictionary size, and K would be the number of symbols needed to fill the dictionary?
What I'm not sure about
I don't know how having a dictionary would help with larger documents above 128KB or 256KB. Currently I'm only inserting the dictionary for the first block. Would I need to reuse the same dictionary for each 128KB block?
What is the best size for this dictionary? 16KB? 64KB? 128KB?
ZSTD_decompress_insertDictionary branches off into different implementation for lazy, greedy and so on. I'm not sure if all compression strategy can be used to produce such a dictionary?

Again, I don't care about producing the ideal dictionary that produces the smallest result possible, only something that would give me about better compression ratio, while still being able to handle documents in isolation.

Cyan4973 commented 8 years ago

Thanks for detailed feedback @KrzysFR .

The goal would be:

For each new or modified document D, compress it AS IF we were compressing SAMPLES[cur_gen] + D.json, and only storing the bits produced by the D.json part.

When reading document D, decompress it AS IF we had the complete compressed version of SAMPLES[D.gen] + D.compressed, and only keeping the last decoded bits that make up D.

I believe the current dictionary API is compliant with this objective, in both direct and buffered mode. Obviously, the selection and generation of SAMPLES[cur_gen] is completely outside of its scope, and is therefore managed directly by the calling program.

Your suggestion regarding a moving dictionary changing gradually as compression seems to deteriorate looks good to me. Sure, there's a risk of instability, but that's part of the challenge.

Now, if your request is to get a tool similar as femtozip to build a dictionary, I'm afraid I currently have nothing to offer... yet. Although I have some ideas on how to proceed, such tool remains a fairly complex object to create, which will require a lot of availability. Considering all other remaining tasks to get to zstd v1, it's unclear if this one gets top priority. I'll look for guidance.

In the meantime, things you could try :

A basic method which gives good results in spite of its simplicity is to sample a training data, taking a few stripes of 16-64 bytes randomly from it, and just stick them together as a dictionary. Strangely enough, it gives visible results.
An even simpler method : take a few random samples, here is your dictionary.
Not sure what is the output format of femtozip, but if model.fzm is the raw dictionary content, ready for consumption for zlib, then it should work as is for zstd too. Even if there are a few specific metadata, it should work well enough too. The important thing is that it should not be encoded in any specific way.

Edit : finally found the format described. There are a few metadata at the beginning, but more importantly, there are some entropy tables encoded at the end which are not usable by anything else than fzip. In order to avoid that, it's enough to instruct --dictonly (as indicated by ./fzip -h) to only get the raw dictionary content.

As a sidenote, now that I've read this page, femtozip seems to do almost exactly what I had in mind for a dictionary builder. It's so close that I had a "meh" moment. I guess there are not that many ways to achieve the same objective.

Anyway, I'll have to think again if it's worth developing a new dictionary builder tool of rather improve this one. It looks already fairly good. The only major limitation is the size of dictionary (< 64 KB), but it should prove enough for a large number of use cases. The entropy part shall also be entirely rebuilt for zstd.

Edit 2 : well, according to this page, even dictionary size can be selected (it's not mentioned on ./zstd -h). What else ?

I don't know how having a dictionary would help with larger documents above 128KB or 256KB. Currently I'm only inserting the dictionary for the first block. Would I need to reuse the same dictionary for each 128KB block?

Currently, only the frame interface is exposed. The block interface also exists internally, but it's not exposed yet. You probably have the skill and experience to find it and use it directly, so now I'm unsure at which level you are using Zstd.

From a frame perspective, each frame is independent. That means, each frame must be re-initialized with a dictionary. The frame will internally cut input into blocks of 128 KB is need be.

From a block perspective, each block can reference previous ones, up to the beginning of the frame. In which case, when you start a second block, it's able to look into previous block AND into the dictionary beyond it.

The final limit is the window (search) size. It starts at 512 KB (zstd -1) but can grow up to 32 MB (for compatibility with 32 bits) and beyond (for 64-bits implementations only). Which means that, in contrast with gzip, the dictionary size has almost unlimited size.

What is the best size for this dictionary? 16KB? 64KB? 128KB?

In theory, the larger the dictionary size, the better the compression. However :

There are some diminishing returns. Once you have caught repetitive parts of a syntax, what remain are id elements that tend to have some common parts by accident. It's less and less viable to use that as dictionary content. At which size is such a limit achieved ? Well, it depends ... :) I would hazard that beyond a few tens of KB, most of the easy gains should be already there.
The larger the dictionary to load, the higher the memory consumption, and the longer it takes to initialize search structures.

The decompression part is much more lucky here : it just needs to reference the place of the dictionary, without any extra work. A single dictionary is enough and can remain in place for multiple threads to work in parallel.

ZSTD_decompress_insertDictionary branches off into different implementation for lazy, greedy and so on. I'm not sure if all compression strategy can be used to produce such a dictionary?

Yes, all of them.

KrzysFR commented 8 years ago

Thanks for all these explanations.

Regarding femtozip I used a .NET port by Ayende Rayen (https://github.com/ayende/Rhea.Compression) that may or may not differ from the actual C version, and only for R&D and prototyping. The dictionary builder was especially slow, but I remember that compression was decent and decompression very fast.

From a frame perspective, each frame is independent. That means, each frame must be re-initialized with a dictionary. The frame will internally cut input into blocks of 128 KB is need be.

I'm pretty sure documents would be compressed using a single frame which should simplify things a lot (and most documents will be less than 128KB in size). I expect that using a dictionary is only when you are compressing very small things anyway.

Which means that, in contrast with gzip, the dictionary size has almost unlimited size.

That's something I didn't know. This can indeed simplify building the dictionary because we would not have to make it fit into the window size, and there is no risk of having everything slide out of the window after the first hundred KB of compressed data.

Are there any point in trying to have very short offsets when compressing? If I had a dictionary of size 256KB with the most frequently use token at the start of the dictionary, would this be less efficient than having that token at the end (closer to the data) ?

The larger the dictionary to load, the higher the memory consumption, and the longer it takes to initialize search structures.

I've done some testing this afternoon, by creating a "hollow" JSON template from an existing dataset (merging every documents into a single one and replacing values with null/empty).

Compressing each documents with this initial dictionary is WAY slower than compressing without. By that I mean 2 milliseconds without versus 20 seconds with dictionary for one hundred 2KB documents (x10,000 !?)

I'm not exactly sure if I'm doing something wrong or not. Regular compression is done by calling into ZSTD_compress which is blazing fast, while with the dictionary I had to use the ZSTD_compressBegin/ZSTD_compressContinue combo, using a ZSTD_CCtx.

~~A quick profling runs does show that the bulk of the time is spent somewhere in msvcrt probably in memory allocation/free:~~ (profiler showing wrong stacktrace removed) (5.8sec for 31 calls is an average of 190ms per compression??)

The major difference I see is that ZSTD_compress has a context allocated on the stack, while ZSTD_createCCtx calls calloc (which must be in msvcrt120.dll). This is probably something silly but I did not see this before because I've never used ZSTD_CCtx directly before from .NET code.

Anyway, the compression ratio with this dictionary (about 2.2 KB of "hollow" JSON fields names and structure) are:

RAW: 660 KB (sampling of 100 random documents)
zstd -1 of the whole batch: 51 KB (1:13, the target goal)
sum of zstd -1 of each individual documents: 175 KB (1:3.7, that's already pretty good! gzip gave me 10% in earlier tests!)
sum of zstd -1 of each individual documents + dictionary: 126 KB (1:5.2, better but still 2.5 larger than batched).

This indicates to me that doing the extra work on builder a better dictionary would still be a significant gain.

Cyan4973 commented 8 years ago

Are there any point in trying to have very short offsets when compressing?

Yes. Small distances are easier to describe than long ones, and tend to cost less. So it's still useful to have most common elements at the end of the dictionary, hence closer to the file to compress.

Compressing each documents with this initial dictionary is WAY slower than compressing without.

Slower, surely, but not by as much as described. So there must be a problem somewhere. Let me investigate.

KrzysFR commented 8 years ago

Ahah ok, i "fixed" the compression issue. Let's say I forgot to add the compressionLevel parameter to the PInvoke signature of ZSTD_compressionBegin which would be called with whatever was on the stack. I could understand why compressionLevel 100 hundred billion would take slightly longer.

Maybe by bad luck the value was between 0 and 20, but if not it would mean that the compressionLevel is not properly validated by ZSTD_compressionBegin?

Anyway, now the compression with dictionary is 140KB instead of 126 KB which is not much gain for a dictionary that should perfeclty cover about a 1/3rd of each documents (2.2KB dict for 6.6KB docs).

Cyan4973 commented 8 years ago

Any compression level value > MAX is corrected into == MAX. So it should be the same as requiring 20.

Which is basically enough to explain the vast difference in speed observed.

KrzysFR commented 8 years ago

I whould have thought that passing a level of 0xDEADBEEF would return an error instead of silently consuming 20 seconds of CPU (from expected 2ms)...

Also, since this has nothing to do with memory allocations, this means that I cannot trust at all my profiler regarding native callstacks :( oh well...

Cyan4973 commented 8 years ago

I whould have thought that passing a level of 0xDEADBEEF would return an error instead of silently

yes, it could, it's a matter of interface convention.

My initial reasoning was that maximum level wasn't settled and could change from one version to another, so rather than forcing users to follow these variations, it looked simpler to default to closest one. By the same logic, a negative compression level is also translated into 1.

But of course, it could be argued that it should instead trigger an error.

If the scale of compression levels remain stable, then I guess the situation change, and it sounds better to flag that as an error. Just, let's wait a bit more time to ensure the scale is now really stable .

Cyan4973 commented 8 years ago

There is another side effect to consider :

Compression modes are selected by a range of options which can be directly managed through _advanced() variants of compression functions. These options are pretty complex, so they have been simplified into compression level, which is basically an index into a pre-calculated table.

However, there are several tables. The reason is, the heuristics selected for large file absolutely suck on small data blocks. So, instead, several sets of parameters were created, depending on input size.

Input size is a necessary ingredient within ZSTD_compress(), so it's automatically taken care of. However, when using the streaming variant, required by shared dictionary, nothing is provided by default. In such a case, zstd uses the "default" table, that is the one for large files.

There is a work-around : use the _advanced() variant, in association with ZSTD_getParams(), and provide the combined size of dictionary + data block as a source size hint. It should make a better job, selecting more efficient parameters, especially for small blocks.

KrzysFR commented 8 years ago

Looks like there is an issue currently with using ZSTD_getParams() through a dll from a managed language: since ZSTD_parameters is a struct returned by value, I'm not exactly sure how I can define a valid method signature from the .NET side: seems like the struct is currently 32 bytes long, and the largest value type in .NET is 16 bytes. I could probably replicate the ZSTD_parameters C struct as an equivalent .NET struct, but I'm afraid about the C compiler optimizing the layout of the fields in some weird ways... I'm also not able to do sizeof(ZSTD_parameters) which could change at any time...

It would be easier if the method would return a ZSTD_parameters*, or would take a pointer to a user allocated space and copy the struct there, but how to deal with the sizeof(..) issue? Looks like an API break :( I did not catch this issue before since I wasn't using the _advanced functions and the ZSTD_CCtx/ZSTD_DCtx are not impacted by this.

There are some methods that already take a ZSTD_parameters* as argument. Is there a reason why ZSTD_getParam() returns a struct instead of a pointer? (pointer could allow modifying the table directly?)

Cyan4973 commented 8 years ago

I'm afraid about the C compiler optimizing the layout of the fields in some weird ways...

It should not. A C89/C99/C11 compiler cannot select to "optimize" a structure layout, layout is clearly described by the relevant standard, and it's strictly forbidden to play with it, as it is an essential part of ABI.

I'm also not able to do sizeof(ZSTD_parameters) which could change at any time...

Sorry, I didn't pay attention to such limitation. I guess it's related to using a DLL from .NET. I suspect the only "good" way is to expose a create method.

Is there a reason why ZSTD_getParam() returns a struct instead of a pointer?

For convenience. Also because of an old advise from John Carmack to make code easier to debug and maintain by borrowing some principles from functional programming.

But these objectives are secondary, and shouldn't prevent usage of the interface to begin with.

It would be easier if the method would return a ZSTD_parameters*

Yes, possibly. Though, my initial suggestion was merely to feed the result directly to the init function, so that :

ZSTD_compressBegin(ctx, dst, dstSize, cLevel);

becomes

ZSTD_compressBegin_advanced(ctx, dst, dstSize, ZSTD_getParams(cLevel, dictSize+blockSize));

Not sure it that's possible from .NET, but it's a valid C construction. In above proposition, it's not necessary to access / allocate ZSTD_parameters structure. Or do you explicitly want to access the structure content ?

(pointer could allow modifying the table directly?)

This is a const memory region, so it can only be read.

There are some methods that already take a ZSTD_parameters* as argument.

Indeed, for example : void ZSTD_validateParams(ZSTD_parameters* params);

In this case, the objective is to modify params, in order to ensure it fits within authorized limits. Hence it was passed as a ptr.

Now, it's true that it's also possible to write it this way : ZSTD_parameters ZSTD_validateParams(ZSTD_parameters params); and it would look "more functional". And maybe it should be changed to look this way to keep the interface coherent. Just it felt more natural to pass a ptr in this case.

the other example is on the decompression side : size_t ZSTD_getFrameParams(ZSTD_parameters* params, const void* src, size_t srcSize);

It was selected this way because both params and function result are needed. Function result can be an error code, so it must be size_t. Hence there is no other solution than ptr to provide "params".

Looks like an API break :(

Note that anything within zstd_static.h should be considered breakable as long as we are in beta status (v0.x serie). After all, it's necessary to test and make mistakes in order to improve.

KrzysFR commented 8 years ago

A C89/C99/C11 compiler cannot select to "optimize" a structure layout, layout is clearly described by the relevant standard, and it's strictly forbidden to play with it, as it is an essential part of ABI.

I guess I'm used to the JIT optimizing the layout at runtime in safe contexts (where you are not allowed to get offset to members of types by default) in order to align pointers (required) and fill gaps. As an example: struct { int32 A; void* B; int32 C } can be laid out as { A, B, C } on 32 bits with no waste, but would be better laid out as { B, A, C } on 64 bits to remove gaps (since B must always be aligned on 64 bits addresses). Having the JIT automatically do that for you is nice... except when you have to respect the layout of a struct defined in another non-managed API :) That's what LayoutKind.Explicit is made for

I guess it's related to using a DLL from .NET.

I should stress that this is not specific to .NET but probably most type-safe JITed VMs (on Windows probably?).

I guess we could say that when building a DLL, all the informations in the .h gets erased, and only the pointers to the functions entry points are kept. Hence why I don't have access to #defines, structs, or static tables, only pointers to exported functions.

When you are consuming such a dll from C/C++ you can include the original .h and if you have the matching version, things should work OK.

But in .NET, for example, you cannot use a .h so you have to provide your own version via PInvoke, where you substitute C types with equivalent .NET types.

That's why, in my wrapper, I have to use IntPtr to emulate pointers (since I don't have the original struct definitions and don't like void*) and UIntPtr to emulate size_t, because they are the only value types in .NET whose size changes automatically to 32 or 64 bits (at runtime after JITing). All other types have fixed size, and more complex struct-like types need special treatment by the VM which must marshal them everytime you cross the boundary. Hence why it's faster to only rely on simple value types (which require no marshalling) or pointers (which most of the times with C libraries are opaque pointers, from the point of view of the app).

_Edit: forgot to say that most .NET assemblies are compiled for "AnyCPU", and are only specialized to 32 or 64 bits at runtime by the JIT. This means that when C# interop code is compiled, it must be platform neutral (can't use #if / #else) and rely on a loader to select the proper zstdlib_x##.dll variant. This is especially problematic for BigEndian vs LittleEndian but fortunately I haven't have to deal with that yet !_

.NET supports "unsafe" structs with explicit layout, in order to interop with native libraries, so I think I would be able to make it work with the current code (at the cost of extra cpu for the marshalling of the struct between native and managed code, but this is only at init time).

Though, if you change the struct definition (to add or remove fields), then the wrapper code will break the stack at runtime. The .NET CLR should be able to catch this and throw an exception, but that's not pretty.

Cyan4973 commented 8 years ago

Though, if you change the struct definition (to add or remove fields), then the wrapper code will break the stack at runtime.

Indeed, static definitions are always a problem for DLL. I'll likely keep that part into zstd_static.h for this reason, but zstd.h will need a DLL compatible equivalent, such as a create method.

When it's absolutely necessary to define structures into .h, I tend to give them a bit of headroom, for potential future expansions. This is the solution selected for LZ4F and it has worked well so far, allowing several interface updates without breaking ABI.

KrzysFR commented 8 years ago

So I was able to make the ZSTD_getParam() work as intended, and I get the correct values in the fields. I also tweaked ZSTD_compress_advanced to be able to pass the dictionary buffer so that I can reuse the existing compression code.

But now I'm getting access violations on a specific document (reproducible in Release and Debug build) somewhere inside ZSTD_compressBlock_fast_generic (mls=4) if I specify a dictionary, but OK with no dictionary. It works fine for a few dozens documents before this one. I'm in the process of installing the Windows SDK to be able to run it with windbg (oh joy!) because without it I can't tell much more.

Cyan4973 commented 8 years ago

But now I'm getting access violations on a specific document

Maybe there's a bug to fix. If the document and dictionary can be shared, I can try to debug it on my side too.

KrzysFR commented 8 years ago

With windbg I get the access violation in BIT_addBits, because nbBits is equal to 0xCCCC (so its an overflow in mask[], not the source or destination.

This is because symbolTT in FSE_encodeSymbol (called by ZSTD_compressSequences) contains garbage. I also see that uninitialized variables on the stack also have value 0xCCCC... so I suspect that FSE_encodeSymbol is reading from unitialized part of the stack somewhere.

I've read that .NET clears newly allocated memory (including the stack?) with 0xCC bytes in Debug mode, which is compatible with what I see.

I would have to review the sample document to see if it does not contain sensible data before sending it.

Cyan4973 commented 8 years ago

sure !

Nice investigation, understanding why FSE_encodeSymbol() is reading from unitialized memory can become a fairly complex issue.

KrzysFR commented 8 years ago

And of course, running zstd.exe from the command line with the same input and dictionary does NOT fail. and gives me the expected result. I'm starting to think that this is not the document itself, but some combination of memory alignements and .NET Pinvoke weirdness with passing struct by value as argument.

Cyan4973 commented 8 years ago

sounds reasonable

Cyan4973 commented 8 years ago

For information, at some point during development of streaming interface, ZSTD_compressBegin() prototype looked like this :

size_t ZSTD_compressBegin(ZSTD_CCtx* cctx, void* dst, size_t maxDstSize, int compressionLevel, unsigned long long srcSizeHint);

Note the last parameter. Using srcSizeHint would provide information to make the algorithm select the appropriate table of parameters, and is basically what is achieved when doing : ZSTD_compressBegin_advanced(ctx, dst, dstSize, ZSTD_getParams(cLevel, dictSize+blockSize));

It was later removed in favor of ZSTD_compressBegin_advanced() in order to reduce number of functions exposed, hence complexity of interface. So now, it's either simple (just compression Level) or advanced (complete control over all parameters).

While this looks good, the _advanced() version requires a structure to manipulate the numerous and potentially evolving parameters. However, I did not expected that a function with struct return or argument type would cause so many problems when invoked from a different (managed) language.

Hopefully, it can be sorted out, but if not, or if too complex, maybe a different interface will be preferable.

KrzysFR commented 8 years ago

The way I see it, if you start adding features, like dictionaries, and so on, you will need to add more and more arguments to the _advanced variant. Maybe, since the function takes in a ZSTD_CCtx*, we could simply call a bunch of ZSTD_context_setFoo(...) for each new features (so right now the params and dictionaries), BEFORE calling ZSTD_compress_advanced, this would be forward compatible. But right now the ctx is reset by ZSTD_begin... which would erase all settings.

Regarding the access violation, I just realized that when invoked from the commandline, the code goes through ZSTD_compressBlock_fast_extDict_generic but when I call ZSTD_compress_advanced (changed to call ZSTD_compress_insertDictionary), the code goes through ZSTD_compressBlock_fast_generic instead, which calls a different set of functions.

My dictionary is allocated in a different buffer than the source, and even though they are probably close in memory (allocated in the same gen0 heap by .NET GC) they should not be contiguous (I'm allocating buffers rounded to the next power of 2).

KrzysFR commented 8 years ago

Changing the C code by adding fprintf everywhere does change the result, but now I'm seeing a buffer overflow inside ZSTD_compressBlock_fast_generic in the main while(ip <ilimit) loop with ip=0x00000001118D24EB and ilimit = 0x00000001118D24EC. Meaning that ip ends up on the very last byte of the buffer, which will overflow in ZSTD_hashPtr or in MEM_read32(ip+1-offset_1) below.

I guess in this case, the buffer is right on top of the stack and touches guard pages right after it, while in some other cases there is extra space after it (filled with 0xCCCC) and it continues with garbage input that break later functions (what I was seeing above).

I'm sorry I cannot give better details because I have really no way of debugging managed and C code at the same time.

Cyan4973 commented 8 years ago

ZSTD_compress_advanced() is not compatible with streaming, nor dictionary. It's only meant to compress a single memory segment, like its sibling ZSTD_compress().

The expectation for dictionary compression is to use the streaming interface only : Init by calling ZSTD_compressBegin_advanced() then ZSTD_compress_insertDictionary() to enable dictionary compression, then loop on ZSTD_compressContinue() to compress content, and finish with ZSTD_compressEnd().

The way I see it, if you start adding features, like dictionaries, and so on, you will need to add more and more arguments to the _advanced variant

Well, that's why ZSTD_parameters was created, to add new arguments without modifying function prototypes. But really I wasn't expecting that manipulating structures would be such an issue outside of C. After all, even zlib uses structures within its interface.

KrzysFR commented 8 years ago

I was doing that sequence of calls before, and it did work fine. I guess if ZSTD_compress_generic does not expect a dictionary, it could explain the weirdness I see. It seems that it works most of the times though.

I thought that calling a single C function that does everything would be quicker than calling multiple functions from .NET (which incur a marshalling cost everytime). If ZSTD_compress_advanced is not the correct method to call, then I could at least try to add a different function.

I guess we should see if this will fix my issue, before incriminating the ZSTD_parameters struct.

Cyan4973 commented 8 years ago

OK, so I guess what you were expecting is :

1) a kind of preparation function, to declare dictionary compression, generating a context which can be duplicated multiple times for the many documents to come

2) a single function call to compress each document, like ZSTD_compress(), but for dictionary compression.

I suspect I can modify the interface to look this way.

KrzysFR commented 8 years ago

This is my current state of affairs after fprintfing ALL THE THINGS!

When called from ztsd.exe (OK): ZSTD_compressBlock_fast_extDict_generic(..., mls=4, lowLimit=0, dictLimit=2253)

When called from a custom method that calls ZSTD_compressBegin_advanced + ZSTD_compress_insertDictionary + ZSTD_compressContinue + ZSTD_compressEnd (FAIL with AVex): ZSTD_compressBlock_fast_extDict_generic(..., mls=4, lowLimit=-32479, dictLimit=-27011)

I'm seeing NEGATIVE values for lowLimit and dictLimit in my case (-32479 and -27011). Is that possible if they are defined as U32? (fprintf shows negative because of %d I guess)

Looking at the code in ZSTD_compressContinue, the negative values come from the first if (src != zc->nextSrc) that set lowLimit = 0 and dictLimit = -27011.

Logs: (1) is start, (2) after first IF, (3) after second IF, (4) after third IF right before call to ZSTD_compress_generic

> ZSTD_compressContinue(op=0x000000BC6C92ABF5 11520, src=0x000000BC62C00450 10916)
(1) ip=000000BC62C00450, base=000000BC62C00450, nextSrc=000000BC62BF9ACD, lowLimit=0, dictLimit=0
(2) ip=000000BC62C00450, base=000000BC62C06DD3, nextSrc=000000BC62BF9ACD, lowLimit=0, dictLimit=-27011
(3) lowLimit=0, dictLimit=-27011
(4) lowLimit=10916, dictLimit=-27011
...
ZSTD_compressBlock(..., lowlimit -32479 dictlmit -27011)

Cyan4973 commented 8 years ago

I'm seeing NEGATIVE values for lowLimit and dictLimit in my case (-32479 and -27011). Is that possible if they are defined as U32?

Nope. It's necessarily a bug.

the negative values come from the first if (src != zc->nextSrc)

This code detects that memory segments are not contiguous, which is normal since the first one is the dictionary, and the second is the input to compress.

In theory, after this if statement : dictLimit = (U32)(zc->nextSrc - zc->base) == sizeOfDictionary

Strangely enough, in your test, zc->base > zc->nextSrc, which is supposed to be impossible.

So maybe something is modifying zc->base (or zc->nextSrc) in an unexpected way. It would happen before ZSTD_compressContinue(), hence probably within ZSTD_compress_insertDictionary().

Cyan4973 commented 8 years ago

Looking into ZSTD_compress_insertDictionary(), I see this line :

zc->base += ip - zc->nextSrc;

Both zc->base and zc->nextSrc are supposed to be == NULL at this stage. So it should result in : zc->base = ip == src == startOfDictionary

Now, maybe zc->base and/or zc->nextSrc are not NULL .... It would not be expected, but could explain the strange value of base observed later on, within ZSTD_compressContinue() .

KrzysFR commented 8 years ago

I had to replace it with zc->base = (const BYTE*) (ip - zc->nextSrc); to fix a warning treated as error by VS.

It does fix the issue with zc->lowLimit which is now positive, but zc->dictLimit is till negative (-27011).

For reference, I have src ptr = 0x994FE62B80 (10916 bytes) and dict ptr = 0x994FE5B930 (2253 bytes) (parameter={W14 C14 H14 S1 L4 SZ10916})

raw log:

WithDictionary ctx=0x0000009958E2AC50 with dst=0x0000009959B6B0D0 11525, src=0x000000994FE62B80 10916, dict=0x000000994FE5B930 2253, prms={W14 C14 H14 S1 L4 SZ10916}
> Adding dictionary 0x000000994FE5B930 of size 2253
> ZSTD_compressContinue(op=0x0000009959B6B0D5 11520, src=0x000000994FE62B80 10916)
(1) ip=000000994FE62B80, base=000000994FE62B80, nextSrc=000000994FE5C1FD, lowLimit=0, dictLimit=0
(2) ip=000000994FE62B80, base=000000994FE69503, nextSrc=000000994FE5C1FD, lowLimit=0, dictLimit=-27011
(3) lowLimit=0, dictLimit=-27011
(4) lowLimit=10916, dictLimit=-27011
  > ZSTD_compress_generic ip 000000994FE62B80, op 0000009959B6B0D5, remaining 10916, maxDstSize 11520
ZSTD_compressBlock(0000009959B6B0D8 11517, 000000994FE62B80 10916, -32479 -27011)
  > ZSTD_compressBlock_fast_extDict 4
ZSTD_compressBlock_fast_extDict_generic(0000009959B6B0D8 11517, 000000994FE62B80 10916, seqtore 0000009958E2ACB0, mls 4, lowLimit=-32479, dictLimit=-27011)
> ip=000000994FE62B84, ilimit=000000994FE6561C h=13276, matchIndex=0, current=-27007, repIndex=-27010

KrzysFR commented 8 years ago

And here is my test function added (with all the fprintf glory)

size_t ZSTD_compress_withDictionary(ZSTD_CCtx* ctx,
    void* dst, size_t maxDstSize,
    const void* src, size_t srcSize,
    const void* dict, size_t dictSize,
    ZSTD_parameters params)
{
    BYTE* const ostart = (BYTE*)dst;
    BYTE* op = ostart;
    size_t oSize;

    fprintf(stdout, "WithDictionary ctx=0x%p with dst=0x%p %llu, src=0x%p %llu, dict=0x%p %llu, prms={W%d C%d H%d S%d L%d SZ%llu}\r\n", ctx, dst, maxDstSize, src, srcSize, dict, dictSize, params.windowLog, params.contentLog, params.hashLog, params.searchLog, params.searchLength, params.srcSize);

    /* Header */
    oSize = ZSTD_compressBegin_advanced(ctx, dst, maxDstSize, params);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;
    maxDstSize -= oSize;

    if (dict)
    {
        fprintf(stdout, "> Adding dictionary 0x%p of size %llu\r\n", dict, dictSize);
        oSize = ZSTD_compress_insertDictionary(ctx, dict, dictSize);
        if (ZSTD_isError(oSize)) return oSize;
    }

    /* body (compression) */
    ctx->base = (const BYTE*)src;
    fprintf(stdout, "> ZSTD_compressContinue(op=0x%p %llu, src=0x%p %llu)\r\n", op, maxDstSize, src, srcSize);
    oSize = ZSTD_compressContinue(ctx, op, maxDstSize, src, srcSize);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;
    maxDstSize -= oSize;

    /* Close frame */
    fprintf(stdout, "> ZSTD_compressEnd(op=0x%p %llu)\r\n", op, maxDstSize);
    oSize = ZSTD_compressEnd(ctx, op, maxDstSize);
    if (ZSTD_isError(oSize)) return oSize;
    op += oSize;

    fprintf(stdout, "> done! %llu\r\n", op - ostart);

    return (op - ostart);
}

Cyan4973 commented 8 years ago

ah ok. now I understand. I did not caught that you were building your own function in C, directly modifying internal state.

This line should be removed : ctx->base = (const BYTE*)src;

This line is likely inherited from original ZSTD_compress_advanced() which is only meant to compress a single memory segment. In this specific case, it makes sense, because it doesn't call ZSTD_compress_continue() but ZSTD_compress_generic() instead.

In a streaming scenario, base is handled within streaming functions. It's a critical variable to track indexes and search limits, and should not be manipulated directly.

Anyway, your sample is interesting, as it helps me understanding what kind of interface you would like to use.

KrzysFR commented 8 years ago

This fixed it! Yes, I copy/pasted the ZSTD_compress_advanced() because it looked like what I wanted. Sorry for the mixup.

I'm now able to compress with a dictionary using a single method, and get the following results on a complete dataset:

RAW: 41.2 MB
zstd BATCH: 2.8 MB
zstd 1by1: 10.9 MB (in 235 ms)
zstd 1by1 with dict: 9.4 MB (in 204ms)

Looking at the dataset, I see that there is a lot of stuff in the values that could be in the dictionary. I should also try the other simpler methods for generating a dummy dictionary to see what's best.

KrzysFR commented 8 years ago

Creating a dictionary by concatenating 5 random documents gives me a dictionary of 29KiB and compression of 8.2 MB (better than 9.4 MiB). But if I take 10 documents I get 9.2 MB (marginally better) and with 25 documents I get 9.5 MB (worse).

For the other technique of taking random striped of 64 bytes from sample documents:

32 KB dictionary: 8.9 MB
64 KB dictionary: 8.8 MB
128 KB dictionary: 8.7 MB
256 KB dictionary: 9.0 MB

Taking larger stripes (128 bytes) gives 8.5 MB with a dict of 128KB.

All in all, all methods seems to gravitate around the same compression ratio, which is not much better than no dictionary at all (25% better), but still ~3.5X worse than compressing the whole batch.

Cyan4973 commented 8 years ago

Could be worth trying femtozip with option --dictonly ...

Cyan4973 commented 8 years ago

Out of curiosity, what's the (average) size of each document ? From the sample provided, it looks as if they are ~5 KB each.

KrzysFR commented 8 years ago

Average size of documents is 6.6 KB but can vary from 2 KB to 20KB.

Ok, so fzip --dictonly from a sample of 100 docs give me a 64KB dictionary, which, when used with zstd -1 gives a total size of 8.2 MB. Using fzip --compress with the same model gives a file size of 3.2 MB (against 2.8 MB for zstd -1 on the complete batch).

Seems like only having the dictionary for zstd is not enough, and I guess you'd need to capture more of the internal state of FSE/huff0 to achieve the same kind of results... That's a bummer :)

I tried using zstd -5 with the femtozip dictionary, and I get 7.6 MB (950 ms cpu time) and 7.4 MiB with zstd -9.

Cyan4973 commented 8 years ago

Well, basically, yes. That's by the way the last major feature I want to include within zstd before finalizing the format.

At the start of each compressed document, there are entropy tables descriptors. Block headers, basically. The cost of these headers vary, but expect a few hundred bytes.

When using default block size of 128 KB, it doesn't matter much. But for very small blocks, it does. If it consumes say 200 bytes, that's about 20% for a 1K document (after compression). There are also some "lost opportunities", since the algorithm will not even try compression as soon as it detects that header + compressed size is going to be worse or marginally better than raw size.

So the idea is to create extremely compressed entropy descriptors, with diff maps from a model, which can be calculated while building the dictionary.

femtozip goes one step further by not generating any entropy table descriptor within the compressed document, using only the one generated with the dictionary. That's almost the same thing : it's a diff map full of zero.

So yes, that part is the last one planned for zstd to reach v1 status.

KrzysFR commented 8 years ago

I've tried something else: compressing the dictionary as a first block, then compressing the data as a second block (so two calls to compressContinue), and then keeping only the last block. This gives me about the same total size as if I was calling _insertDictionary, which I guess is expected. You are probably not keeping any state from one block to the other, except for the window size which can see the previous block source text.

I'm not going to mess with dictionary just yet if this only gives me 20-25% gain, and will probably have to patiently wait for v1 :)

Lots of thanks for taking some time to help me with this!

Cyan4973 commented 8 years ago

Thanks for pushing the limits @KrzysFR. It really helps to plan future directions.

KrzysFR commented 8 years ago

By the way, I think that you can keep ZSTD_parameters as a struct, it seems to be working fine once I used the API correctly. It's currently at 32 bytes on x64 which is still small enough I'd say.

If the application treat it as an opaque struct, there is no issue with managed languages. Only problem is if you add or remove fields, changing the size of the struct. In this case, managed wrappers would need to update their code to stay in sync. If you don't change it often, or only with major versions, then it seems acceptable to me.

KrzysFR commented 8 years ago

Continuing with my tests with dictionaries, I've seen a some weird things that may or may not be expected.

I have three tests:

typical JSON document (1KB) with a mix of natural text names, product sku, ip addresses,...)
natural text (french song lyrics, 3KB)
string of random digits (512KB)

For each, I'm trying to compress with a naïve dictionary: "hollow" json structure, a few french generic words from the song, and for the random digits the string "0123456789876543210" (expected to be ineffective).

I'm getting the following results:

JSON: raw = 1150, nodict = 702 (61%), dict= 483 (42%)
Song: raw = 3138, nodict = 1724 (55%), dict = 1858 (59%, worse!)
Digits: raw = 524293, nodict = 221212 (42%), dict = 221212 (42%)

Both JSON and Digits test look good. The digits still compress because they only use 10 symbols out of 256 possible. The dictionary does nothing for the digits, and compression is the same

But I'm a bit surprised by the fact that the song lyrics actually compress worse with the naive dictionary.

I used my trusty ASCII art data visualization toolset to dump byte distributions for each compressed buffer, and I get the following result:

note: you may need to c/p into a fixedsys text editor or something, best viewed on back text/white background

## Generic JSON document
Raw    :  1,150 [                                $ @     :: :m~==ooo==;;==~m      =.==;+.:=~.=~=;=~+o=~-.--  +    m;=oM===o .o=ooo oooo..-=.. .     .                                     .                        ..                                                            ]
Zstd   :    702 [@oo=== =m=~~o=~=o=~==oo=o~~~o:: =~:==== =~o~ ===o=:=~==~=:=~=~=:~oM=o ===~=~~~ =o :~=~~=~= =~~:==M~=~:::====o:~=oo:===~ :=~~M=:=~o=:=::~m::o~ =~ :o:==~::~o=~=o=o:~:=~=~~= ~~ =: === m~==~==: = ==~ ::===o=:~:::~o~=~~=:~ ~:m=====:o=::o:=oo~=~ ===~~=:~= ~=== ~]
Zstdict:    483 [@m======m;==o;===o o;==;;; ;;= ======;=o==  =o=;=o===; o==;o==   m = =m= == =o=;o;o===;=; ;== ; =;o=  = =;;=;==; =o;;o;;=;;o    ==o;====o   ==o=;;==;;;;; o= =;=;=; ;= =   ;o;=;=o=o;=;;==;;;;;;oo==;;  =;===;;=o =;=;= =;=;====om; ;==  ;;  == ;;= =;== = ;;= ;]

## Song Lyrics
Raw    :  3,138 [          o  o                  @-;    =    =~=      .    -    : : +==.. -= =;~:=::=--;          m=oo$===m=~mommo=MMmm=-~=                                      ; .     ;o:   .     =    . .       o                                                            ]
Zstd   :  1,724 [@B+m=mom=~;om;o+=:m;==$=~m=+o=m+m=+=m+m:mm=Mom=o==m==o=:==o=====m+o=+mmo=~++mo=o+=oo;~~+moo=~+===oM+==o+:+=~++o:B~+mm~=+o==+~~; =m=mm+m=mm=m~==;++m~~++mm==o==;m:+=o;m~o+=o~;om;=+~=m=~o+=++;~+=moo++omm=o~:=m=~o+=o+oom=~++o~=++om=+;m;=-+;=~~:+;=;oo~;+~~=;= =]
Zstdict:  1,858 [MM:oooBMB~~oomB;M~o=;-==o+;~Mo:oo~o-oo+ooo=o=B++==M;M~=~B.m-~-=o+@+o;+;;;~:o+:=M;m~=~oo=+=oo~~~=-o=-+;~.+M:=;+~=oM-:M+=;=o~-o:o==o~M~o=~+o~=:o;:=:~~o=++-== +-=mM+~-;~M+==oB=~m==o=:o=~;=:+:o;=;Mom-mo=;~;=:-=:~o~-Moo+=-~=:~-=~m=o==--+o+B; -=-o++=+=.-=om;o;~.]

## Random string of digits...
Raw    :524,293 [                                                =@o=======                                                                                                                                                                                                      ]
Zstd   :221,212 [@#BB$mMm$MmoBoo=$$MMmoo=BMooo=o=$BBMM=M=m=o+m+o+BMMMm=o+m==~o+=~BoMo$oMoM=+~M==+mo=+o~+~o++~=~~~MmMoM=m=m=+~o++~oo++o~+;o+~~=~~;$MmoBomo#MooMo==MM+==+++MM===+~+moo==~=~o+~~+~~;mo===~~~=+~~+~~;BomoBoo=M=+~m=++o=+=+~~~o++~+;~;ooo==~=~=~~;+;~;o=~++;~;=+~~+;;;]
Zstdict:221,212 [@#BB$mMm$MmoBoo=$$MMmoo=BMooo=o=$BBMM=M=m=o+m+o+BMMMm=o+m==~o+=~BoMo$oMoM=+~M==+mo=+o~+~o++~=~~~MmMoM=m=m=+~o++~oo++o~+;o+~~=~~;$MmoBomo#MooMo==MM+==+++MM===+~+moo==~=~o+~~+~~;mo===~~~=+~~+~~;BomoBoo=M=+~m=++o=+=+~~~o++~+;~;ooo==~=~=~~;+;~;o=~++;~;=+~~+;;;]

How to read: Each line has 256 caracs representing the count of bytes with this value in the buffer with palette .-:;~+=omMB$#@ (not very precise, but helpful to spot patterns).

What I see immediately, is that the compressed result for JSON and songs do not use all the bytes and are clumped in multiple groups. This same effect can be seen for the random digits where you see a sawtooth pattern with cycles of 16 or 32 in length.

Is this normal? May this is because the test documents are not long enough and compress already well enough to not require using the complete range of bytes? I would have expected that the random digit once comrpessed would have a close to even byte distribution (since the original text also as an even distribution, except 1 and 2 for some reason (probably artefact of the random generator used)

Cyan4973 commented 8 years ago

Lots of things to say in these interesting and fairly detailed tests.

Compression method

I suspect you compressed using the -1 aka "fast" compression method. In which case, the selected parameters for small docs are these ones :

/*         W,  C,  H,  S,  L,  strat */
    {  0, 14, 14, 14,  1,  4, ZSTD_fast    },  /* level  1 */

which means it will be happy as soon as it finds a 4-bytes match, and blindly accept that (since it's a fast method). Good, but in some cases, it could be that the cost of offset + length is effectively larger than just 4 symbols directly compressed with Huffman. It could also be that the "small" match found at position P is not as good as the better one that would have been found at P+1.

Both these reasons can explain why "Song" sample ends up being worse with dictionary than without.

As a test, if you have direct control over ZSTD_parameters, you may want to increase searchLength to 5, in order to force matches to feature a minimum decent length to be selected. It could play favorably for this sample ... and less good for others. I'm not sure to have a good "overall solution" at this stage. Blind selection is a key characteristics of the fast algorithm. It's basically required to be fast. So the last possibility seems to play with heuristics.

Random digits

It's a specific distribution. Interesting, but pretty rare in "normal" documents. In this case, the "match repeat sequence" is basically useless. So everything get throwned at Huffman.

However, 10 evenly distributed characters don't make a clean power of 2. So, some characters will get 3 bits, others 4 bits.

Consequently, the combination of those 3bits & 4bits symbols will not be perfectly flat. 4bits symbols appear more often than they should. Hence, byte distributions will be affected, with some values being more common than others.

It's one of those rare cases for literals were FSE would do a noticeably better job. I suspect compressing "digits" with huff0 directly will produce the same layout you discovered, while using FSE would produce a much flatter byte distribution.

Now, for me the issue is that this sample is too specific; conclusions don't necessarily help to produce an algorithm with better characteristics for the more general case.

Small files

JSON and Song samples are pretty small once compressed. That means that each byte has a visible impact in your statistics. Even with a noise source, it can be expected that some bytes values are unused when the sample is < 500 bytes long.

It also means that headers have a non-negligible share of the compressed result. I'm suspecting that headers + huffman compressed literals dominate the distribution within compressed file.

Huffman compressed bytes may be subject to the same issue as described earlier. I suspect the impact should be small though, although on small samples, even a small impact can become visible.

Literals may also not be compressed at all. If their numbers are too small, it's not worth trying due to the fixed cost of huffman headers. In which case, their distribution should look like the original source.

Headers have a specific structure, which tries to be compact, but also fast to encode / decode. The distribution of bytes values within headers is not guaranteed to be perfectly flat. I wouldn't be surprised if some values were more common than others, but haven't tested lately.

That being said, it's correct to say that non-flat distribution is a hint that compression hasn't reached its upper limit. Quite clearly, yes. But zstd is also about trade-off between compression and speed, so it's not obvious to draw a line between the two.

KrzysFR commented 8 years ago

You are correct, changing the search length to 5 does fix the compression for the Song test... but now compression of the random digits is worse (240KB for -6 versus 220KB for -1)

I used compression level 6 (because it has search lengtrh 5 for both < 16KB and large blocks):

## Generic JSON document
Raw    :  1,150 [                                $ @     :: :m~==ooo==;;==~m      =.==;+.:=~.=~=;=~+o=~-.--  +    m;=oM===o .o=ooo oooo..-=.. .     .                                     .                        ..                                                            ]
Zstd   :    691 [@o~=m~o: =~o~oo=m~~:o~==~=~~o~m:==~:m o~~oo~==~:=~om:m=== =~:~=:M=o==o=~:ooo~:~=oo=o~:o=~ =o::=~om=~=~o=:mm=m=o =~:===ooo~=~ =~=:o~~~:o~o= = ~=~o= o~=:::~m:~= o=o:::=oo:o~~ o=~o:o:oo:~:=:::o=o~ooom~=~~::= o~m:=o~:=:~==~~~: o=: o~ = o=o~~:~ :o~~::~ ::o~==~:]
Zstdict:    442 [@o=moo;=o=;=om=o===;m=;;oo=; o; moo;m=;;;;;=oo=o m==;=o= = m;;;= =o; ;=;om o;o= =o  ;o;;o;; =;  ===;;o= ;;;===;o;;=o;=o  = ;m;  o = ;;oo==;==;mo=;==o= ;;=;;==o=;o; ;;o o;= ; =moo=  = ;=o; ;=ooo==; ;=;; ;  ;; o  ;;o;= ==;=;; o; =; ;  ;===o;    = ;==;;;  o==]

## Song Lyrics
Raw    :  3,138 [          o  o                  @-;    =    =~=      .    -    : : +==.. -= =;~:=::=--;          m=oo$===m=~mommo=MMmm=-~=                                      ; .     ;o:   .     =    . .       o                                                            ]
Zstd   :  1,723 [$m$o+oo@Bo=o=~M=o~m+mmo+M~=~M;+o==:~oM+;o=m;-+;M=m+o=mm+=++;~==;mooM=:::oo~oB;+++m~o-=;.BMm:~~~o+moo:;o-o~m=+.-:o=;++:::+o+;o~=oooo;m~===;mM+:o+o:oo;+mo=~:o+=;~=B+=++o:=m;=B;:o~;;oMMo--o.:+m~;m+=+~M+:o=$oomoo;:+:o=o~;~;:+.;~o:o;~++~:;::~::+oo:=oo;;===.~oo-]
Zstdict:  1,701 [$B=M=~o~oMmo==;+m~m=oo=mooo~M~==moo=@o+:=o=+===o=m~mmm+;=B;;=~=om==o=~-:M==M=:-;o++=;m$=o=;=+o=;m+moom-Bo+o==:;o:;m==;;:=+mo~=++B~m~o+~=+=oo;~=+B+M~++~+=~===:m==o+++=+:=~===o;==o=o=o~=~=+:m=====~+ommom+ooo=+==~++=o==;:+:+~;:=;~o=~o:=o=o;~-o;m;=+M++m~~;o=~=]

## Random string of digits... (note: biased for 1 and 2!)
Raw    :524,288 [                                                =@o=======                                                                                                                                                                                                      ]
Zstd   :240,668 [@BMoMoomMmm=m==oMMmmo=momm==ooo=MmMoomm=mo==ooo+mmooo==+====o===MmMoMoo=mom=o==+o====+==o=======Mo==o===o==+===+o======+=====+=+MMmoMo==Mm==m===mo==o===m==+=+=+m===o===m==+=+=+=======+===+=+++Mom=ooo=o====+==o====+=+=+++=+=+o====+=+o==+===+o=+==+++==++++++]
Zstdict:240,670 [@BMmBmM=MmooooooMMmmo=mommo==oo=BMmmoom=moo==o==Mm==o===o===o==+$mBomooom=o=m=o=mo==o===o==+==++Mmmo===+oo=======o===+=+o+===+++BMmoB===mm======moo=o===oo=+o===m=o=m==+o==+===+o==========+=+++Moo=m===o===o==+o===o+==o+=+=+=+m======+===+==++o=+==+++==+==+++]

Also, the way the random digit string is constructed introduce a bias which should obey Benford's law: I'm appending random numbers (1, 123, 123456) to a string buffer, and not random characters, so 1 is more frequent than 2, than 3, etc... This is "visible" in the =@o======= distribution where digit 1 is marked with @ (max), 2 with o (75th percentile) and the rest with = (median). This bias in the distribution is probably still visible in the result.

Random digits are of course not very common... except when you use 128-bit random UUIDs as keys for documents, when you have strings of 36 almost random digits everywhere in your JSON or XML documents (ex: "fb20205a-d823-4068-baf2-91d5cd5ad54d", "4dce2acc-660f-4064-9a7c-4c40d53c352c", "ab911aa0-0682-41e3-a7b6-d9ccb56fa591"). Each source byte as 4 bit of entropy, so these 36 chars are worth the original 128 bits, but these 128 bits are not really compressible.

My experience with JSON is that the bulk of the really unique data are GUIDs, high precision real numbers (64 bit double 3.141592653589793238) and timestamps with very high precision (2015-12-16T19:29:23.8649501Z), though timestamps usually start with common prefix "2015-12-... or even "2015-12-16T19:.." if you are creating a lot of documents per hour.

It looks like the fields of ZSTD_parameters seem to really impact (negatively or positively) the compression ratios, but each parameter is not exactly link with the compressionLevel only (currently there are 4 tables that for the same "level" gives different search lengths). This means that I cannot use a constant compressLevel if I want very specific values for search length, but don't care for the rest.

It could be usefull to have a function that would probe various parameters and return the optimum set, given a specific sample document and time budget. Would be nice to have when designing the compression stage, especially for "small" documents (where zstd would be used with data stores or small REST http queries with JSON like payloads).

Cyan4973 commented 8 years ago

Random digits are of course not very common... except when you use 128-bit random UUIDs as keys for documents,

Very good point

This means that I cannot use a constant compressLevel if I want very specific values for search length, but don't care for the rest.

Indeed. compressionLevel should be considered a "default" set of parameters, trying to match a relative speed budget. But if you want full control, you need to use _advanced() variants, and set parameters as you want them.

It could be usefull to have a function that would probe various parameters and return the optimum set, given a specific sample document and time budget.

This function almost exists : a program paramgrill tries to match the best possible parameters combinations for each compression level. That being said, it currently works on the full compression level range (from 1 to 20). A leaner version, which only targets one specific speed, can probably be done from it.

Cyan4973 commented 8 years ago

There is a new version of zstd which is likely to help this use case :

zstd_static.h exposes new prototypes for dictionary compression : ZSTD_compress_usingDict() and ZSTD_decompress_usingDict(). It's just making use of dictionary content, no entropy transport yet. But one has to start somewhere ....

Pay attention that, as a consequence, ZSTD_compress_advanced() scope has been extended and its prototype changed, hence it could break your link stage if you were using it previously.

Cyan4973 commented 8 years ago

There's a new branch of zstd which is intended to help this use case. This branch is planned to become the foundation of v0.5.

The most visible addition is the presence of a dictionary builder tool, within directory dictBuilder. It's a stand-alone command line tool, which receives a list of samples (generally through command line expansion using * wildcard character) and generate a dictionary of selectable maximum size.

The process is as follows :

Create a dictionary file using dictBuilder : ./dictBuilder path/to/samples/* -o dictFileName
Compress using the command line zstd : ./zstd fileName -D dictFileName
Compress using API : ZSTD_compress_usingDict()
Decompress using API : ZSTD_decompress_usingDict()

If you have a lot of messages / blocks to compress using the same dictionary, it can be beneficial to load it only once, and then re-use it for all blocks.

For this, you need 2 contexts : one to load a reference, the other to "run" the compression/decompression operation :

ZSTD_CCtx* referenceCCtx;
ZSTD_CCtx* workCCtx;

ZSTD_compressBegin_usingDict (referenceCCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyCCtx(workCCtx, referenceCCtx);
ZSTD_compressContinue(workCCtx, ...);
(...)
ZSTD_compressEnd(workCCtx, dst, dstCapacity);

The process for decompression is similar though slightly simpler if you do decompression in a single step :

ZSTD_DCtx* referenceDCtx;
ZSTD_DCtx* workDCtx;

ZSTD_decompressBegin_usingDict (referenceDCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_decompress_usingPreparedDCtx(workCCtx, referenceCCtx, ...);

Preliminary tests show some great gains compressing small json records.

Finally, if you are going to compress very small records into even smaller compressed messages, every byte matter. You can get rid of frame metadata, at the cost of losing compatibility withzstd command line utility. It's possible to access data directly at block level, using the new block interface. Currently, it saves 11 bytes, which can make a non negligible difference when the compressed message is < 50 bytes. Note though that the caller will have to save or know metadata (compressed size, original size, compression method) by other means.

The previous examples become :

Compression :

ZSTD_CCtx* referenceCCtx;
ZSTD_CCtx* workCCtx;

ZSTD_compressBegin_usingDict (referenceCCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyCCtx(workCCtx, referenceCCtx);
ZSTD_compressBlock(workCCtx, ...);

Decompression :

ZSTD_DCtx* referenceDCtx;
ZSTD_DCtx* workDCtx;

ZSTD_decompressBegin_usingDict (referenceDCtx, dictBuffer, dictSize);   /* load only once */

/* loop on each block */
ZSTD_copyDCtx(workDCtx, referenceDCtx);
ZSTD_decompressBlock(workDCtx, ...);

Note that blocks have a hard limit of 128 KB max.

KrzysFR commented 8 years ago

Wow, this looks very promising! Can't wait to test this tomorrow.

A quick scan over the code of dictBuilder makes me realize that there is a LOT of things under the hood. Do you plan to talk about this at sometime in a blog post maybe? It would be really interesting to understand how this thing works.

Also, from the previous discussion, are you currently only building the dictionary? Or is this also dealing with removing the entropy table descriptor at the start of the compressed output?

Regarding the use of block API or not, in the past I needed to add my own prefix before compressed data to mark the codec type and original size, and it usually end up taking 4 to 6 bytes... which means that I would only gain 5 or 6 bytes per documents... Sticking with the regular format would probably be easier I guess. I see the block API being very useful when compressing blocks with fixed size or files with a known granularity (like compressing pages of a memory dump aligned to 4KB or 64KB pages, etc...

Cyan4973 commented 8 years ago

Do you plan to talk about this at sometime in a blog post maybe? It would be really interesting to understand how this thing works.

Good points, yes, I'll write a blog post about it.

Also, from the previous discussion, are you currently only building the dictionary? Or is this also dealing with removing the entropy table descriptor at the start of the compressed output?

It also deals with table descriptors.

I see the block API being very useful when compressing blocks with fixed size or files with a known granularity

Indeed, it's a good use case.

There are also situations where the size of records is not known but bounded by construction (ex : < 4 KB). And the size of the compressed blob is necessarily known, either because it is the size of the file, or because it can be derived from a jump table which is built for random access. As for the compression method, it can also be part of the jump table metadata. In which case, no additional information is required, so block interface can be used.

Well, anyway, these are just optimizations, could be attempted at a later stage.

KrzysFR commented 8 years ago

Ok, so I reused the same dataset and training sample from my earlier tests with femtozip, and dictBuilder produced a ~35 KiB dictionary, (compared to exactly 64 KiB for fzip).

The results I have

RAW: 41.2 MB
zstd -1 BATCH: 2.8 MB
zstd -1 1by1 no dict: 10.9 MB (in 251 ms)
zstd -1 1by1 v0.5 dict: 5.0 MB (in 259ms)

note: I'm not using the trick to copy the context yet, so the compression time is slower than it could be.

Here is a recap of this test and the previous ones:

raw dataset: 41.2 MB (1 : 1)
zstd 0.5.x no dict (worst case): 10.95 MB (1 : 3.76)
zstd 0.4.x w/ femtozip dict: 9.4 MB (1 : 4.4)
zstd 0.5.x w/ dictBuilder dict: 5.0 MB (1 : 8.2)
gzip -5 single batch (best case): 3.8 MB (1 : 10.8)
fzip: 3.2 MB (1 : 12.8)
zstd 0.5.x single batch (best case): 2.77 MB (1 : 14.9)

The 0.5.x branch gives a 2x improvement over the 0.4.x branch using a dictionary, which gets us down to about 50% of the best case scenario (compressing everything in a single batch).

Femtozip still compresses better, maybe because it is producing a larger dictionary?

Cyan4973 commented 8 years ago

Thanks for this valuable feedback @KrzysFR.

Indeed, there are a number of factors that could play to even the ground with fzip.

Dictionary size : yes it matters. There is a poorly documented feature within dictBuilder called the Selectivity Level. The idea being : the larger the number, the more elements get through the filter. (Sidenote : not even sure if the name is explicit enough. Could be changed though ...) Default level is 9. If you try higher values (-S10 for example), it should generate larger dictionary ( up to specified limit).
Compression level : it seems you tested zstd at -1 level, which is its fastest. But fzip seems to be closer to gzip -9. So it sounds fair to test some higher zstd compression levels. Try -4/-5 for some middle ground, -16 for higher compression. It makes a large difference.
Once you have settled for a desirable compression level, you can tune the statistics in favor of this particular compression level. Try option -L# with # being the target level. Difference is pretty small though, so it's more a finishing touch.
I also noted fzip advantage when it comes to very small data packets. I then realized that most of this advantage comes from fzip complete absence of header information. For small packets, it makes a sizable difference (about 20 bytes) ! Hence the suggestion to use block interface to get rid of frame headers. There are still about ~6 bytes I could probably remove if the objective was to make a static format completely specific to this application. Not sure if it's a correct direction for zstd though...

KrzysFR commented 8 years ago

The results with various compression levels (same dictionary):

zstd -1 : 5.0 MB
zstd -5 : 4.2 MB
zstd -16: 3.9 MB
fzip: 3.2 MB

Below are the distributions of documents sizes I used and after compression.

The smallest compressed docs are ~150 bytes compressed, so an overhead of 10 bytes per doc (6,360 docs in the set) would only get me ~60KB out of about 4 MB. Not very significant for my use case anyway.

ASCII histograms to the rescue!

RAW INPUT SIZE DISTRIBUTION:

Min/Max: 742 .. 20,126 , Average: 6,490
Median : 5,524 (+/-1,933), StdDev: 3,666
Distrib: (1,706.4) - 3,814.7 =[ 5,524.0 ]= 9,706.7 - (13,398.8)

   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      700 - 800      |        25 |   0.393% :                                                  |   0.393% `--------- |     18,750 |
  |      900 - 1,000    |        13 |   0.204% .                                                  |   0.597% [--------- |     12,350 |
  |    1,000 - 1,200    |         9 |   0.142% .                                                  |   0.739% [--------- |      9,900 |
  |    1,200 - 1,400    |       127 |   1.997% #                                                  |   2.736% (--------- |    165,100 |
  |    1,400 - 1,600    |        52 |   0.818% +                                                  |   3.553% (--------- |     78,000 |
  |    1,600 - 1,800    |       173 |   2.720% #+                                                 |   6.274% )--------- |    294,100 |
  |    1,800 - 2,000    |        72 |   1.132% x                                                  |   7.406% )--------- |    136,800 |
  |    2,000 - 2,500    |       263 |   4.135% ##.                                                |  11.541% -[-------- |    591,750 |
  |    2,500 - 3,000    |       170 |   2.673% #;                                                 |  14.214% -(-------- |    467,500 |
  |    3,000 - 3,500    |       275 |   4.324% ##:                                                |  18.538% -]-------- |    893,750 |
  |    3,500 - 4,000    |       653 |  10.267% #####.                                             |  28.805% --]------- |  2,448,750 |
  |    4,000 - 4,500    |       413 |   6.494% ###:                                               |  35.299% ---|------ |  1,755,250 |
  |    4,500 - 5,000    |       312 |   4.906% ##=                                                |  40.204% --->------ |  1,482,000 |
  |    5,000 - 6,000    |     1,189 |  18.695% #########;                                         |  58.899% -----]---- |  6,539,500 |
  |    6,000 - 7,000    |       437 |   6.871% ###+                                               |  65.770% ------)--- |  2,840,500 |
  |    7,000 - 8,000    |       223 |   3.506% #$                                                 |  69.277% ------]--- |  1,672,500 |
  |    8,000 - 9,000    |       123 |   1.934% #                                                  |  71.211% -------[-- |  1,045,500 |
  |    9,000 - 10,000   |       341 |   5.362% ##X                                                |  76.572% -------)-- |  3,239,500 |
  |   10,000 - 12,000   |       930 |  14.623% #######;                                           |  91.195% ---------[ | 10,230,000 |
  |   12,000 - 14,000   |       346 |   5.440% ##X                                                |  96.635% ---------) |  4,498,000 |
  |   14,000 - 16,000   |       141 |   2.217% #.                                                 |  98.852% ---------] |  2,115,000 |
  |   16,000 - 18,000   |        58 |   0.912% =                                                  |  99.764% ---------> |    986,000 |
  |   18,000 - 20,000   |        13 |   0.204% .                                                  |  99.969% ---------> |    247,000 |
  |   20,000 - 25,000   |         2 |   0.031% `                                                  | 100.000% ---------@ |     45,000 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -1 OUTPUT SIZE DISTRIBUTION:

Min/Max: 175 .. 2,030, Average: 791
Median : 695 (+/-201), StdDev: 344
Distrib: (337.4) - 535.7 =[ 694.6 ]= 1,032.8 - (1,450.2)

   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      160 - 180      |         9 |   0.142% .                                                  |   0.142% `--------- |      1,530 |
  |      180 - 200      |        16 |   0.252% .                                                  |   0.393% `--------- |      3,040 |
  |      200 - 250      |         8 |   0.126% .                                                  |   0.519% [--------- |      1,800 |
  |      250 - 300      |        29 |   0.456% :                                                  |   0.975% [--------- |      7,975 |
  |      300 - 350      |       342 |   5.377% ##X                                                |   6.352% )--------- |    111,150 |
  |      350 - 400      |       292 |   4.591% ##;                                                |  10.943% -[-------- |    109,500 |
  |      400 - 450      |       214 |   3.365% #X                                                 |  14.308% -(-------- |     90,950 |
  |      450 - 500      |       232 |   3.648% #$                                                 |  17.956% -]-------- |    110,200 |
  |      500 - 600      |     1,254 |  19.717% #########&                                         |  37.673% ---]------ |    689,700 |
  |      600 - 700      |       829 |  13.035% ######=                                            |  50.708% -----[---- |    538,850 |
  |      700 - 800      |       455 |   7.154% ###x                                               |  57.862% -----]---- |    341,250 |
  |      800 - 900      |       476 |   7.484% ###X                                               |  65.346% ------|--- |    404,600 |
  |      900 - 1,000    |       474 |   7.453% ###X                                               |  72.799% -------(-- |    450,300 |
  |    1,000 - 1,200    |       853 |  13.412% ######X                                            |  86.211% --------)- |    938,300 |
  |    1,200 - 1,400    |       495 |   7.783% ###&                                               |  93.994% ---------( |    643,500 |
  |    1,400 - 1,600    |       255 |   4.009% ##                                                 |  98.003% ---------] |    382,500 |
  |    1,600 - 1,800    |        89 |   1.399% X                                                  |  99.403% ---------] |    151,300 |
  |    1,800 - 2,000    |        35 |   0.550% ;                                                  |  99.953% ---------> |     66,500 |
  |    2,000 - 2,500    |         3 |   0.047% `                                                  | 100.000% ---------@ |      6,750 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -5 OUTPUT SIZE DISTRIBUTION:

Min/Max: 159 .. 1,789, Average: 672
Median : 589 (+/-164), StdDev: 290
Distrib: (302.7) - 452.2 =[ 589.4 ]= 873.7 - (1,234.1)

   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      140 - 160      |         1 |   0.016% `                                                  |   0.016% `--------- |        150 |
  |      160 - 180      |        24 |   0.377% :                                                  |   0.393% `--------- |      4,080 |
  |      200 - 250      |        25 |   0.393% :                                                  |   0.786% [--------- |      5,625 |
  |      250 - 300      |       246 |   3.868% #&                                                 |   4.654% |--------- |     67,650 |
  |      300 - 350      |       402 |   6.321% ###:                                               |  10.975% -[-------- |    130,650 |
  |      350 - 400      |       289 |   4.544% ##;                                                |  15.519% -)-------- |    108,375 |
  |      400 - 450      |       572 |   8.994% ####=                                              |  24.513% --|------- |    243,100 |
  |      450 - 500      |       697 |  10.959% #####=                                             |  35.472% ---|------ |    331,075 |
  |      500 - 600      |     1,033 |  16.242% ########.                                          |  51.714% -----[---- |    568,150 |
  |      600 - 700      |       527 |   8.286% ####.                                              |  60.000% ----->---- |    342,550 |
  |      700 - 800      |       470 |   7.390% ###X                                               |  67.390% ------)--- |    352,500 |
  |      800 - 900      |       657 |  10.330% #####:                                             |  77.720% -------]-- |    558,450 |
  |      900 - 1,000    |       504 |   7.925% ####                                               |  85.645% --------)- |    478,800 |
  |    1,000 - 1,200    |       550 |   8.648% ####;                                              |  94.292% ---------( |    605,000 |
  |    1,200 - 1,400    |       264 |   4.151% ##.                                                |  98.443% ---------] |    343,200 |
  |    1,400 - 1,600    |        77 |   1.211% x                                                  |  99.654% ---------> |    115,500 |
  |    1,600 - 1,800    |        22 |   0.346% :                                                  | 100.000% ---------@ |     37,400 |
  `---------------------------------------------------------------------------------------------------------------------------------'

ZSTD -16 OUTPUT SIZE DISTRIBUTION:

Min/Max: 153 .. 1,634, Average: 614
Median : 539 (+/-162), StdDev: 265
Distrib: (277.5) - 414.1 =[ 539.0 ]= 795.5 - (1,132.0)

   __________________________________________________________________________________________________________________________________
  |____[ Min , Max )____|___Count___|__Percent____________________________________________________|___Cumulative________|__Weight____|
  |      140 - 160      |        23 |   0.362% :                                                  |   0.362% `--------- |      3,450 |
  |      160 - 180      |         2 |   0.031% `                                                  |   0.393% `--------- |        340 |
  |      180 - 200      |         4 |   0.063% `                                                  |   0.456% `--------- |        760 |
  |      200 - 250      |        68 |   1.069% =                                                  |   1.525% [--------- |     15,300 |
  |      250 - 300      |       402 |   6.321% ###:                                               |   7.846% ]--------- |    110,550 |
  |      300 - 350      |       371 |   5.833% ##&                                                |  13.679% -(-------- |    120,575 |
  |      350 - 400      |       493 |   7.752% ###&                                               |  21.431% --[------- |    184,875 |
  |      400 - 450      |       806 |  12.673% ######;                                            |  34.104% ---(------ |    342,550 |
  |      450 - 500      |       718 |  11.289% #####x                                             |  45.393% ----|----- |    341,050 |
  |      500 - 600      |       752 |  11.824% #####&                                             |  57.217% -----)---- |    413,600 |
  |      600 - 700      |       426 |   6.698% ###;                                               |  63.915% ------(--- |    276,900 |
  |      700 - 800      |       738 |  11.604% #####$                                             |  75.519% -------)-- |    553,500 |
  |      800 - 900      |       586 |   9.214% ####x                                              |  84.733% --------|- |    498,100 |
  |      900 - 1,000    |       393 |   6.179% ###.                                               |  90.912% ---------[ |    373,350 |
  |    1,000 - 1,200    |       394 |   6.195% ###.                                               |  97.107% ---------) |    433,400 |
  |    1,200 - 1,400    |       143 |   2.248% #.                                                 |  99.355% ---------] |    185,900 |
  |    1,400 - 1,600    |        38 |   0.597% ;                                                  |  99.953% ---------> |     57,000 |
  |    1,600 - 1,800    |         3 |   0.047% `                                                  | 100.000% ---------@ |      5,100 |
  `---------------------------------------------------------------------------------------------------------------------------------'

facebook / zstd

Compressing individual documents with a dictionary produced from a sampled batch to get better compression ratio #97

Context

Goal

Idea

What I've tested so far

What I'm not sure about

Compression method

Random digits

Small files