Open EamonNerbonne opened 5 years ago
Hi, I thought that training is mostly done with the console application that comes with ZSTD. Do you think that this should be done with the .NET library? Regarding the Span
Well "should" - that depends on the use case :-).
But yeah, for me it would be nice. I intend to use this to compress documents in what's essentially a document database, which means the dictionary is dynamic: it's going to be based on a sample of actual data. There are likely going to be a bunch of dictionaries (clustered somehow, e.g. based on document type and/or client), and they are likely to be regenerated occasionally (to adapt to changing data distributions, or simply to leverage the fact that time is a reasonable predictor for a compressor).
But even for a fixed database it's a little simpler if the same tool can both train the dictionary and use it.
I mean, for some people this is purely a disadvantage, because it causes some amount of library bloat. But if you're really going to leverage the small-content advantages dictionaries provide, you kind of want to be able to make dictionaries. The size bloat appears to be fairly small, based on the fact that the DLLs shipped in https://github.com/skbkontur/ZstdNet/tree/master/ZstdNet are actually much smaller than the current 1.3.8 DLLs; and in any case, if you really care about size, a more significant win is to pick a bitness rather than include both 32-bit and 64-bit. But I haven't yet checked what the bloat is using the 1.3.8 version of the codebase.
I noticed that this library and the related https://github.com/skbkontur/ZstdNet/tree/master/ZstdNet have only partially overlapping sets of functionality.
Are you interested in external contributions to fill out the gaps; and if so, how do you want those?
I could think of:
- `ZDICT_trainFromBuffer` (this would be hugely useful to me, but may require a different compilation of libzstd.dll, since the prebundled releases at https://github.com/facebook/zstd/releases aren't compiled with the optional dictBuilder package).
- `Span<T>` instead of `Stream<T>`, at least under the presumption that benchmarks show this amounts to any kind of meaningful perf win.

TL;DR: are you interested in contributions, and if so how/what kind/etc?
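For anyone weighing the `ZDICT_trainFromBuffer` item: the C function in zstd's `zdict.h` takes all training samples concatenated into a single buffer (`samplesBuffer`), plus an array with the size of each sample (`samplesSizes`, `nbSamples`). A binding would therefore have to marshal its samples into that layout before calling the native function. Below is a minimal sketch of just that packing step, written in Python for brevity. `pack_samples` and `unpack_samples` are hypothetical helper names, not part of zstd or of any .NET wrapper, and the native call itself is deliberately left out.

```python
def pack_samples(samples):
    """Concatenate samples into one flat buffer and record each length.

    Mirrors the input layout ZDICT_trainFromBuffer expects:
    one contiguous samplesBuffer plus a parallel samplesSizes array.
    """
    flat = b"".join(samples)
    sizes = [len(s) for s in samples]
    return flat, sizes


def unpack_samples(flat, sizes):
    """Split a flat buffer back into samples; useful to sanity-check the layout."""
    out = []
    offset = 0
    for size in sizes:
        out.append(flat[offset:offset + size])
        offset += size
    return out


if __name__ == "__main__":
    # Hypothetical documents standing in for rows sampled from a database.
    docs = [b'{"type":"invoice","id":1}', b'{"type":"invoice","id":2}']
    flat, sizes = pack_samples(docs)
    # The layout round-trips, so no sample boundaries are lost.
    assert unpack_samples(flat, sizes) == docs
    print(len(flat), sizes)
```

In a real wrapper, the flat buffer and the sizes array would be pinned and passed to the native function together with an output buffer for the dictionary. The sketch only shows why a sample-list API on the managed side needs one copy into contiguous memory.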