burmanm / gorilla-tsc

Implementation of time series compression method from the Facebook's Gorilla paper
Apache License 2.0

Still being maintained? And a few PRs if so #18

Open jhovell opened 1 year ago

jhovell commented 1 year ago

I noticed this library is in use in a few OSS projects, but the issues are quite old.

I had a few PRs to make, but since there hasn't been any activity here in quite a while, I was wondering, @burmanm, if you were interested in passing the reins for maintenance elsewhere.

PR's I wanted to open:

Thanks for the effort to create this library; it's been quite useful for us!

burmanm commented 1 year ago

Hey,

Mostly I think the library is mature - don't want to touch it too much to not break stuff. However, for your first question, the 5 bit format is the old default one (if you use Compressor / Decompressor classes) and uses 5 bits: https://github.com/burmanm/gorilla-tsc/blob/master/src/main/java/fi/iki/yak/ts/compression/gorilla/Compressor.java#L173

The 6 bit one is the more advanced one (GorillaCompressor / GorillaDecompressor) as it uses slightly different storage format with some optimizations for performance and long values.
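To make the 5 bit versus 6 bit distinction concrete, here is a minimal sketch of how the width of the leading-zero field limits what it can record. The class and method names are hypothetical and are not part of gorilla-tsc; the clamping shown is an assumption about how an encoder might cope with a narrow field, not a description of the library's code.

```java
class LeadingZeroFieldDemo {
    /** Largest count that fits in a field of the given bit width. */
    static int maxCount(int fieldBits) {
        return (1 << fieldBits) - 1;
    }

    /** Hypothetical helper: clamps a leading-zero count so it fits in the field. */
    static int clampToField(int leadingZeros, int fieldBits) {
        return Math.min(leadingZeros, maxCount(fieldBits));
    }

    public static void main(String[] args) {
        // Two nearby integer samples stored as 64-bit longs differ only in the lowest bit.
        long xor = 1000L ^ 1001L;
        int leadingZeros = Long.numberOfLeadingZeros(xor);

        System.out.println("actual leading zeros: " + leadingZeros);
        System.out.println("5 bit field stores:   " + clampToField(leadingZeros, 5));
        System.out.println("6 bit field stores:   " + clampToField(leadingZeros, 6));
    }
}
```

With a clamped count, the zeros that did not fit in the header have to be written as part of the payload instead, which is one reason a wider field can help for long values.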

Changing the format is not as straightforward as it may seem as that would break any existing data and users. As for float instead of double, in theory you can achieve this already by keeping the bits in correct order (as the data is stored by using Double.doubleToRawLongBits(double)) and that way getting the proper compression benefits. There's a long value method format also, so you could just do Float.floatToRawIntBits(float) before calling the method. I'm not sure if creating yet another format derivative would be beneficial for that (I would have to test if there's any real storage savings to achieve).

jhovell commented 1 year ago

To explain my use case: the data I need to decode is already encoded by this Python project, so I can't really change the way the data is encoded.

https://github.com/ghilesmeddour/gorilla-time-series-compression

If there is a way to access the old 5 bit way of encoding values on the decoding side, I'll take a look at that and try it out. I was unaware it was possible.

As for float32, again if there is a way to achieve support there without changes I'll try that out. Maybe I could improve some documentation if so?

Either way, of course I would want compatibility to be preserved. I saw this as a different option/config that you could run in ... maybe that is what the non-gorilla classes are. I'll see if I can do some experiments and get it to work.

burmanm commented 1 year ago

Yes, you should be able to decompress the original 5 bit format using the Decompressor:

        ByteBufferBitInput input = new ByteBufferBitInput(byteBuffer);
        Decompressor d = new Decompressor(input);

For float32, it should be possible by doing the rawToBits conversion outside before pushing it to the compressor. I haven't tested, but along these lines:

GorillaCompressor compress = ...;
float input = 0.0f;
int bitsToStore = Float.floatToRawIntBits(input);
compress.addValue(timestamp, bitsToStore);

(might require some casts to long, but the general idea should be that). The principle here is that only the first 32 bits are used and rest are set to 0 even if we store it with the long method. When that happens, the XOR compression should not really care about the zeroes at the end. The small perf hit from using long vs int should not really matter here and there shouldn't be any compression ratio hit.
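As a hedged illustration of the "casts to long" remark, the sketch below widens a float's raw bits into a long and restores it afterwards. It uses only JDK calls; the class and helper names are hypothetical, and it does not call gorilla-tsc at all. One detail worth noting is that a plain `(long)` cast of a negative `int` bit pattern sign-extends, so the upper 32 bits are only zero if they are masked off.

```java
class FloatBitsDemo {
    /** Widens a float's raw 32-bit pattern into a long without sign extension. */
    static long floatToLongBits32(float value) {
        return Float.floatToRawIntBits(value) & 0xFFFFFFFFL;
    }

    /** Restores the float from the low 32 bits of a stored long. */
    static float longBitsToFloat32(long stored) {
        return Float.intBitsToFloat((int) stored);
    }

    public static void main(String[] args) {
        float original = -1.5f;

        long masked = floatToLongBits32(original);
        // A plain cast copies the sign bit into the upper 32 bits for negative floats.
        long signExtended = (long) Float.floatToRawIntBits(original);

        System.out.println(Long.toHexString(masked));        // bfc00000
        System.out.println(Long.toHexString(signExtended));  // ffffffffbfc00000
        System.out.println(longBitsToFloat32(masked));       // -1.5
    }
}
```

The masked value could then be passed to whichever long-based method the compressor exposes; whether that gives the expected compression ratio is untested here, as in the comment above.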

jhovell commented 1 year ago

> Yes, you should be able to decompress the original 5 bit format using the Decompressor:

In my use case I have hundreds of different values to track for a single timestamp, so encoding them as timestamp/value pairs isn't very efficient. I am using the ValueDecompressor directly. If I copy the existing ValueDecompressor class and change

storedLeadingZeros = (int) in.getLong(6);

to

storedLeadingZeros = (int) in.getLong(5);

... I can decode my data perfectly with code like the following:

FiveBitValueDecompressor vd = new FiveBitValueDecompressor(input); // the copied class with the 5 bit read
vd.readFirst(); // read the first value
vd.nextValue(); // read subsequent values

I don't see any equivalent class or setting I could use in the existing library to achieve this result. Decompressor / GorillaDecompressor seem to only work with timestamp/value pairs.
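The reason a one-character change from 6 to 5 matters can be sketched in isolation: the same bitstream yields a different leading-zero count, and every later read is shifted by one bit, when the reader assumes the wrong field width. The `BitReader` below is a hypothetical stand-in written for this example, not the library's `BitInput`.

```java
class HeaderWidthDemo {
    /** Minimal MSB-first bit reader over a string of '0'/'1' characters (illustrative only). */
    static final class BitReader {
        private final String bits;
        private int pos = 0;

        BitReader(String bits) {
            this.bits = bits;
        }

        /** Reads the next n bits as an unsigned value. */
        long getLong(int n) {
            long value = 0;
            for (int i = 0; i < n; i++) {
                value = (value << 1) | (bits.charAt(pos++) - '0');
            }
            return value;
        }
    }

    /** Reads a leading-zero count from a field of the given width. */
    static int readLeadingZeros(BitReader in, int leadingZeroBits) {
        return (int) in.getLong(leadingZeroBits);
    }

    public static void main(String[] args) {
        // A 5 bit field holding 12 (01100), followed by the first bit of the next field.
        String stream = "011001";

        System.out.println(readLeadingZeros(new BitReader(stream), 5)); // 12
        System.out.println(readLeadingZeros(new BitReader(stream), 6)); // 25
    }
}
```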

For the float32/float64 issue, a similar fix is needed. Again in ValueDecompressor, the hard-coded appearances of Long.SIZE need to be replaced with Integer.SIZE, and the leading zeroes field moves down by 1 bit, to either 4 bits or, if you're going with the "6 bit" encoding style, 5. I am not sure how I could achieve this with the current library.

My suggestion / PR idea would be to augment ValueDecompressor with 2 constructor options (2 additional constructors) where the leading zeros and number of bits of Int/Long are set in the constructor rather than hard coded. This approach would not affect the existing API or break any current users. Another idea would be to extend the class with float and 5 bit versions (and I guess a combo of both) but that seems like it would get out of hand. (And similar compression support would be logically added, though in my case it is not needed.)

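// Proposed additions; the `in` and `predictor` fields are the ones already in ValueDecompressor.
private final int leadingZeroes; // width in bits of the leading-zero count field, e.g. 5 or 6
private final int bitLength;     // width in bits of a stored value, e.g. Integer.SIZE or Long.SIZE

public ValueDecompressor(BitInput input) {
    this(input, new LastValuePredictor());
}

public ValueDecompressor(BitInput input, Predictor predictor) {
    // Current hard-coded values become the defaults, so existing callers are unaffected.
    this(input, predictor, 6, Long.SIZE);
}

public ValueDecompressor(BitInput input, int leadingZeroes, int bitLength) {
    this(input, new LastValuePredictor(), leadingZeroes, bitLength);
}

public ValueDecompressor(BitInput input, Predictor predictor, int leadingZeroes, int bitLength) {
    this.in = input;
    this.predictor = predictor;
    this.leadingZeroes = leadingZeroes;
    this.bitLength = bitLength;
}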

I just think this would increase the flexibility to interoperate with other libraries in real-world scenarios where implementations of Gorilla differ. Again, it is paramount not to break any existing functionality; the goal is only more flexibility so the library can be used in more use cases. Even something that let a user extend a class to fit their needs would be great, but with the hard-coded 6 and 64 that doesn't currently seem possible, so I have to copy the class and create a slightly different implementation.

What do you think?

jhovell commented 1 year ago

Bumping to see if there is any interest in a PR to address these cases above ^^ thank you 🙇

burmanm commented 1 year ago

I have no idea where my reply went, so I'll try to write again..

> I don't see any equivalent class or setting I could use in the existing library to achieve this result. Decompressor / GorillaDecompressor seem to only work with timestamp/value pairs.

The entire idea behind the algorithm is to store time series data in a streamed fashion. Each series should have its own data structure, as otherwise the XOR step would not be useful. If you store multiple different series one after another, the compression ratio will be pretty bad. Moreover, the best use case is data that doesn't change a lot, for example system monitoring data, which was the original target.
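A small sketch of why per-series storage matters: XORing consecutive values of one slowly changing series leaves very few meaningful bits, while XORing neighbouring values from two interleaved series leaves many more. The sample values and the names below are made up for illustration, and the bit count is only a rough proxy for compressed size; it ignores control bits and headers.

```java
class XorLocalityDemo {
    /** Bits between the first and last set bit of the XOR of two doubles' raw bit patterns. */
    static int meaningfulBits(double previous, double current) {
        long xor = Double.doubleToRawLongBits(previous) ^ Double.doubleToRawLongBits(current);
        if (xor == 0) {
            return 0; // identical values need no payload bits
        }
        return Long.SIZE - Long.numberOfLeadingZeros(xor) - Long.numberOfTrailingZeros(xor);
    }

    /** Sums the meaningful XOR bits over all consecutive pairs in the array. */
    static int totalMeaningfulBits(double[] values) {
        int total = 0;
        for (int i = 1; i < values.length; i++) {
            total += meaningfulBits(values[i - 1], values[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] cpu = {12.0, 12.0, 12.5, 12.0};
        double[] mem = {2048.0, 2048.0, 2048.0, 2304.0};
        // The same samples stored one after another in a single stream.
        double[] interleaved = {12.0, 2048.0, 12.0, 2048.0, 12.5, 2048.0, 12.0, 2304.0};

        int separate = totalMeaningfulBits(cpu) + totalMeaningfulBits(mem);
        System.out.println("separate streams:   " + separate);
        System.out.println("interleaved stream: " + totalMeaningfulBits(interleaved));
    }
}
```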

> If I copy the existing ValueDecompressor class and change

If you wish to use 5 bit leading zeroes, then the other class still does that:

https://github.com/burmanm/gorilla-tsc/blob/master/src/main/java/fi/iki/yak/ts/compression/gorilla/Compressor.java#L173

> For the float32/float64 issue, a similar fix is needed. Again in ValueDecompressor, the hard-coded appearances of Long.SIZE need to be replaced with Integer.SIZE, and the leading zeroes field moves down by 1 bit, to either 4 bits or, if you're going with the "6 bit" encoding style, 5. I am not sure how I could achieve this with the current library.

You can achieve the 32 bit integer number support by using the other class (with the downside of not being able to encode 64 bit values). The GorillaCompressor and GorillaDecompressor were intended to be modifications that provide the ability to store 64 bit values. If there's no need to do that, then use the original format. Is there a reason you need to use the newer 64 bit long format without wanting 64 bits?

Float/Double makes no difference in the algorithm requirements, they work fine with both 5 and 6 bits due to the way floating points are constructed. All that's stored is float/double in integer format, as bits - not as numbers.