Kappa-Dev / KappaTools

Tool suite for kappa models. Documentation and binaries can be found in the release section. Try it online at
http://kappalanguage.org/

Feature request: trace compression #599

Open hmedina opened 5 years ago

hmedina commented 5 years ago

Similar in spirit to #598, but requiring a whole lotta computer science!

A trace can be a very large file, notably as they're written in JSON. Traces of biological-style models at 1:1 scale frequently exceed tens, sometimes hundreds, of gigabytes. Unless the devs come up with a less verbose representation, some traces require dedicated disk drives...

Having a more succinct representation of traces would allow bigger and more ambitious models. One avenue is compression. As traces are streamed to disk, compressing them would require stream compression. Likewise, trace querying would require streaming from the compressed trace, without requiring the file to be fully decompressed.

yarden commented 5 years ago

Part of the problem is that JSON isn't really the right format for a trace. But if one were to stick with something like JSON, there's a hacky solution that doesn't require any sophisticated computer science.

One could simply output a gz-compressed trace (that's trivial), which would give a ~100x reduction in file size, as we discussed. But instead of using plain JSON, Kappa would simply output a line-separated JSON format (which, technically speaking, isn't JSON).

Right now, traces are just one big blob (which highlights the fact that JSON was not meant to be used for such large data, in my view). If you instead separate it out by lines, it is trivial to read the trace in streaming fashion: you just read the trace line by line. It's easy to read a gz file in streaming fashion, so you'd never need to fully uncompress the trace. Of course, your trace query language or whatever means of querying the trace would have to work in streaming fashion too.
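To make that concrete, here is a rough sketch in Python of what writing and reading such a line-separated, gz-compressed trace could look like (KaSim itself is OCaml; the file name and event fields below are invented for illustration):

```python
import gzip
import json

# Sketch only: stream one JSON event per line straight into a gzip file as
# events are produced, then read them back lazily. "trace.json.gz" and the
# event fields are made up; the real content is whatever KaSim emits.

def write_events(events, path="trace.json.gz"):
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for event in events:
            out.write(json.dumps(event) + "\n")  # one event per line, compressed on the fly

def read_events(path="trace.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as inp:
        for line in inp:  # decompressed incrementally, line by line
            yield json.loads(line)

if __name__ == "__main__":
    write_events({"step": i, "rule": "r%d" % (i % 3)} for i in range(1000))
    print(sum(1 for _ in read_events()))  # the full trace is never held uncompressed
```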

Also related to this: https://groups.google.com/forum/#!topic/kappa-users/fCH1pdQeLQQ

P.S. There are libraries that try to provide streaming access to JSON files (though in my opinion they are clunky), like ijson, so theoretically there's an immediate solution that wouldn't require any change to Kappa: you'd gz-compress the trace by hand, which would save space, and then read from that compressed file in streaming fashion using something like ijson and compute what you need from the trace. If you're using the Trace Query Language, I'm assuming it'd need to be adapted. To have alternatives to TQL, we'd need docs for the trace format (https://github.com/Kappa-Dev/KaSim/issues/601).
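Something along these lines (a sketch only; the "trace.item" prefix is a guess at the layout, precisely because the trace format isn't documented yet):

```python
import gzip
import ijson  # third-party streaming JSON parser: pip install ijson

# Sketch: iterate over events of an existing, hand-compressed monolithic
# JSON trace without decompressing it to disk or loading it into memory.
# The "trace.item" prefix assumes the events sit in a top-level array under
# a "trace" key, which is a guess (see the trace-format docs issue above).
with gzip.open("trace.json.gz", "rb") as f:
    n_events = 0
    for event in ijson.items(f, "trace.item"):
        n_events += 1  # each event is parsed, used, then discarded
    print(n_events)
```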

hmedina commented 5 years ago

A trace with linebreaks would be great for user-inspection! Next time I get an error, the line number will actually mean something, instead of being line 0, character 954357852515 lol

In my view, the central problem is that the whole trace may not fit in memory, and the contents are not known when the file is created. That means one can't have a good dictionary / compression table at startup. And the trace may not fit on disk, so I can't compress the file, because I never get to write the file... I guess buying an external SSD is cheaper and faster than some dev time

When reading about stream compression (in a probably outdated wikibook...), people mentioned the existence of block-less adaptive / converging streams, where the dictionary / compression table changes through the stream, getting better as it learns the shape of the data in the file. The absence of blocks sounds to me like it would avoid arbitrary discretization of the file, which would be required for trace querying (which already streams through traces).

Ideally, KaSim would start producing a trace file with some initial dictionary based on JSON structure, and append to that compressed file as more events happen, updating the compression table as it goes. That way the uncompressed trace never needs to be seen.
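That is roughly what DEFLATE already does: its window adapts as data flows through, and zlib even accepts a preset dictionary. A toy Python sketch of the idea (the preset dictionary, file name, and event fields are all made up; KaSim would do this in OCaml):

```python
import json
import zlib

# Toy sketch of an adaptive, block-less compressed trace: a single zlib
# stream (not a .gz file) seeded with a preset dictionary of common JSON
# substrings, flushed after every event so the file on disk is always
# readable while the compressor keeps adapting to the data.
PRESET = b'{"step": , "rule": "binding"}{"step": , "rule": "unbinding"}'

compressor = zlib.compressobj(zdict=PRESET)
with open("trace.zz", "wb") as out:
    for i in range(10_000):
        event = {"step": i, "rule": "binding" if i % 2 else "unbinding"}
        chunk = (json.dumps(event) + "\n").encode()
        out.write(compressor.compress(chunk))
        out.write(compressor.flush(zlib.Z_SYNC_FLUSH))  # event is on disk, stream stays open
    out.write(compressor.flush())  # finish the stream at the end of the simulation

# Reading is symmetric: feed compressed bytes through one decompressobj and
# split on newlines, so the uncompressed trace never exists in full.
decompressor = zlib.decompressobj(zdict=PRESET)
with open("trace.zz", "rb") as inp:
    pending = b""
    while chunk := inp.read(64 * 1024):
        pending += decompressor.decompress(chunk)
        *lines, pending = pending.split(b"\n")
        for line in lines:
            json.loads(line)  # one event at a time
```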

yarden commented 5 years ago

> A trace with linebreaks would be great for user-inspection! Next time I get an error, the line number will actually mean something, instead of being line 0, character 954357852515 lol

> In my view, the central problem is that the whole trace may not fit in memory, and the contents are not known when the file is created. That means one can't have a good dictionary / compression table at startup.

That's not really true, at least not if you consider ~100x compression good enough, which is what you'd get with gz. I consider that to be the minimal, least sophisticated compression.

> And the trace may not fit on disk, so I can't compress the file, because I never get to write the file... I guess buying an external SSD is cheaper and faster than some dev time

With the trivial change to KaSim, you'd only ever write the gz compressed version. The point is that if you have a ~1 GB trace, there's never a reason for it to occupy more than tens of megabytes. You don't need to write the full file to compress it; you just have KaSim serialize a compressed file.

> When reading about stream compression (in a probably outdated wikibook...), people mentioned the existence of block-less adaptive / converging streams, where the dictionary / compression table changes through the stream, getting better as it learns the shape of the data in the file. The absence of blocks sounds to me like it would avoid arbitrary discretization of the file, which would be required for trace querying (which already streams through traces).

> Ideally, KaSim would start producing a trace file with some initial dictionary based on JSON structure, and append to that compressed file as more events happen, updating the compression table as it goes. That way the uncompressed trace never needs to be seen.

The uncompressed trace never needs to be seen even with gz.

hmedina commented 5 years ago

>> A trace with linebreaks would be great for user-inspection! Next time I get an error, the line number will actually mean something, instead of being line 0, character 954357852515 lol
>>
>> In my view, the central problem is that the whole trace may not fit in memory, and the contents are not known when the file is created. That means one can't have a good dictionary / compression table at startup.

> That's not really true, at least not if you consider ~100x compression good enough, which is what you'd get with gz. I consider that to be the minimal, least sophisticated compression.

How do you compress something that doesn't exist yet? Don't you have to first create the trace block, then compress it? Or can the trace block come into being already compressed?

>> And the trace may not fit on disk, so I can't compress the file, because I never get to write the file... I guess buying an external SSD is cheaper and faster than some dev time

> With the trivial change to KaSim, you'd only ever write the gz compressed version. The point is that if you have a ~1 GB trace, there's never a reason for it to occupy more than tens of megabytes. You don't need to write the full file to compress it; you just have KaSim serialize a compressed file.

>> When reading about stream compression (in a probably outdated wikibook...), people mentioned the existence of block-less adaptive / converging streams, where the dictionary / compression table changes through the stream, getting better as it learns the shape of the data in the file. The absence of blocks sounds to me like it would avoid arbitrary discretization of the file, which would be required for trace querying (which already streams through traces).
>>
>> Ideally, KaSim would start producing a trace file with some initial dictionary based on JSON structure, and append to that compressed file as more events happen, updating the compression table as it goes. That way the uncompressed trace never needs to be seen.

> The uncompressed trace never needs to be seen even with gz.

What was the input to gz, if it is not the uncompressed trace?

yarden commented 5 years ago

> How do you compress something that doesn't exist yet? Don't you have to first create the trace block, then compress it? Or can the trace block come into being already compressed?

You write to a file as you're creating the thing, except the file you're writing to is gz. See this for a Python example. It's quite standard in Unix to pipe things into and out of gzip files line by line (see here). This is widely used in bioinformatics and it applies here. There's never any reason, for instance, to keep a FASTA/FASTQ sequence file in uncompressed form. You can always keep it in gz and uncompress it on the fly. And obviously nobody loads gigabytes of sequence data into memory.
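As a sketch of that pipe pattern (the one-event-per-line layout and the "rule" field are assumptions on my part), a query script doesn't even need to know the trace is compressed if gzip does the streaming:

```python
# count_rules.py -- toy consumer for the Unix-pipe pattern, e.g.:
#
#     zcat trace.json.gz | python count_rules.py
#
# Assumes one JSON event per line; the "rule" field is hypothetical.
import json
import sys

counts = {}
for line in sys.stdin:
    event = json.loads(line)
    rule = event.get("rule", "?")
    counts[rule] = counts.get(rule, 0) + 1
print(counts)
```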

>> The uncompressed trace never needs to be seen even with gz.

> What was the input to gz, if it is not the uncompressed trace?

The trace structure is built up through time by KaSim, obviously, and as it is being created you keep writing it to a gzip file. The result is a gzip file. The uncompressed trace never exists in full form either in memory or on disk.

hmedina commented 5 years ago

Ah ok, we're talking about the same thing; I wasn't aware the standard linux tool was capable of streaming. Hopefully there is an OCaml library with such functionality.

pirbo commented 5 years ago

There is (it's called 'OCaml-decompress') and I started to look into it, but this is vacation time, so I'll only hack something together in mid-September. :-) Your suggestions are very valuable anyway (I love the idea of optionally populating a database instead of a folder on the disk!). Don't stop making them, but be aware that, even though I do care, I'm offline for a while...
