asciinema / discussions

Public open-ended discussions about asciinema
https://github.com/orgs/asciinema/discussions

Improved compression #85

Closed bkil closed 1 year ago

bkil commented 2 years ago

The present encoding is a good YAGNI proof of concept, and brotli compresses it reasonably well. It is good enough for command-line demos or short recordings. However, it can consume considerable storage and bandwidth for interactive real-time applications such as long multiplayer game sessions.

Hence, some more advanced algorithms could be tested, inspired by the MPEG family of compression methods. For example, we could memorize B/P reference frames for a certain number of seconds and store only differences against those.

We could compress and even store the colors and character styling separately from text content. This would also allow viewers an option for a low-fi, bandwidth-efficient access method.

The recorder could offer an FPS-cap option that merges changes occurring in rapid succession into joint updates. The cut points could also be chosen adaptively for extra gains.

polarathene commented 2 years ago

It would be good to include examples of what you want to optimize compression-wise, and what the current solutions offer. Then, as you experiment with improving on that, there's a good baseline to compare against.

I've only had experience with short CLI example recordings, where a 100KB GIF was replaced by a 114KB SVG, but with gzip encoding (since GitHub is the asset delivery source) it was 100KB vs 7KB compressed.

Have you compared your larger recordings with SVG animations? There is a discussion issue for supporting that better here.

If you have a larger set of data, you could also see how well it compresses with zstd with a dictionary trained against the input (be that SVG or a different format). That may be simpler to implement/maintain depending on requirements.
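For reference, here's roughly what that looks like with the Rust `zstd` crate (a sketch; the `casts/` directory, file name and level are made up, and the `zstd --train` CLI does the same job without code):

```rust
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Gather sample recordings to train on (illustrative path).
    let samples: Vec<Vec<u8>> = std::fs::read_dir("casts")?
        .filter_map(|entry| entry.ok())
        .map(|entry| std::fs::read(entry.path()))
        .collect::<Result<_, _>>()?;

    // Train a 16 KiB dictionary; fails if there are too few samples.
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // Compress one recording with the trained dictionary.
    let data = std::fs::read("casts/session.cast")?;
    let mut enc = zstd::Encoder::with_dictionary(Vec::new(), 19, &dict)?;
    enc.write_all(&data)?;
    println!("{} -> {} bytes", data.len(), enc.finish()?.len());
    Ok(())
}
```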

bkil commented 2 years ago

See an example recording here:

I wouldn't call it "huge", but after reading through the JSON, it leaves a lot to be desired. I specifically used a tiny resolution and recorded many sessions to select the smallest and shortest one.

SVG is out of the question for me, as the whole selling point of asciinema is to also be able to play back the recordings from the terminal ("a tildeverse alternative to YouTube") and to maintain the ability to copy & paste.

I know this comparison is not fair, but the curseofwar executable is just 50kB, so if our shared gaming sessions weigh in at megabytes, it feels a bit wrong. Surely it would be most compact if the game supported replays itself, with the replay delivered together with the game in wasm, but that's beside the point.

polarathene commented 2 years ago

and to maintain the ability to copy & paste.

You can still copy/paste text content with the SVG, although for the animation you'd need to pause it as it just animates through frames (via CSS).

the whole selling point of asciinema is to also be able to play back the recordings from the terminal

That's fine too :+1:


I wouldn't call it "huge", but after reading through the JSON, it leaves a lot to be desired. I specifically used a tiny resolution and recorded many sessions to select the smallest and shortest one.

What are your expectations? The 4.5MiB .cast file was compressed to a little over 700KiB (gzip). With zstd that goes down to around 370KiB. That's insufficient data to train zstd on properly, but I got it down to 310KiB with a trained dictionary.

I wanted to compare against an SVG animation, but that tool seems very inefficient: it used over 4GB of RAM processing the file, at which point it failed.

I know this comparison is not fair, but the curseofwar executable is just 50kB, so if our shared gaming sessions weigh in at megabytes, it feels a bit wrong.

Alphanumeric characters alone don't carry much file-size weight, though large documents of text will naturally be bigger; you're not going to get extreme efficiency, since the associated data needed to reconstruct the content is not part of the alphanumeric characters themselves.

Likewise for a game: its disk footprint is different from its RAM usage. A running game only needs memory for the current state, not its history, whereas your recording does need to retain enough information to reproduce it. It's a different story with 3D games, where game sizes and memory use can be huge compared to a 2D video (or to the data for recording a replay internally, which could be even lighter). The point being, the disk/memory requirements of the game itself don't say much about the size of a recorded session.


I am not familiar with the asciinema .cast format, but it looks like shell output per line, with only the first line containing a JSON header object. After that it's just [<timestamp>, "o", <terminal output line>].

You could certainly optimize that format, but I'm not sure how much more efficiently it would compress. <timestamp> would be better moved to the start of the file as a series of deltas, or you could probably convert them to frame timings. To seek, you could look up a timestamp within the start/end range and use its index to find the binary offset of the corresponding output. The data would probably compress better this way, as a Struct of Arrays rather than an Array of Structs.
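To illustrate, a naive restructuring pass might look like this (my own throwaway layout, not any existing format; assumes `serde_json` and only handles the happy path):

```rust
use serde_json::Value;

/// Struct-of-Arrays layout: all timings together, all payloads together.
struct Soa {
    deltas_ms: Vec<u32>,  // per-event time deltas, in milliseconds
    outputs: Vec<String>, // terminal output payloads
}

fn to_soa(cast: &str) -> Soa {
    let mut soa = Soa { deltas_ms: Vec::new(), outputs: Vec::new() };
    let mut prev = 0.0_f64;
    // Skip the JSON header line; each later line is [ts, "o", data].
    for line in cast.lines().skip(1) {
        let Ok(Value::Array(ev)) = serde_json::from_str::<Value>(line) else {
            continue;
        };
        if ev.len() < 3 || ev[1] != "o" {
            continue; // only handle output events here
        }
        if let (Some(ts), Some(data)) = (ev[0].as_f64(), ev[2].as_str()) {
            soa.deltas_ms.push(((ts - prev) * 1000.0) as u32);
            soa.outputs.push(data.to_owned());
            prev = ts;
        }
    }
    soa
}
```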

<terminal output line> is mostly text; you could extract the colours and such if you like, but I think a compressor like zstd can pick up on common repetitions and encode/decode that for you. I'm assuming there is other information (metadata) in this line too? One benefit of extracting that out might be that repeated characters would compress better.

With fixed width/height, you could also try encoding rows or tiles, to cache/recycle unchanged portions (assuming that content for such in the current format is repeated), but again this is probably something a compressor with a large enough window may already handle partially? Which is probably similar to the video encoding techniques you referenced earlier.
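A naive sketch of that row-recycling idea (this assumes you already have full frames as fixed-size character grids, which in practice requires terminal emulation, as discussed further down the thread):

```rust
/// Keep only the rows that changed since the previous frame, as
/// (row index, new content) pairs; unchanged rows are recycled.
fn diff_rows(prev: &[String], curr: &[String]) -> Vec<(usize, String)> {
    curr.iter()
        .enumerate()
        .filter(|(i, row)| prev.get(*i) != Some(*row))
        .map(|(i, row)| (i, row.clone()))
        .collect()
}
```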


So 4.5MiB to 310KiB is the current compression result (closer to 400KiB without a trained dictionary). Is that really not sufficient for you?

Especially since AFAIK zstd can also stream/seek that compressed data? (not the seek you'd expect for playback exactly, but could possibly be adapted)

How much more efficiency do you want beyond that, and is it worth the time/effort to attempt it? It definitely looks viable if you switch to an SoA format with a binary representation instead of text.

polarathene commented 2 years ago

I managed to convert the 4.5MB .cast to SVG via a different project, termtosvg; it took a while (around 10 minutes?) and about 2GB of RAM.

The SVG output was about 82MB, and attempting to clean that up with SVGO used over 3GB of RAM; I chose to terminate it before it could complete/fail.

zstd compression (level 3) got it down to around 3MB. I didn't continue further, as the SVG size and the memory required for conversion seemed pretty bad. That .cast file was over 39k lines for a 5 min duration, vs my own much smaller 9KB .cast with only 200 lines for a 30 second recording :sweat_smile:

Just thought I'd share that, even though SVG was ruled out earlier for your requirements.

bkil commented 2 years ago

Motivation

I'm trying to address multiple business requirements with this optimization. Even now, while asciinema is not widespread and has few users, storage costs have already made themselves felt in operations and when enacting limits. Imagine if it reached the scale of YouTube - it sure couldn't be hosted for gratis.

If a hobbyist wanted to self-host such a service at no ongoing cost, they would probably have web hosting with about 50-500MB at their disposal. If you also store other materials, this would barely suffice to store a single capture of the most well-known ascii games, let alone multiple strategies or multiplayer matches. Sharing a number of matches or a whole playthrough would call for application-specific compression, with gains of orders of magnitude. And we haven't even mentioned non-games, or hosting content for your friends.

Such a system also begs to be used by those who are very low on monthly data allowance to replace watching streaming video tutorials. For them, every megabyte literally counts. Whole countries could gain from this, along with offline-first communities where you preload every educational material during installation and update the set at most annually. If you assume an 80GB HDD, that's not a lot of material.

General compression

I think you are pushing zstd too far (but then, why not go all the way with PAQ?). However, compression could be improved quite a bit by the compression filters you hinted at above, separating the delta-timing parts from the terminal characters.

We could improve it yet again if the terminal characters could be stored by similarly zoning the stream:

- We could introduce a few common high level drawing commands, for example scrolling text, blinking, or only changing colors of parts
- storing multiple reference frames and transmitting the diff of either the textual or chromatic content.

Note that as ncurses updates the screen with a minimized diff of commands, it already counts as a kind of poor pre-compression, and pre-compressed data is prone to compress badly in further passes. So it might also improve overall compression if we compressed full frames of the console for applications that internally refresh the whole screen based on a timer or on user input.

Domain specific compression

For general compression, my suggestion above was to introduce tunable and adaptive lossiness to the process in the temporal dimension:

- We could quantize frames to produce less frames per second by aggregating screen content - relatively easy

Application specific compression

What is the state of the art, you ask? The curseofwar executable itself compresses to ~20kB (plus libncurses and libtinfo, but those would need to be reimplemented in JavaScript anyway). It could be stored once per server, and then a recording may reference it by hash. Its simulation is deterministic in that it can be seeded with srand() via an argument. If the executable can be pinned, there is no reason to complicate the replay format much further (i.e., by attempting to describe the state space instead).

The generated output depends only on which key a user presses and at which simulation step, taking ~3 bits: left, right, up, down, build, flag, (remove all flags, remove 50% of flags). As user actions are sparse, the timing could be encoded via a variable-length discrete step delta as well (with arithmetic coding). Sharing the exact cursor movement is usually not important, as players are usually doodling while they wait for resources or expansion, although the recorder might sometimes want to show population thresholds at certain key locations and times during a match.

So, for most use cases, the representation could be reduced to a coordinate (~9 bits) vs. action (1 bit) along with timing. The example included 2418 simulation steps (why not that many rows in the cast, then?). According to grep, it contained about 33 build commands and at most 326 flags (a loose upper bound; the real count is less). I think ~300 total commands invoked (build, flag and unflag) is a reasonable estimate. That would make the average delta time ~10 frames, taking, say, 5 bits on average. That weighs ~15 bits/action, so a replay would take about 600 bytes, along with the executable and its parameters (including the random seed). That's the ballpark lower bound for the entropy such a replay should be backed by.
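A quick back-of-envelope check of those numbers (all figures are the estimates from above, not measurements):

```rust
fn main() {
    let actions = 300;   // estimated build/flag/unflag commands per match
    let coord_bits = 9;  // cursor coordinate on a small map
    let action_bits = 1; // build vs. (un)flag
    let delta_bits = 5;  // average variable-length step delta
    let total_bits = actions * (coord_bits + action_bits + delta_bits);
    // 300 * 15 = 4500 bits ≈ 563 bytes, i.e. the ~600 byte ballpark.
    println!("{} bits ≈ {} bytes", total_bits, total_bits / 8);
}
```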

polarathene commented 2 years ago

If a hobbyist wanted to self-host such a service at no ongoing cost, they would probably have web hosting with about 50-500MB at their disposal.

For $5 a month you get plenty of storage, and if you need more you can scale that. Additionally, if the increase in latency is acceptable, one can pull from storage on their own local machine (e.g. a cheap Raspberry Pi connected to an external USB SSD or a NAS). You could even host the entire server locally instead of just the storage, if preferable.

Then there's the cost of a domain, whose annual renewal you can also avoid if you prefer. Or you can just go with an IP, which is a non-issue if you have a client app instead of direct browser access. It really just depends on your needs, scale and audience.

If you have a large community, you can also leverage torrents. In my experience managing a community server where users upload game mods and related content, bandwidth is a bigger concern for us than storage: we've used 200GB or so over the past 6 years, with an average peak of 1k users online at the same time, yet we can exceed 10TB of traffic a month. That said, our content is considerably larger in size than what you're interested in.


I think you are pushing zstd too far (but then, why not go all the way with PAQ?).

PAQ, if I recall correctly, is much slower to compress/decompress? zstd is not only good at compression ratio (not the best); it's notably better at the speed of doing so. And like I mentioned, there is streaming + seek support for decompression.

I don't see how it's pushing it too far? The example you provided of 4.5MiB was reduced down to 310KiB, 7% of the original size already.


So it might also improve overall compression if we compressed full frames of the console for applications that internally refresh the whole screen based on a timer or on user input.

Yes, I can understand how that could be beneficial.


so a replay would take about 600 bytes, along with the executable and its parameters (including the random seed). That's the ballpark lower bound for the entropy such a replay should be backed by.

Ok, so to clarify:

Is it still intended to use the asciinema cast format as input, or are you intending to implement an entirely new recorder + format?


I wrote a little program in Rust to process your linked cast. Nothing special beyond restructuring the data a bit and using binary encoding. I naively de-duped the content into unique blocks for lookups (those uniques still have a fair amount of duplicated content across them):

Stats:

A good amount of weight was trimmed off because the escape code prefix \u001b (5 bytes) is used so frequently, but zstd or other compressors would likely have taken care of that automatically.

Prior to merging lines into frame timings, there were a few lines that split ANSI escape codes across two lines. Encoding time/content losslessly gives about 1.5MiB, or 370KiB with zstd compression. So most of the win, I guess, came from reducing the timestamps to frame numbers with lossy encoding, since zstd could already manage that much on the JSON lines without my program :/

Let me know if the source code would be of interest to you (it's inefficient/rough, but enough to perhaps play around a bit).

ku1ik commented 1 year ago

Finally got around to reading this thread. Lots of interesting ideas and background here; I think I get where you're coming from.

First, I think you're oversimplifying what writing to a terminal actually does @bkil :)

Each individual write, recorded by asciinema as a [ts, "o", "hello"] event, is a stream of commands for a terminal. In fact there are 5 commands here, one for each letter. Sometimes letters are mixed with escape characters, longer control sequences etc., but in general the data is fed into a non-trivial state machine:

[Figure: state diagram of the DEC ANSI-compatible terminal parser]

From: https://www.vt100.net/emu/dec_ansi_parser

We could compress and even store the colors and character styling separately from text content. This would also allow viewers an option for a low-fi, bandwidth-efficient access method.

When printing to a terminal, color attributes are not assigned directly to text; instead, every terminal keeps a "current pen color" that is altered by SGR sequences (commands), which affect text printed in the future rather than any specific text. You can't reliably separate those without performing actual terminal emulation yourself, using a virtual screen buffer (this is what asciinema-player does).
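For instance (a trivial illustration, not asciinema code), the red below is never attached to the word; it's a pen-state change that merely precedes it in the stream:

```rust
fn main() {
    // SGR 31: set the pen to red (a state change in the emulator)
    // "error": printed with whatever the current pen happens to be
    // SGR 0: reset the pen
    print!("\x1b[31merror\x1b[0m plain");
}
```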

The recorder could offer an FPS-cap option that merges changes occurring in rapid succession into joint updates. The cut points could also be chosen adaptively for extra gains.

Similar to the above, you can't merge updates without actually interpreting the data with an emulator if you want to reduce the bandwidth (while reducing the frame rate). Of course you can just concatenate subsequent [ts1, "o", "hello "] + [ts2, "o", "world"] into [ts1, "o", "hello world"], but this doesn't bring dramatic size reduction.
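The naive concatenation variant fits in a few lines (a sketch assuming events are already parsed into (timestamp, output) pairs):

```rust
/// Merge events that fall within the same frame window (e.g. fps = 30.0).
/// This caps the event count but keeps every payload byte, so the
/// bandwidth saving is limited, as noted above.
fn merge_events(events: Vec<(f64, String)>, fps: f64) -> Vec<(f64, String)> {
    let frame = 1.0 / fps;
    let mut merged: Vec<(f64, String)> = Vec::new();
    for (ts, data) in events {
        match merged.last_mut() {
            Some((last_ts, last_data)) if ts - *last_ts < frame => {
                last_data.push_str(&data); // same window: append payload
            }
            _ => merged.push((ts, data)),
        }
    }
    merged
}
```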

would be better moved to the start of the file as a series of deltas, or you could probably convert them to frame timings.

This is how it pretty much was in asciicast v1 format: https://github.com/asciinema/asciinema/blob/develop/doc/asciicast-v1.md#frame

Full timestamps in v2 were chosen for several reasons (I can dig up the PR that includes the reasoning if you want), and this change might have brought slightly bigger file sizes, but I'd say the increase is minor in practice.

With fixed width/height, you could also try encoding rows or tiles, to cache/recycle unchanged portions (assuming that content for such in the current format is repeated)

We could quantize frames to produce less frames per second by aggregating screen content - relatively easy

Again, this would require emulation with a virtual screen buffer, which currently is not part of the CLI. It's hard/impossible to make sense of a line/cell (the char grid in general) just by "extracting" some data from the output stream - you need to interpret the control sequences, which is what terminals do.

Imagine if it reached the scale of YouTube - it sure couldn't be hosted for gratis.

I wish! But it's never gonna happen, it's too niche :)

Such a system also begs to be used by those who are very low on monthly data allowance to replace watching streaming video tutorials. For them, every megabyte literally counts.

As I mentioned in https://github.com/asciinema/asciinema/issues/516#issuecomment-1510423840 - the average recording size on asciinema.org is 154kb uncompressed (average of 468092 uploads). Then, when served to the browser, it's sent gzipped. Asciicasts compress super well, to roughly 7% of the original file size with gzip. This means that on average the browser fetches 11kb of data to play a full recording.

Your recordings seem to be on the bigger side, but still, you can store them on disk compressed, and you can serve them in compressed form to the browser. If you choose gzip compression, you can tell the webserver, e.g. nginx, to serve those .gz files as-is with a Content-Encoding: gzip header. asciinema-server and https://asciinema.org serve asciicasts gzipped by default.
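For example, with nginx's standard ngx_http_gzip_static_module (a sketch; the location path is arbitrary):

```nginx
location /casts/ {
    # Serve a pre-compressed file.cast.gz as-is when present, instead of
    # compressing on the fly; nginx sets Content-Encoding: gzip itself.
    gzip_static on;
}
```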

We could introduce a few common high level drawing commands, for example scrolling text, blinking, or only changing colors of parts

This is what control sequences are for, and indeed we have ones for scrolling and blinking, among many others.

storing multiple reference frames and transmitting the diff of either the textual or chromatic content.

In fact, the data in each [ts, "o", data] line sometimes is a minimal diff, if the application running in the terminal is optimized for that (as you mentioned, ncurses tries to perform the minimal amount of updates, although not all applications use ncurses). In theory it would be possible to optimize the data to contain a minimal diff, if 1) we included an embedded terminal emulator (e.g. avt) in the recorder and 2) such an emulator had the ability to produce a minimal diff between 2 screen states.

I think the "Application specific compression" description finally clicked for me, and I understood your use case and the ideas you presented. And I believe I have a promising solution for you:

Custom recording format parser

With the above you could create a 600-byte file (not with the asciinema recorder, unfortunately, although you could write a script to convert an asciicast file into your own efficient format) and implement a custom parser that interprets those 600 bytes in your application-specific way. You would need to translate this bespoke encoding of yours into regular terminal control sequences for cursor movement and color management, but it shouldn't be too hard. See the Star Wars example in the linked parser doc.
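The translation step could be as simple as mapping each decoded replay event to a cursor-positioning sequence, roughly like this (a hypothetical sketch; the function name and event shape are made up):

```rust
/// Emit a CUP (cursor position) sequence followed by the cell content.
/// CUP is 1-based: ESC [ row ; col H
fn event_to_ansi(x: u16, y: u16, glyph: char) -> String {
    format!("\x1b[{};{}H{}", y + 1, x + 1, glyph)
}
```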