image position/extend addressing mode and metrics

For placing and sizing an image there are different ways/metrics possible.

preliminary considerations

As prior art we only have the SIXEL way, plus the newer SIXEL implementations (somewhat broken in this regard).

The SIXEL way:

initial position is determined either by origin of active text cell (SIXEL_SCROLL=on) or as origin of viewport (SIXEL_SCROLL=off)
image comes with size in pixels
a pixel maps to a certain screen extent based on inches (similar to pt in CSS)
the screen extent setting is provided in every sequence
text cells themselves are meant to be sized in that screen extent, with a fixed width/height ratio

Newer SIXEL implementations ignore these settings and instead map pixels 1:1 to screen pixels. While this enables pixel perfect output, it makes autoscaling based on certain TE settings impossible.

To not end up in the same trap, we need to find a good metrics abstraction, that tries to respect:

preserve text-image proportions across different font sizes (cell coverage must not change)
preserve inner image proportions (a drawn circle must not turn into an ellipse); this is actually a hard one for font size changes that don't scale in a perfect 1:2 ratio, and probably needs a keepAspectRatio flag on the sequence
the sequence must not change because the font size changed in the TE
How about subcell addressing?

In a TE we basically have these dimensions given:

viewport extent in number of cells (COLS x ROWS)
final TE with output representation: cell size in screen pixels (close to font-size/2 : font-size)
multiplexer: no predefined cell size (not yet?)

This means we have only 2 reliable metrics to count on: a cell as base unit, or the viewport extent in number of cells.

Proposal 1: relative positioning and sizing to cell size

A naive way to deal with the limited entry metrics we have is to base all positioning and extent calculations on discrete cells only:

origin: at current active cell (top-left corner) or given by some xy values denoting cells
extent: number of cells in xy direction (bottom-right corner)
image scales in either direction to fit (plus a keepAspectRatio flag and align modes for images that should preserve image proportions)
no explicit image pixel to screen pixel relation (it is calculated implicitly from the extent and alignment settings)
easy to comprehend for app devs - everything gets scaled in terms of text cells
:heavy_check_mark: preserves cell coverage
:heavy_check_mark: preserves image proportions with keepAspectRatio flag
:heavy_check_mark: sequence is stable across font size changes
:x: no direct subcell addressing possible (it can still be faked by offsetting the pixels in the image, but that is unreliable because the cell extent is unknown in the first place)

To also allow subcell addressing, this approach could be extended to address fractions of a cell. If done as float values, the size notions will behave pretty much like rem in CSS, with the difference of having distinct bases for width (1 = one text cell width) and height (1 = one text cell height). But I'm not sure yet whether we really need subcell addressing at all.
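To make Proposal 1 a bit more concrete, here is a minimal sketch of how a TE might turn a cell-based extent into screen pixels. The helper name `fit_to_cells`, its parameters, and the centered alignment are assumptions for illustration only; nothing here is part of an agreed sequence format. `cols` and `rows` may be fractional, which covers the subcell variant.

```python
def fit_to_cells(img_w, img_h, cols, rows, cell_w, cell_h, keep_aspect=True):
    """Hypothetical helper: place an image into a box of cols x rows cells.

    Returns (x_off, y_off, out_w, out_h) in screen pixels, relative to the
    top-left corner of the origin cell. cols/rows may be fractional.
    """
    # The target box is defined purely in cells; the real cell size only
    # enters here, on the TE side.
    box_w = round(cols * cell_w)
    box_h = round(rows * cell_h)
    if not keep_aspect:
        # Stretch in both directions to fill the box exactly.
        return (0, 0, box_w, box_h)
    # Uniform scale so the image fits inside the box without distortion.
    scale = min(box_w / img_w, box_h / img_h)
    out_w = round(img_w * scale)
    out_h = round(img_h * scale)
    # Assumed align mode: centered inside the box.
    return ((box_w - out_w) // 2, (box_h - out_h) // 2, out_w, out_h)


# Same request (10 x 10 cells) at two different font sizes:
print(fit_to_cells(200, 100, 10, 10, 8, 16))   # (0, 60, 80, 40)
print(fit_to_cells(200, 100, 10, 10, 10, 20))  # (0, 75, 100, 50)
```

In both calls the image covers 10 columns by 2.5 rows, so cell coverage stays stable while the request itself never mentions pixels.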

Proposal 2: absolute positioning and pixel sizing in terms of a custom unit

Implement something similar to SIXEL's way with units and explicit pixel sizing:

image pixel: some size in terms of units
text cell size: some size in terms of units
origin: at current active cell (upper-left corner) or given by some xy values denoting unit offset from top-left of viewport
extent: not explicitly set (uses image pixels instead)
changing font size: image pixels are resized in terms of units, keeping cell size stable
again a keepAspectRatio flag and align modes can indicate, whether to keep image proportions.
:heavy_check_mark: preserves cell coverage (given if cell size is kept stable)
:heavy_check_mark: preserves image proportions with keepAspectRatio flag
:heavy_check_mark: sequence is stable across font size changes
:heavy_check_mark: allows stable direct subcell addressing, since we have a virtual cell size defined beforehand
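A rough sketch of the unit indirection in Proposal 2, under the assumption that both the image pixel size and the text cell size are given as (width, height) pairs in some virtual unit. The function names and the tuple layout are made up for this example.

```python
def units_to_cells(units_x, units_y, cell_units):
    """Convert an offset/extent in virtual units to (possibly fractional) cells."""
    return (units_x / cell_units[0], units_y / cell_units[1])


def image_extent(img_w, img_h, px_units, cell_units, real_cell):
    """Hypothetical helper: extent of an image in cells and in screen pixels.

    px_units   -- size of one image pixel in virtual units (w, h)
    cell_units -- size of one text cell in virtual units (w, h)
    real_cell  -- actual cell size in screen pixels at the current font size
    """
    cols, rows = units_to_cells(img_w * px_units[0], img_h * px_units[1], cell_units)
    # Only this last step depends on the font size.
    return (cols, rows, round(cols * real_cell[0]), round(rows * real_cell[1]))


# A 100x200 image, 1 unit per pixel, virtual cell of 10x20 units:
print(image_extent(100, 200, (1, 1), (10, 20), (9, 18)))   # (10.0, 10.0, 90, 180)
print(image_extent(100, 200, (1, 1), (10, 20), (18, 36)))  # (10.0, 10.0, 180, 360)
# Subcell origin: an offset of 15x30 units is 1.5 cells in each direction.
print(units_to_cells(15, 30, (10, 20)))                    # (1.5, 1.5)
```

The cell coverage (10 x 10) does not depend on the real font size, and subcell offsets fall out of the same division.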

Pros & Cons

ease of implementation: At first glance, 1.) seems to be much easier to implement, even with float numbering. Ofc floats cannot be expressed in sequence params directly, so this would need some thinking/shaping (maybe with sub params like ... ; 30 : 2 ; ... for 30.2). Proposal 2.) seems slightly more complicated to implement, as it needs an additional indirection over the virtual unit. Same with keepAspectRatio: it needs an additional correction calculation to level out stretching artefacts from the base pixel size.
ease of usage: Imho proposal 1.) is easier to grasp for app devs, as the basic metric is in terms of text cells. Furthermore 1.) does not need any image transformation at all if we always expect image extent values on the sequence. For 2.) the image has to be converted into new pixel dimensions to fit the right extent on the TE. If the text cell size in units is to be configurable, the app furthermore needs to know that size beforehand.
flexibility: Proposal 2.) is more flexible, as it would allow redefining the text cell and image pixel size in units. This additional indirection makes perfect up/down scaling much easier. But it still shares the same problems as 1.) when font size changes don't scale in a perfectly stable ratio.
multiplexer: Both proposals should work with multiplexers. 1.) has the disadvantage of treating an image as a whole without knowing how image pixels map to cells, while 2.) would allow slicing parts of the image (certain pixel areas).
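The sub-parameter idea for fractional values (`30:2` for 30.2) could look roughly like the sketch below. It assumes non-negative values and a fixed number of fractional digits agreed on beforehand (here 1, so the sub param counts tenths); both `encode_param` and `decode_param` are hypothetical names, and the fixed precision is my assumption, not something settled in this thread.

```python
def encode_param(value, digits=1):
    """Encode a non-negative number as 'whole:frac' with fixed precision."""
    base = 10 ** digits
    # Work on the scaled integer so rounding can carry into the whole part.
    whole, frac = divmod(round(value * base), base)
    # Omit the sub param entirely when there is no fractional part.
    return f"{whole}:{frac}" if frac else str(whole)


def decode_param(text, digits=1):
    """Decode 'whole' or 'whole:frac' back into a float."""
    whole, _, frac = text.partition(":")
    return int(whole) + (int(frac) if frac else 0) / 10 ** digits


print(encode_param(30.2))    # 30:2
print(encode_param(30))      # 30
print(decode_param("30:2"))  # 30.2
```

The precision has to be fixed, because parsers typically read the sub param as a plain number: with two digits, `30:5` would mean 30.05, not 30.5.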

There might be other proposals or variants of the two above that make more sense in the end. Plz let's discuss the details (sorry for so much text) :smile_cat:

For the record, the conclusion I reached on the subject of fitting images to cells (at least in the context of Sixel) was to just leave it up to the user to decide. If a user needs to emulate legacy software designed for the VT340, they can select a 10x20 cell size, and images will automatically scale to fit cells of that size (i.e. a 100x200 image will always occupy exactly 10x10 text cells, regardless of the actual font size). Similarly, for a VT382, they can select a 12x30 cell size. If they're using modern applications that query the cell size, and they want pixel-perfect images, then they can also just set it to "automatic", and the emulated cell size will match the real font size.
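The arithmetic behind that example can be sketched as follows. The function names are hypothetical, and treating "automatic" as "emulated cell size equals real cell size" is how the paragraph above describes it; rounding up to whole cells is an assumption of this sketch.

```python
import math


def cells_covered(img_w, img_h, emu_cell):
    """Number of text cells (cols, rows) an image occupies for an emulated cell size."""
    return (math.ceil(img_w / emu_cell[0]), math.ceil(img_h / emu_cell[1]))


def screen_size(img_w, img_h, emu_cell, real_cell):
    """Size in real screen pixels after scaling from emulated to real cells."""
    return (round(img_w * real_cell[0] / emu_cell[0]),
            round(img_h * real_cell[1] / emu_cell[1]))


# VT340-style 10x20 cells: a 100x200 image always covers 10x10 cells ...
print(cells_covered(100, 200, (10, 20)))          # (10, 10)
# ... and gets scaled up when the real font is 18x36.
print(screen_size(100, 200, (10, 20), (18, 36)))  # (180, 360)
# "automatic": emulated == real, so the image stays pixel perfect.
print(screen_size(100, 200, (18, 36), (18, 36)))  # (100, 200)
```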

Once the user has that control, I don't see the need to allow apps to intervene. For example if I've configured a cell size of 10x20, and I'm happy with images being scaled to fit that size, I'm not sure I want applications overriding that and sending me high resolution images that chew up my bandwidth, just because my real font size is actually something like 18x36. As long as the terminal is reporting the emulated cell size in the CSI 16 t queries (and similarly scaling the response for CSI 14 t queries), then the app shouldn't care. Maybe it's not pixel perfect, and the aspect ratio isn't exactly right, but that's my choice to make.
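A small sketch of what reporting the emulated size could look like on the terminal side. The response formats follow xterm's window reports (`CSI 6 ; height ; width t` for the cell size, `CSI 4 ; height ; width t` for the text area); deriving the text area as cols/rows times the emulated cell size is just one possible way to do the scaling, and the function names are made up.

```python
def report_cell_size(emu_cell):
    """Response to CSI 16 t: emulated cell size in pixels."""
    w, h = emu_cell
    return f"\x1b[6;{h};{w}t"


def report_text_area(cols, rows, emu_cell):
    """Response to CSI 14 t: text area size in emulated pixels."""
    w, h = emu_cell
    return f"\x1b[4;{rows * h};{cols * w}t"


# An 80x24 terminal configured for 10x20 cells, whatever the real font is:
print(repr(report_cell_size((10, 20))))          # '\x1b[6;20;10t'
print(repr(report_text_area(80, 24, (10, 20))))  # '\x1b[4;480;800t'
```

An app that sizes its images from these reports never sees the real 18x36 font, so it has no reason to send higher resolution data.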

Also, from an implementation point of view, I didn't think it was practical to allow applications to change the cell size on the fly, because then you need to track the pixel scale associated with every image segment on the screen. And since images can overlap with transparency, you've potentially got weird overlapping shapes with each portion needing to be rendered with different scaling factors. Maybe that's feasible with the right architecture, but it sounded like a nightmare to me, and I really couldn't see a compelling need for it.

@j4james Thx for your thoughts. Hopefully I'll find more time to address these ideas in more detail once the monster PR is merged. If you are interested, we have also started some layering discussions for a new image protocol in #11.

I've been avoiding the image protocol discussions, because I'm not really a fan of the idea. I know Sixel isn't perfect, but I find it's good enough for my needs. And in general I'm more interested in implementing existing protocols than inventing new ones. Not that I want to discourage you guys from going that route, but it's not something I want to get involved in.

I was just chiming in here because you mentioned it in that other thread, and I thought the experience I've been through recently with Sixel scaling was somewhat relevant to this topic. If nothing else, I want to stress how important it can be to actually attempt to implement the ideas you're proposing, because it often isn't as easy as you first imagine (that was certainly the case for me).

Ultimately the success of a new protocol like this will depend on how many people you can persuade to implement it. And many will already have existing image storage and rendering architectures with which they'll need to integrate it. If it's not reasonably straightforward to get working, you may struggle to get it widely adopted.

@j4james We both know that we have somewhat different viewpoints on sixel as such. I second you in one point - it is good enough for many things ppl want to do with pixel graphics in terminals.

My main driving force for terminal graphics is the REPL idea: some cmdline REPL wants to spit out a visualisation of data, so it just cats some format the TE understands, and the user is happy. That's a very cheap goal, and indeed sixel can do that just fine (even interactively, if the app cares enough, by diff overwriting with blending). No need for a high color space or tricky alpha stuff, just give me some visualisation feedback to make sense of the data. The REPL case is the reason why I left my "SIXEL should have died 30 years ago" stance. Yes, it's good enough for that. (Sidenote: unlike SIXEL, I think ReGIS really is dead and should be replaced by some SVG-T thingy. :smile_cat:)

But the needs don't stop there: the "industry" moved on with better image formats, which creates demand. That's the point where I think the terminal world should evolve as well. The least that could be done is to establish some alternative data payload (read it as "allow PNG/JPG transports similar to the SIXEL sequence"). But we got SIXEL wrong the second time around (mostly because of the lack of correct docs or access to test devices), so we somewhat have to re-iterate the basics again. That involves nasty things like the pixel to col/row translation. In the meantime other sequences popped up, which have fundamental flaws regarding basic terminal interface principles. Ideally we manage to avoid those flaws (like pixel notions in sequence params). Furthermore, reshaping things from the ground up gives us the chance to enhance/extend the capabilities, as the layering discussion tries to figure out what might be wanted.

Ofc this process is very tedious, and there are no guarantees that ppl will adopt things in the end. Regarding an actual implementation: most things discussed here so far are at least backed up on my side with playground implementations, so I am pretty aware of what is doable and what should be avoided as an anti-pattern. But I am not an egomaniac who tries to push his own ideas without seeking some consensus first. We have at least a small group of TE devs who share most ideas, which gives me at least the impression that things will go in the right direction.

contour-terminal / terminal-good-image-protocol

image position/extend addressing mode and metrics #13