more robust parser

doy commented 10 years ago

Right now, I parse chunks at a time, which will break if an escape sequence is split across chunks, among many other things. I really need to rewrite this to use a char-at-a-time state machine, or something along those lines.
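For illustration, a char-at-a-time state machine could look roughly like the sketch below. It is not the code in this repository: the state names, the `struct parser` fields, and the simplification that only plain bytes and CSI sequences (`ESC [ ... final byte`) are recognized are all assumptions made for the example. The point is that the state lives in the struct, so a sequence split across two chunks resumes where it left off.

```c
#include <stddef.h>

/* Hypothetical parser states; a real terminal parser would need many more. */
enum parse_state { ST_GROUND, ST_ESC, ST_CSI };

struct parser {
    enum parse_state state; /* persists between chunks */
    size_t text_bytes;      /* plain bytes seen so far */
    size_t csi_count;       /* complete CSI sequences seen so far */
};

void parser_init(struct parser *p)
{
    p->state = ST_GROUND;
    p->text_bytes = 0;
    p->csi_count = 0;
}

/* Advance the machine by exactly one byte. */
void parser_feed_byte(struct parser *p, unsigned char c)
{
    switch (p->state) {
    case ST_GROUND:
        if (c == 0x1b)
            p->state = ST_ESC;
        else
            p->text_bytes++;
        break;
    case ST_ESC:
        /* Only CSI is modelled; any other escape is treated as two bytes long. */
        p->state = (c == '[') ? ST_CSI : ST_GROUND;
        break;
    case ST_CSI:
        /* Parameter bytes are skipped until a final byte in 0x40..0x7e. */
        if (c >= 0x40 && c <= 0x7e) {
            p->csi_count++;
            p->state = ST_GROUND;
        }
        break;
    }
}

/* Feed a whole chunk; chunk boundaries do not matter to the result. */
void parser_feed(struct parser *p, const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parser_feed_byte(p, (unsigned char)buf[i]);
}
```

Feeding `"ab\x1b[3"` and then `"1mcd"` as two separate chunks leaves the machine in the same state as feeding the seven bytes at once.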

doy commented 10 years ago

The parser has been converted to use flex, but it still has the problem of splitting escape sequences across reads. I'll need to move the yylex call into the libuv work queue function directly, and have that be the blocking function rather than the read.

doy commented 10 years ago

So that won't actually work. If you read large chunks at once, flex can't tell whether it should keep matching: when it receives a string of text, it has no way of knowing whether it should just return it or try reading again, which may block. And if you read single bytes at a time, it will break UTF-8. What we actually need to do here is parse out of a pre-read string like we were before, but without the fallback rules for things like warning about incomplete escape sequences. Then, after yylex returns, look at yyleng to see how much of the string we actually parsed, and if it wasn't the whole string, push the remaining characters back into a buffer to try parsing them the next time we have things to read.
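The carry-over idea can be sketched independently of flex. In the example below, `consume_complete` is a hypothetical stand-in for the scanner (it only understands plain bytes and CSI sequences), and `struct pending`, `feed`, and `PENDING_MAX` are names invented for the illustration. It shows only the buffering: append the new read, consume whatever forms complete tokens, and keep the unparsed tail for next time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PENDING_MAX 4096 /* arbitrary capacity chosen for the sketch */

struct pending {
    char data[PENDING_MAX];
    size_t len; /* bytes carried over from earlier reads */
};

/* Hypothetical stand-in for the scanner: returns how many leading bytes
 * of buf form complete tokens, stopping before an unfinished escape. */
static size_t consume_complete(const char *buf, size_t len)
{
    size_t pos = 0;
    while (pos < len) {
        if (buf[pos] != '\x1b') { pos++; continue; }
        if (pos + 1 >= len) break;                      /* lone ESC at the end */
        if (buf[pos + 1] != '[') { pos += 2; continue; } /* non-CSI escape */
        size_t end = pos + 2;
        while (end < len) {
            unsigned char c = (unsigned char)buf[end];
            if (c >= 0x40 && c <= 0x7e) break;          /* CSI final byte */
            end++;
        }
        if (end == len) break;                          /* CSI not finished yet */
        pos = end + 1;
    }
    return pos;
}

/* Append a freshly read chunk, parse what is complete, keep the rest.
 * Returns the number of bytes consumed on this call. */
size_t feed(struct pending *p, const char *chunk, size_t n)
{
    assert(p->len + n <= PENDING_MAX);
    memcpy(p->data + p->len, chunk, n);
    p->len += n;

    size_t used = consume_complete(p->data, p->len);
    memmove(p->data, p->data + used, p->len - used); /* keep unparsed tail */
    p->len -= used;
    return used;
}
```

With this shape the scanner never has to block waiting for more input: an incomplete sequence simply stays in `pending` until the next read completes it.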

doy commented 10 years ago

This still leaves the issue of UTF-8 characters being split across reads, though. That might be solvable just by making the parser aware of UTF-8 and only reading full characters. I'm not sure how cairo (or pango or whatever) will handle being handed a codepoint for a character followed by a codepoint for a combining character in a second pass. I'm also not sure there is anything at all we can do about that, since the initial codepoint is a valid printable thing on its own, and there's no way of telling whether a combining character will be coming next if it's not already in the buffer.
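One way to "only read full characters" is to work out how many bytes at the front of the buffer form complete UTF-8 sequences and hold back the rest. The sketch below does that; `utf8_seq_len` and `utf8_complete_prefix` are hypothetical helper names, and the code does not validate the input beyond checking lead-byte patterns.

```c
#include <stddef.h>

/* Length of the UTF-8 sequence announced by a lead byte,
 * or 0 if the byte is a continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Number of leading bytes of buf that can be parsed now. If the buffer
 * ends in the middle of a multi-byte sequence, that partial sequence is
 * excluded so it can be retried after the next read. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most three trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len; /* empty or no lead byte in range: nothing to hold back */

    size_t need = utf8_seq_len(buf[i - 1]); /* bytes the last sequence wants */
    size_t have = len - (i - 1);            /* bytes of it that have arrived */
    if (need > have)
        return i - 1;                       /* incomplete trailing sequence */
    return len;
}
```

This only addresses codepoints split across reads. A base character and a following combining character are each complete codepoints, so this check cannot tell that more of the cluster is still on its way.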

doy commented 10 years ago

Codepoints and escape sequences are now handled properly when split across reads. Still not sure what to do about glyph clusters split across reads.

doy commented 10 years ago

It looks like urxvt just applies the combining character to whatever character is to the left of the cursor, so this will probably be something that I need to wait on #45 for.

doy commented 10 years ago

Actually, the parser itself is fine at this point - the combining character issue will be handled elsewhere.

doy / runes

more robust parser #4