TL;DR
This PR addresses an issue where any emoji symbols in the input lua script would be replaced by a garbled sequence of characters. The proposed solution is to remove picotool's current handling of P8SCII escape sequences, which does not seem to function as intended.
The Details
I encountered an issue where any use of the 🅾️ emoji in my lua script would be replaced by "ユか✽ゆヤま◆" after building a .p8 cart with picotool. The cause of this issue seems to stem from P8SCII being treated as an encoding in itself. In practice, this treatment boils down to two steps:
1. When parsing a string literal, the lexer replaces any numerical P8SCII escape sequence it encounters with a byte of the specified value, seemingly hoping that this results in a "pure" P8SCII string.
2. Later, the P8 formatter calls lua.p8scii_to_unicode, which seems meant to convert all P8SCII characters in the passed string to their utf-8 counterparts. The formatter assumes, at this point, that the lua script is P8SCII encoded. As a side note, this substitution routine runs on the entire script and not just on the string tokens that had their escape sequences converted by the lexer in step 1.
Both of the above steps have inherent issues:
1. Replacing P8SCII escape sequences with their corresponding byte values does not turn the entire input string into a P8SCII encoded string, because the majority of the string retains its original encoding (utf-8). What we end up with instead is a mix of utf-8 and P8SCII.
2. The assumption that the passed string is P8SCII encoded is incorrect. It is, in fact, mostly utf-8 with a few dashes of P8SCII encoded characters as a result of step 1. When this conversion routine encounters the seven-byte utf-8 sequence for 🅾️, it treats each of the seven bytes as a separate P8SCII character and replaces it with a new utf-8 character, resulting in "ユか✽ゆヤま◆".
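To make the failure mode concrete, here is a minimal sketch of the byte-by-byte substitution described above. It is not picotool's actual code: the function name `naive_p8scii_to_unicode` and the table `P8SCII_GLYPHS` are hypothetical, and the table is deliberately partial, covering only the seven byte values that occur in the utf-8 encoding of 🅾️.

```python
# Partial, illustrative P8SCII table (byte value -> glyph). Only the seven
# byte values that appear in the utf-8 encoding of the emoji are listed;
# this is an assumption for the demo, not picotool's real lookup table.
P8SCII_GLYPHS = {
    0xF0: "ユ", 0x9F: "か", 0x85: "✽", 0xBE: "ゆ",
    0xEF: "ヤ", 0xB8: "ま", 0x8F: "◆",
}


def naive_p8scii_to_unicode(data: bytes) -> str:
    """Treat every byte as one P8SCII character (the flawed assumption).

    Bytes missing from the partial table fall back to chr(b), which is
    only a placeholder so that ASCII text passes through unchanged.
    """
    return "".join(P8SCII_GLYPHS.get(b, chr(b)) for b in data)


# U+1F17E followed by the variation selector U+FE0F, i.e. the emoji as it
# typically appears in a utf-8 source file.
emoji = "\U0001F17E\uFE0F"
utf8_bytes = emoji.encode("utf-8")  # f0 9f 85 be ef b8 8f

garbled = naive_p8scii_to_unicode(utf8_bytes)
print(len(utf8_bytes), garbled)
```

Because the input is really utf-8, one visible symbol spans seven bytes, and a routine that assumes one byte per character necessarily emits seven unrelated glyphs.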
Future Improvements
I would argue against treating P8SCII as a text encoding. Instead, I would treat it merely as a collection of escape sequences that hold a special meaning when passed to Pico-8's print function, and pass them through unchanged. If pre-interpreting these escape sequences is still a desired feature, I'd suggest it be done in one go when parsing or writing the string tokens instead of passing through an intermediate format.
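If pre-interpretation were kept, a single-pass version might look roughly like the sketch below. Everything here is an assumption made for illustration: `interpret_escapes` and `GLYPHS` are hypothetical names, the glyph table is partial, and the function operates only on the body of one string token, so the rest of the script is never touched and no mixed-encoding intermediate string exists.

```python
import re

# Hypothetical, partial glyph table (escape value -> Unicode glyph).
GLYPHS = {
    142: "\U0001F17E\uFE0F",  # the O button glyph
    151: "\u274E",            # the X button glyph
}

# Match any backslash escape. Group 1 is set only for decimal escapes of
# one to three ASCII digits; every other escape (including an escaped
# backslash) is consumed as a unit so its tail is never misread.
_ESCAPE = re.compile(r"\\(?:(\d{1,3})|.)", re.ASCII | re.DOTALL)


def interpret_escapes(literal_body: str) -> str:
    """Replace known decimal escapes with glyphs in a single pass.

    Unknown escapes and all non-escape text are returned unchanged.
    """
    def repl(match: re.Match) -> str:
        digits = match.group(1)
        if digits is not None and int(digits) in GLYPHS:
            return GLYPHS[int(digits)]
        return match.group(0)

    return _ESCAPE.sub(repl, literal_body)


print(interpret_escapes(r"press \142 to jump"))
```

Text that is already Unicode, such as an emoji typed directly into the script, contains no backslash and therefore passes through this function untouched.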
There are probably additional code paths or data structures that are made dead by this change and could be removed.