dansanderson / picotool

Tools and Python libraries for manipulating Pico-8 game files. http://www.lexaloffle.com/pico-8.php
MIT License

Disable P8SCII unescaping to fix mangling of emoji characters #106

Open simonwulf opened 2 years ago

simonwulf commented 2 years ago

TL;DR

This PR addresses an issue where any emoji symbols in the input lua script would be replaced by a garbled sequence of characters. The proposed solution is to remove picotool's current handling of P8SCII escape sequences, which does not seem to function as intended.

The Details

I encountered an issue where any use of the 🅾️ emoji in my lua script would be replaced by "ユか✽ゆヤま◆" after building a .p8 cart with picotool. The cause of this issue seems to stem from P8SCII being treated as an encoding in itself. In practice, this treatment boils down to two steps:

  1. When parsing a string literal, the lexer replaces any numerical P8SCII escape sequence it encounters with a byte of the specified value, seemingly hoping that this results in a "pure" P8SCII string.
  2. Later, the P8 formatter calls lua.p8scii_to_unicode, which seems meant to convert all P8SCII characters in the passed string to their utf-8 counterparts. The formatter assumes, at this point, that the lua script is P8SCII encoded. As a side note, this substitution routine runs on the entire script and not just on the string tokens that had their escape sequences converted by the lexer in step 1.
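
As a rough illustration of step 1, the sketch below replaces decimal escape sequences in a string literal with raw bytes. It is not picotool's actual lexer code: the function name `unescape_decimal` and the regular expression are assumptions made for this example, and only `\ddd` escapes with values up to 255 are handled.

```python
import re

# Hypothetical stand-in for the lexer behaviour described in step 1.
# Matches a backslash followed by one to three decimal digits.
_DECIMAL_ESCAPE = re.compile(rb"\\(\d{1,3})")


def unescape_decimal(literal: bytes) -> bytes:
    """Replace each \\ddd escape with the single byte it names.

    Assumes every escape value is at most 255, as Lua requires.
    """
    return _DECIMAL_ESCAPE.sub(lambda m: bytes([int(m.group(1))]), literal)


# "\142" is the P8SCII code for the O button glyph; after unescaping it
# becomes the lone byte 0x8E, while the rest of the literal is untouched.
unescaped = unescape_decimal(b"press \\142")
```

The result is a byte string in which only the escaped positions follow P8SCII; everything around them keeps whatever encoding the source file used.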

Both of the above steps have inherent issues:

  1. Replacing P8SCII escape sequences with their corresponding byte values does not turn the input string in its entirety into a P8SCII encoded string, as the majority of the string retains its original encoding (utf-8). What we end up with instead is a mix of utf-8 and P8SCII.
  2. The assumption that the passed string is P8SCII encoded is incorrect. It is, in fact, mostly utf-8 with a few dashes of P8SCII encoded characters as a result of step 1. When this conversion routine encounters the seven-byte utf-8 sequence for 🅾️, it will replace each of the seven bytes with a new utf-8 character, resulting in "ユか✽ゆヤま◆".
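
Both issues can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not picotool's implementation: `P8SCII_SAMPLE` is a hand-copied slice of the P8SCII table containing only the byte values needed here, and `bytewise_p8scii_to_unicode` is a hypothetical stand-in for the kind of per-byte substitution that `lua.p8scii_to_unicode` appears to perform.

```python
# Partial, hand-copied slice of the P8SCII table (an assumption for this
# example): only the byte values that occur in the emoji's utf-8 encoding.
P8SCII_SAMPLE = {
    0x85: "✽",
    0x8F: "◆",
    0x9F: "か",
    0xB8: "ま",
    0xBE: "ゆ",
    0xEF: "ヤ",
    0xF0: "ユ",
}


def bytewise_p8scii_to_unicode(data: bytes) -> str:
    """Hypothetical per-byte conversion: treat every byte as P8SCII."""
    return "".join(P8SCII_SAMPLE.get(b, chr(b)) for b in data)


# The O button emoji is U+1F17E followed by the variation selector U+FE0F.
emoji_bytes = "\U0001F17E\uFE0F".encode("utf-8")  # F0 9F 85 BE EF B8 8F

# Issue 2: each of the seven utf-8 bytes is mapped to its own character.
mangled = bytewise_p8scii_to_unicode(emoji_bytes)  # "ユか✽ゆヤま◆"

# Issue 1: a literal that contains both an unescaped P8SCII byte (0x8E)
# and utf-8 text is neither valid utf-8 nor pure P8SCII.
mixed = b'"\x8e' + emoji_bytes + b'"'
try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False
```

Because the mixed byte string cannot be interpreted consistently under either encoding, no conversion applied to the whole script can recover the original text.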

Future Improvements