clemg / pythongolfer

Code golfer and minifier for Python https://clemg.github.io/pythongolfer/
GNU General Public License v3.0
33 stars, 5 forks

Some non-ASCII input mangled silently #4

Open escalonn opened 2 years ago

escalonn commented 2 years ago

Due to a broken check in the codegolf function, non-Latin-1 characters (all those above U+00FF) at odd-numbered (0-based) indices in the input string have their code point silently truncated to its low 8 bits, instead of an error being raised so the user can be notified.

To Reproduce

  1. Enter 'ज़' (quotes included) into the input box. This puts a non-Latin-1 character, U+095B, at index 1.
  2. Click on the "Golf it" button.
  3. Observe the printed output: exec(bytes('嬧‧','u16')[2:])
  4. Verify that bytes('嬧‧','u16')[2:] evaluates to b"'[' ", which does not match the input code.
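
The check in step 4 can be scripted. This is only a sketch and not part of pythongolfer: the names `golfed`, `typed` and `decoded` are made up for illustration, and it uses the "utf-16-le" codec instead of 'u16' so that there is no byte-order mark to slice off and the result does not depend on the machine's native byte order.

```python
# Hypothetical check, not part of pythongolfer: decode the golfed string
# from step 3 and compare it with what was typed in step 1.
golfed = "\u5b27\u2027"   # the two characters printed inside exec(bytes(...))
typed = "'\u095b'"        # quote, U+095B, quote

# "utf-16-le" writes no byte-order mark, so no [2:] slice is needed.
decoded = golfed.encode("utf-16-le")

print(decoded)                      # the bytes that exec() would receive
print(decoded == typed.encode())    # whether the input survived golfing
print(hex(ord(typed[1]) & 0xFF))    # low 8 bits of U+095B
```

The last line shows where the stray `[` comes from: the low 8 bits of U+095B are 0x5B, which is the ASCII code for `[`.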

Expected behavior

An error message about non-ASCII characters should be displayed, as it is for the input ' ज़' (a space added to move the character to an even-numbered index).

Additional context

The code causing the issue is here. Effectively, c1 (the character at the even-numbered index) is checked, but c2 is ignored and subsequently truncated.
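
To make the c1/c2 description concrete, here is a rough Python sketch of the pairing logic. The real code is JavaScript and is not quoted in this issue, so `pack_pairs`, its `check_both` flag, the space padding and the 16-bit mask are all assumptions, chosen only because they reproduce the output seen above.

```python
def pack_pairs(source: str, check_both: bool = True) -> str:
    """Pack each pair of characters into one UTF-16 code unit (hypothetical sketch)."""
    if len(source) % 2:
        source += " "  # assumed padding; the observed output ends in a space
    packed = []
    for i in range(0, len(source), 2):
        c1, c2 = ord(source[i]), ord(source[i + 1])
        # check_both=False mimics the reported bug: only c1 is validated.
        if c1 > 127 or (check_both and c2 > 127):
            raise ValueError(f"non-ASCII character in the pair starting at index {i}")
        # c1 goes in the low byte, c2 in the high byte. The mask keeps 16 bits,
        # which is what silently drops the upper bits of an oversized c2.
        packed.append(chr((c1 + (c2 << 8)) & 0xFFFF))
    return "".join(packed)


buggy = pack_pairs("'\u095b'", check_both=False)
print([hex(ord(ch)) for ch in buggy])   # code units produced without the c2 check

try:
    pack_pairs("'\u095b'")              # with both characters checked
except ValueError as err:
    print("rejected:", err)
```

With `check_both=False` the sketch yields the same two characters as the site's output. With the default it rejects the input, which is the behavior this issue asks for.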

Also, the handling of characters from the Latin-1 Supplement block (U+0080 to U+00FF) by this site is unclear. These are non-ASCII characters, but is there a reason to ban them from the input? Shouldn't the check really be > 255 instead of > 127?
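
One way to probe that question is the sketch below. It is not the site's code: `pack` is a hypothetical helper that assumes the same low-byte/high-byte packing as above. It suggests two possible reasons for keeping the limit at 127. Some Latin-1 values in the high byte land in the surrogate range, and the resulting bytes are not necessarily valid UTF-8, which matters if exec reads a bytes object as UTF-8 source (I believe that is the default, but have not checked every Python version).

```python
def pack(c1: str, c2: str) -> str:
    """Hypothetical helper: c1 becomes the low byte, c2 the high byte."""
    return chr(ord(c1) | (ord(c2) << 8))


# U+00E9 as the second character packs to U+E961, which encodes fine.
fine = pack("a", "\u00e9").encode("utf-16-le")
print(fine)

# U+00D8 as the second character packs to U+D861, a lone surrogate,
# which Python's UTF-16 codec refuses to encode.
try:
    pack("a", "\u00d8").encode("utf-16-le")
    surrogate_ok = True
except UnicodeEncodeError:
    surrogate_ok = False
print(surrogate_ok)

# Even when packing succeeds, the raw byte 0xE9 is not valid UTF-8 by itself.
try:
    fine.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```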

clemg commented 2 years ago

Good catch! This is indeed a problem.

Would you like to submit a PR?