dsa-ou / allowed

Check if a program only uses a subset of the Python language.
https://dsa-ou.github.io/allowed/
BSD 3-Clause "New" or "Revised" License
10 stars 6 forks source link

Windows UTF-8 encoding #47

Closed densnow closed 8 months ago

densnow commented 8 months ago

A student posted on the forums that they were getting the following UNICODE error when checking TMA02 with allowed on Windows resulting in the file not being checked.

UNICODE ERROR: 'charmap' codec can't decode byte 0x9d in position 96375: character maps to <undefined>

I was able to replicate the Error on Windows 10 running in Virtualbox using the same TMA02 .ipynb file.

After changing line 500 in allowed.py from

 with Path(filename).open() as file:

to

 with Path(filename).open(encoding='utf-8') as file:

the error was no longer present and the notebook was checked.

One can only assume that windows is opening the file with an encoding other than UTF-8 ?

densnow commented 8 months ago

From the Python3.10 interpreter running in Windows 10 Powershell:

>>> import sys
>>> import locale
>>> print(sys.getdefaultencoding())
utf-8
>>> print(locale.getpreferredencoding())
cp1252
mwermelinger commented 8 months ago

Thanks for this, but forcing a fixed encoding may lead to problems for other files. Can you try without encoding but instead errors='surrogateescape'? I got it from here.

If that fixes the issue, i.e. finds the same errors as with UTF-8 encoding, then please make a PR.

mwermelinger commented 8 months ago

I already created a branch for this.

densnow commented 8 months ago

I can confirm that on Windows 10 with the errors='surrogateescape' argument added to .open(), there were no UNICODE errors reported. This was using the same TMA02 .ipynb file that caused the original errors to be shown.

mwermelinger commented 8 months ago

Good to know that the Unicode errors go away, but does it report the same disallowed constructs as with utf-8? We don't want the surrogates to lead to a different list of reported violations...

densnow commented 8 months ago

@mwermelinger the last commit seems to have changed the way the dev environment is working and I cannot seem to test allowed in the same way.

When using the normal make install then make test command the dev environment does not seem to install ipython and pytype Is there a different way to set up the environment now?

One more thing to note is that pylint does not like the .open() method to be used without an explicit encoding. Its giving me the dreaded yellow squiggly line ! :smile:

mwermelinger commented 8 months ago

I haven't used make install in a while... Probably poetry install also needs [all] to install pytype and ipython.

densnow commented 8 months ago

Thanks, poetry does need extra arguments to set up dev environment now. I used poetry install --extras all which seemed to do the job (although I don't know if that is the best way). The tests are reporting no strange behaviors with the errors='surrogateescape' argument.

See unspecified encoding for details on the pylint warning.

mwermelinger commented 8 months ago

After reading pylint and PEP 597, I guess it is best to explicitly use utf-8 and surrogate any error characters, in case the original isn't utf-8. Thanks for the detective work! (And thanks to Steve for reporting the issue.)