jamadden opened 4 years ago
Thanks for opening a detailed issue. Cross linking to the less detailed jira ticket I opened based on the error email. https://nextthought.atlassian.net/browse/NTI-9724
This seems straightforward enough. However, from reading this I would assume that attempting something similar in alpha would get me an error, which it does not (e.g. by referencing a file from a lesson and attempting to export). Calling `sys.getfilesystemencoding()` from within the python executable generated from the buildout on alpha also returned `"UTF-8"`, which I didn't expect either. Unfortunately I can't reproduce this locally either, ostensibly since macOS uses UTF-8 as its default encoding. Is there something I'm missing, or some way that might allow me to reproduce this either locally or in alpha? Is it expected that we would only see this in prod?
What character string were you testing?
The file system encoding on Linux, and POSIX in general, is UTF-8. That's because paths are opaque sequences of bytes; the only special byte value is `/`. So the FS can store any string, and by convention UTF-8 is used. But by default Python will encode Unicode using the ASCII encoding when it calls the byte-based `open` API. So the FS doesn't care; it's the layer above that's blowing up.
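The failure mode is easy to demonstrate directly. This sketch makes the implicit ASCII encoding step explicit (the filename here is just a hypothetical example):

```python
# On Python 2, a unicode path handed to the byte-based open() is implicitly
# encoded with the ASCII codec first; the same step shown explicitly:
path = u"r\u00e9sum\u00e9.txt"  # hypothetical non-ASCII filename

try:
    path.encode("ascii")  # what Python 2's open() effectively does
except UnicodeEncodeError as e:
    print("the layer above the FS blows up:", e)

# The filesystem itself would have been perfectly happy with UTF-8 bytes:
print(path.encode("utf-8"))  # b'r\xc3\xa9sum\xc3\xa9.txt'
```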
> Calling `sys.getfilesystemencoding()` from within the python executable generated from the buildout on alpha also returned "UTF-8", which I didn't expect either.
Hmm. I overlooked that part. I know that macOS used "utf-8" by default (because that actually is the native encoding of HFS+ and APFS; Darwin exposes kernel APIs that accept and expect UTF8, unlike Linux), but I thought I knew that generic POSIX systems used the regular default encoding.
And they do, according to `Python/bltinmodule.c`:

```c
/* The default encoding used by the platform file system APIs
   Can remain NULL for all platforms that don't have such a concept
*/
#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
const char *Py_FileSystemDefaultEncoding = "mbcs";
#elif defined(__APPLE__)
const char *Py_FileSystemDefaultEncoding = "utf-8";
#else
const char *Py_FileSystemDefaultEncoding = NULL; /* use default */
#endif
```
I checked this on two linux systems; on one, I got "UTF-8", but on the other, I got "ANSI_X3.4-1968". WTF?
There's no direct way to change that from user code; it's not exposed. But according to the documentation, it can sometimes be changed (emphasis mine):

> Setting encoding to NULL causes the default encoding to be used which is ASCII. The file system calls should use `Py_FileSystemDefaultEncoding` as the encoding for file names. This variable should be treated as read-only: on some systems, it will be a pointer to a static string, on others, it **will change at run-time** (such as when the application invokes `setlocale`).
Ahh, there's the clue. We're a webapp; it's likely we're setting a default locale at some point. But I wasn't doing that manually. So where actually is `Py_FileSystemDefaultEncoding` tinkered with?
Turns out the very act of starting the interpreter calls `setlocale` and will set the filesystem encoding based on that:

```c
/* On Unix, set the file system encoding according to the
   user's preference, if the CODESET names a well-known
   Python codec, and Py_FileSystemDefaultEncoding isn't
   initialized by other means. Also set the encoding of
   stdin and stdout if these are terminals, unless overridden. */
if (!overridden || !Py_FileSystemDefaultEncoding) {
    ...
    setlocale(LC_CTYPE, "");
    loc_codeset = nl_langinfo(CODESET);
    ...
    /* Initialize Py_FileSystemDefaultEncoding from
       locale even if PYTHONIOENCODING is set. */
    if (!Py_FileSystemDefaultEncoding) {
        Py_FileSystemDefaultEncoding = loc_codeset;
        if (!overridden)
            free_codeset = 0;
    }
}
```
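The C code above has a close Python analogue using the stdlib `locale` module, which is handy for checking what a given host will do (POSIX-only; `nl_langinfo` doesn't exist on Windows):

```python
import locale
import sys

# What Python decided the filesystem encoding was at startup:
print(sys.getfilesystemencoding())

# What the C locale machinery reports after honoring the environment,
# mirroring the setlocale(LC_CTYPE, "") / nl_langinfo(CODESET) calls above:
locale.setlocale(locale.LC_CTYPE, "")
print(locale.nl_langinfo(locale.CODESET))  # e.g. "UTF-8" or "ANSI_X3.4-1968"
```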
What value should we expect `loc_codeset` to get after the line `setlocale(LC_CTYPE, "");`? The answer is that it does something different depending on the system. On one system, the manpage says:
> Only three locales are defined by default: the empty string `""` (which denotes the native environment)

Where "the native environment" is never defined. On another system, it says:
> If locale is an empty string, `""`, each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent. For glibc, first (regardless of category), the environment variable `LC_ALL` is inspected, next the environment variable with the same name as the category (see the table above), and finally the environment variable `LANG`. The first existing environment variable is used. If its value is not a valid locale specification, the locale is unchanged, and `setlocale()` returns `NULL`.
For the root of the problem then, it's likely that the environment variables are different between Alpha and the container environment experiencing this problem (or, the container environment doesn't have the locale specified by the variables installed, so it's not "a valid locale" and we're falling back to the C locale).
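Comparing the two hosts along those lines is mostly a matter of dumping the locale-related environment and the installed locales on each (a diagnostic sketch; exact output is host-dependent):

```shell
# Locale-related environment variables that setlocale(LC_CTYPE, "") consults:
env | grep -E '^(LC_|LANG)' || echo "no locale variables set"

# What the current settings actually resolve to:
locale

# Which locales are installed; a locale named by the variables but not
# installed here means setlocale fails and we stay on the C/ASCII locale:
locale -a | grep -i 'utf' || echo "no UTF-8 locales installed"
```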
Note that `Py_FileSystemDefaultEncoding` is a public, exported symbol, so any extension module could change it at any time. That, plus the fact that it's highly system dependent, suggests that any call to `open` should be prepared to handle this by doing its own encoding. (Ironically, the `setlocale` exposed to Python doesn't actually change the file system encoding on either Python 2 or 3.)
Sadly, that's a bit complicated in the general case: `open` supports objects that implement the `__fspath__` protocol (`pathlib.Path`), and the behaviour of encoding unicode to bytes is more strictly specified and potentially depends on a number of different Python settings and the system. I think the `fsencode` function (with `fspath`) from `gevent._compat` is a reasonably correct implementation of the desired behaviour. It won't be going away in any version of gevent that supports Python 2, so a reasonable thing to do is:
```python
try:
    from os import fsencode  # Python 3
except ImportError:  # Python 2
    from gevent._compat import fsencode
...
# Typically unnecessary on Python 3, except for debugging
# or user presentation or persistence/transport
path = fsencode(path)
with open(path, ...):
    ...
```
There's possibly the basis of a blog post in here...
Thanks for the detailed analysis, that's extremely helpful and explains my confusion and inability to reproduce it. I'll work with Sean to see about getting the environments consistent and make the updates in our filer logic as well. I'll need to see about getting a test environment up to mimic what we're seeing in the container site.
I may have misunderstood how utilizing `fsencode` would help us. It sounds like (from the `surrogateescape` description at https://docs.python.org/3/library/codecs.html) the `surrogateescape` error handler used when encoding the target filename to the fs encoding will simply handle unicode characters that are already surrogates, i.e. were read from a byte stream and were unable to be properly decoded. This doesn't seem like it would help in attempting to store unicode characters outside the ASCII range if the fs encoding is ASCII. In addition, when attempting to use that error handler in Python 2, I seem to be getting the error `LookupError: unknown error handler name 'surrogateescape'`. Looking at https://docs.python.org/2.7/library/codecs.html, I don't see it listed as a valid error handler. We could potentially use other error handlers specified there, but none would preserve the original character on a subsequent decode.
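A quick check in Python 3 seems to confirm this reading: `surrogateescape` only round-trips bytes that were smuggled into lone surrogates during a decode; it does not rescue ordinary non-ASCII characters against an ASCII target encoding.

```python
# Bytes that aren't valid UTF-8 decode into lone surrogates...
raw = b"r\xe9sum\xe9"  # Latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(repr(name))  # the undecodable bytes became U+DCE9 surrogates

# ...and those surrogates round-trip losslessly on encode:
assert name.encode("utf-8", "surrogateescape") == raw

# But a genuine non-ASCII character still can't be encoded to ASCII,
# surrogateescape or not:
try:
    u"r\u00e9sum\u00e9".encode("ascii", "surrogateescape")
except UnicodeEncodeError:
    print("surrogateescape does not help here")
```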
@jamadden does the above seem right to you? Or am I overlooking or misunderstanding something?
It shouldn’t matter because utf8 can encode everything, but I think that handler may be a third party package on python 2.
Seems like it would matter if we wanted to handle non-utf8 file systems. There seems to be some desire to have our container environments, or at least some environments, running using a default "C" locale. Based on what you've found, it seems like that'll lead to fs encodings of ascii and then we'll need to handle that, either by using character substitution (either from our own code or from another error handler) or maybe mapping internal to external filenames, at least in that case.
No, bytes are bytes, and that's all the filesystem cares about. UTF-8 encoding will work fine no matter what `sys.getfilesystemencoding` says or whether locale data files are installed or not.
`fsencode` uses `sys.getfilesystemencoding` to determine how to encode. To ensure UTF-8 encoding, we'd need to not use `fsencode` and either pass bytes to `transfer_to_native_file` or have it explicitly encode any non-bytes as UTF-8.
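That second option could look roughly like this (a sketch; the helper name `to_native_path` is hypothetical, not existing `transfer_to_native_file` code):

```python
def to_native_path(path):
    """Always hand bytes to the byte-based file APIs, encoding text as
    UTF-8 regardless of what the locale or sys.getfilesystemencoding()
    claim.  Bytes are passed through untouched."""
    if isinstance(path, bytes):
        return path
    return path.encode("utf-8")

print(to_native_path(u"r\u00e9sum\u00e9.txt"))  # b'r\xc3\xa9sum\xc3\xa9.txt'
print(to_native_path(b"already-bytes.txt"))     # b'already-bytes.txt'
```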
Yes. Sorry, I took your earlier message as ruling out `fsencode` for this use case and had jumped to direct encoding in my head.
> On Oct 28, 2020, at 14:38, Bobby Hagen notifications@github.com wrote:
> fsencode uses sys.getfilesystemencoding to determine how to encode. To ensure utf8 encoding, we'd need to not use fsencode and either pass bytes to transfer_to_native_file or have it explicitly encode any non-bytes as utf8.
The function `transfer_to_native_file` calls `open` on some arbitrary path (hopefully it's been through a bunch of sanitization before this!): https://github.com/NextThought/nti.dataserver/blob/f8b19159a6a9bdd41b7f7bf0edcc5cde09dbc092/src/nti/cabinet/filer.py#L52-L56
If the `target` path is unicode, it will be encoded using the default encoding when passed to `open`, which, unless we've messed with `site.py`, is going to be ASCII on Python 2. This leads to problems:

On Python 2, the programmer is responsible for converting Unicode paths to something that can be represented by the filesystem. This should be done using `sys.getfilesystemencoding()` (if defined).

For forward compatibility with Python 3, this should be done with a function called `fsencode` imported from a compatibility module.

gevent has a tested cross-platform/cross-version implementation at https://github.com/gevent/gevent/blob/c698c5843ee5c659b418762e2415da5a7d53f134/src/gevent/_compat.py#L150-L174
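For reference, the shape of such a compatibility shim looks roughly like this (a simplified sketch of the pattern, not gevent's exact code, which handles more edge cases):

```python
import sys

try:
    from os import fsencode  # Python 3 stdlib has it built in
except ImportError:
    # Python 2 fallback: encode text using the reported FS encoding.
    def fsencode(path):
        if isinstance(path, bytes):
            return path
        encoding = sys.getfilesystemencoding() or "ascii"
        return path.encode(encoding)

print(fsencode(u"plain.txt"))  # b'plain.txt'
```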