OpenNTI / nti.dataserver

Other

nti.cabinet.filer: Issues with unicode paths #388

Open jamadden opened 4 years ago

jamadden commented 4 years ago

The function transfer_to_native_file calls open on some arbitrary path (hopefully it's been through a bunch of sanitization before this!):

https://github.com/NextThought/nti.dataserver/blob/f8b19159a6a9bdd41b7f7bf0edcc5cde09dbc092/src/nti/cabinet/filer.py#L52-L56

If the target path is unicode, it will be encoded using the default encoding when passed to open, which, unless we've messed with site.py, is going to be ASCII on Python 2. This leads to problems:

Module nti.app.contenttypes.presentation.exporter:272 in _do_export
>>  self._post_process_asset(lesson, ext_obj, filer, backup, salt)
(u'tag:nextthought.com,2011-10:NTI-NTILessonOverview-8647195755405914330_4744460999261044781_uniburma_4744466038042153447_0_uniburma_4744466038144278320_0', u'tag:nextthought.com,2011-10:NTI-NTILessonOverview-8647195755405914330_4744460999261044781_uniburma_4744466038042153447_0_uniburma_4744466038144278320_0')

Module nti.app.contenttypes.presentation.exporter:199 in _post_process_asset
>>  salt=salt)
(u'tag:nextthought.com,2011-10:NTI-NTICourseOverviewGroup-system_20200701081229_111204_153551565', u'tag:nextthought.com,2011-10:NTI-NTICourseOverviewGroup-system_20200701081229_111204_153551565')

Module nti.app.contenttypes.presentation.exporter:199 in _post_process_asset
>>  salt=salt)
(u'tag:nextthought.com,2011-10:NTI-NTIRelatedWorkRefPointer-system_20200701081304_124319_950500686', u'tag:nextthought.com,2011-10:NTI-NTIRelatedWorkRef-system_20200701081304_069734_1193179955')

Module nti.app.contenttypes.presentation.exporter:233 in _post_process_asset
>>  save_resources_to_filer(provided, concrete, filer, ext_obj)
Module nti.app.products.courseware.utils.exporter:99 in save_resources_to_filer
>>  internal = save_resource_to_filer(value, filer, True, obj)
Module nti.app.products.courseware.utils.exporter:78 in save_resource_to_filer
>>  contentType=contentType)
Module nti.cabinet.filer:118 in save
>>  transfer_to_native_file(source, out_file)
Module nti.cabinet.filer:53 in transfer_to_native_file
>>  with open(target, "wb") as fp:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 39: ordinal not in range(128)

On Python 2, the programmer is responsible for converting Unicode paths to something that can be represented by the filesystem. This should be done using sys.getfilesystemencoding() (if defined).

For forward compatibility with Python 3, this should be done with a function called fsencode imported from a compatibility module.

gevent has a tested cross-platform/cross-version implementation at https://github.com/gevent/gevent/blob/c698c5843ee5c659b418762e2415da5a7d53f134/src/gevent/_compat.py#L150-L174
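To make the shape of such a compatibility function concrete, here is a minimal sketch in the spirit of gevent's `fsencode` (the name `fsencode_sketch` and the exact fallback logic are illustrative, not gevent's actual code): pass bytes through untouched, and encode unicode with the filesystem encoding, falling back gracefully where `surrogateescape` is unavailable.

```python
import sys

def fsencode_sketch(path):
    # Hypothetical helper, illustrating the idea only.
    if isinstance(path, bytes):
        return path
    encoding = sys.getfilesystemencoding() or "utf-8"
    try:
        return path.encode(encoding, "surrogateescape")
    except LookupError:
        # Python 2 has no 'surrogateescape' error handler by default.
        return path.encode(encoding)

print(fsencode_sketch(u"caf\u00e9.txt"))
```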

cutz commented 4 years ago

Thanks for opening a detailed issue. Cross-linking to the less detailed Jira ticket I opened based on the error email: https://nextthought.atlassian.net/browse/NTI-9724

hagenrd commented 4 years ago

This seems straightforward enough. However, from reading this I would assume that attempting something similar in alpha (e.g. referencing a file from a lesson and attempting to export) would get me an error, which it does not. Calling sys.getfilesystemencoding() from within the python executable generated from the buildout on alpha also returned "UTF-8", which I didn't expect either. Unfortunately I can't reproduce this locally either, ostensibly since OSX uses "UTF-8" as its default encoding. Is there something I'm missing, or some way that might allow me to reproduce this either locally or in alpha? Is it expected that we would only see this in prod?

jamadden commented 4 years ago

What character string were you testing?

jamadden commented 4 years ago

The file system encoding on Linux, and POSIX in general, is UTF-8. That's because paths are opaque sequences of bytes; the only special byte values are "/" and the terminating NUL. So the FS can store any string, and by convention UTF-8 is used. But by default Python will encode Unicode using the ASCII encoding when it calls the byte-based open API. So the FS doesn't care; it's the layer above blowing up.
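That implicit conversion step is easy to reproduce in isolation (a sketch: this is the same encode that Python 2's `open` performs internally when handed a unicode path, using the same U+2019 character as the traceback above):

```python
# Reproduce the implicit conversion Python 2 performs before handing
# the path to the byte-based open(2) API: encode unicode with ASCII.
path = u"export-\u2019s-notes.txt"   # contains U+2019, as in the traceback
try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    print("fails just like the traceback:", exc.reason)
```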

jamadden commented 4 years ago

Calling sys.getfilesystemencoding() from within the python executable generated from the buildout on alpha also returned "UTF-8", which I didn't expect either.

Hmm. I overlooked that part. I know that macOS used "utf-8" by default (because that actually is the native encoding of HFS+ and APFS; Darwin exposes kernel APIs that accept and expect UTF8, unlike Linux), but I thought I knew that generic POSIX systems used the regular default encoding.

And they do, according to Python/bltinmodule.c:

/* The default encoding used by the platform file system APIs
   Can remain NULL for all platforms that don't have such a concept
*/
#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
const char *Py_FileSystemDefaultEncoding = "mbcs";
#elif defined(__APPLE__)
const char *Py_FileSystemDefaultEncoding = "utf-8";
#else
const char *Py_FileSystemDefaultEncoding = NULL; /* use default */
#endif 

I checked this on two linux systems; on one, I got "UTF-8", but on the other, I got "ANSI_X3.4-1968". WTF?

There's no direct way to change that from user code; it's not exposed. But according to the documentation, it can sometimes be changed (emphasis mine):

Setting encoding to NULL causes the default encoding to be used which is ASCII. The file system calls should use Py_FileSystemDefaultEncoding as the encoding for file names. This variable should be treated as read-only: on some systems, it will be a pointer to a static string, on others, it will change at run-time (such as when the application invokes setlocale).

Ahh, there's the clue. We're a webapp, it's likely we're setting a default locale at some point. But I wasn't doing that manually. So where actually is Py_FileSystemDefaultEncoding tinkered with?

Turns out the very act of starting the interpreter calls setlocale and will set the filesystem encoding based on that:

    /* On Unix, set the file system encoding according to the
       user's preference, if the CODESET names a well-known
       Python codec, and Py_FileSystemDefaultEncoding isn't
       initialized by other means. Also set the encoding of
       stdin and stdout if these are terminals, unless overridden.  */

    if (!overridden || !Py_FileSystemDefaultEncoding) {
...
        setlocale(LC_CTYPE, "");
        loc_codeset = nl_langinfo(CODESET);
   …
        /* Initialize Py_FileSystemDefaultEncoding from
           locale even if PYTHONIOENCODING is set. */
        if (!Py_FileSystemDefaultEncoding) {
            Py_FileSystemDefaultEncoding = loc_codeset;
            if (!overridden)
                free_codeset = 0;
        }
    }

What value should we expect loc_codeset to get after the line setlocale(LC_CTYPE, "");? The answer is that it does something different depending on the system. On one system, the manpage says:

Only three locales are defined by default: the empty string "" (which denotes the native environment)

Where "the native environment" is never defined. On another system, it says:

If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent. For glibc, first (regardless of category), the environment variable LC_ALL is inspected, next the environment variable with the same name as the category (see the table above), and finally the environment variable LANG. The first existing environment variable is used. If its value is not a valid locale specification, the locale is unchanged, and setlocale() returns NULL.

For the root of the problem then, it's likely that the environment variables are different between Alpha and the container environment experiencing this problem (or, the container environment doesn't have the locale specified by the variables installed, so it's not "a valid locale" and we're falling back to the C locale).

Note that Py_FileSystemDefaultEncoding is a public, exported symbol, so any extension module could change it at any time. That, plus the fact that it's highly system-dependent, suggests that any call to open should be prepared to handle this by doing its own encoding. (Ironically, the setlocale exposed to Python doesn't actually change the file system encoding on either Python 2 or 3.)
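That last point is easy to verify from Python (a demonstration, not project code; the same holds on Python 2 and 3):

```python
import locale
import sys

# sys.getfilesystemencoding() is fixed at interpreter startup; calling
# setlocale() at runtime does not update it.
before = sys.getfilesystemencoding()
locale.setlocale(locale.LC_CTYPE, "C")   # the "C" locale always exists
after = sys.getfilesystemencoding()
print(before, after, before == after)
```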

Sadly, that's a bit complicated in the general case:

I think the fsencode function (with fspath) from gevent._compat is a reasonably correct implementation of the desired behaviour. It won't be going away in any version of gevent that supports Python 2, so a reasonable thing to do is:

try:
    from os import fsencode  # Python 3
except ImportError:  # Python 2
    from gevent._compat import fsencode

…
# Typically unnecessary on Python 3, except for debugging or user presentation or persistence/transport
path = fsencode(path)
with open(path, …):
    …

There's possibly the basis of a blog post in here...

hagenrd commented 4 years ago

Thanks for the detailed analysis, that's extremely helpful and explains my confusion and inability to reproduce it. I'll work with Sean to see about getting the environments consistent and make the updates in our filer logic as well. I'll need to see about getting a test environment up to mimic what we're seeing in the container site.

hagenrd commented 3 years ago

I may have misunderstood how utilizing fsencode would help us. It sounds like (from surrogateescape description at https://docs.python.org/3/library/codecs.html) the surrogateescape error handler used when encoding the target filename to the fs encoding will simply handle unicode characters that are already surrogates, i.e. were read from a byte stream and were unable to be properly decoded. This doesn't seem like it would help in attempting to store unicode characters outside the ascii range if the fs encoding is ascii. In addition, when attempting to use that error handler in Python 2, I seem to be getting the error LookupError: unknown error handler name 'surrogateescape'. Looking at https://docs.python.org/2.7/library/codecs.html, I don't see it listed as a valid error handler. We could potentially use other error handlers specified there, but none would preserve the original character on a subsequent decode.
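The distinction described above can be sketched concretely (Python 3 behavior; a demonstration, not project code): surrogateescape round-trips *bytes that failed to decode*, but it does not rescue an ordinary non-ASCII character when encoding to a narrow filesystem encoding.

```python
# surrogateescape round-trips undecodable bytes as lone surrogates...
raw = b"caf\xe9"                                  # latin-1 bytes, invalid as UTF-8
name = raw.decode("utf-8", "surrogateescape")     # -> u'caf\udce9'
assert name.encode("utf-8", "surrogateescape") == raw  # lossless round trip

# ...but a plain non-ASCII character still fails under an ASCII encoding:
try:
    u"caf\u00e9".encode("ascii", "surrogateescape")
except UnicodeEncodeError:
    print("surrogateescape does not help with ordinary non-ASCII text")
```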

hagenrd commented 3 years ago

@jamadden does the above seem right to you? Or am I overlooking or misunderstanding something.

jamadden commented 3 years ago

It shouldn’t matter because UTF-8 can encode everything, but I think that handler may be a third-party package on Python 2.

hagenrd commented 3 years ago

Seems like it would matter if we wanted to handle non-UTF-8 file systems. There seems to be some desire to have our container environments, or at least some environments, running with a default "C" locale. Based on what you've found, that'll lead to an fs encoding of ASCII, and then we'll need to handle that, either by using character substitution (from our own code or from another error handler) or maybe by mapping internal to external filenames, at least in that case.

jamadden commented 3 years ago

No, bytes are bytes, and that's all the filesystem cares about. UTF8 encoding will work fine no matter what sys.getfilesystemencoding says or whether locale data files are installed or not.

hagenrd commented 3 years ago

fsencode uses sys.getfilesystemencoding to determine how to encode. To ensure utf8 encoding, we'd need to not use fsencode and either pass bytes to transfer_to_native_file or have it explicitly encode any non-bytes as utf8.
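A sketch of that explicit-UTF-8 approach (`to_native_path` is a hypothetical helper, not part of nti.cabinet.filer; assumes a POSIX filesystem, which accepts any byte sequence):

```python
import os
import tempfile

def to_native_path(target):
    # Hypothetical helper: always encode unicode targets as UTF-8 bytes,
    # deliberately ignoring sys.getfilesystemencoding().
    if not isinstance(target, bytes):
        target = target.encode("utf-8")
    return target

d = tempfile.mkdtemp()
path = to_native_path(os.path.join(d, u"caf\u00e9\u2019s-export.txt"))
with open(path, "wb") as fp:      # bytes path: no implicit encode step
    fp.write(b"data")
print(os.path.exists(path))
```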

jamadden commented 3 years ago

Yes. Sorry, I took your earlier message as ruling out fsencode for this use case and had jumped to direct encoding in my head.
