benhoyt / scandir

Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib
https://benhoyt.com/writings/scandir/
BSD 3-Clause "New" or "Revised" License

On Ubuntu 16.04 LTS, scandir behaviour changes when called via subprocess.Popen #99

Closed relevitt closed 3 years ago

relevitt commented 6 years ago

I don't know if this would be considered a scandir bug, but it resulted in a bug in my application, which took a long time to figure out.

After migrating my app from Python 2 to Python 3 (version 3.5.2), I was getting errors when trying to save paths returned by scandir into an SQLite database. The errors were caused by scandir returning paths with surrogate escape codes in them.
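For anyone hitting the same thing, here is a minimal sketch of how such escapes arise and why the resulting string can't be stored as ordinary UTF-8 text. It assumes the filesystem encoding has fallen back to ASCII (which is what Python 3.5 uses under the C locale); the filename and the `recover_bytes` helper are made up for illustration and are not part of scandir.

```python
# "café" as the UTF-8 bytes a Linux filesystem would typically hold.
raw_name = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# With an ASCII filesystem encoding, Python decodes filenames using the
# surrogateescape error handler, so each non-ASCII byte becomes a lone surrogate.
name = raw_name.decode("ascii", errors="surrogateescape")


def has_surrogate_escapes(text):
    """Return True if text contains lone surrogates in the U+DC80..U+DCFF range."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def recover_bytes(text):
    """Hypothetical helper: undo the escaping to get the original filename bytes."""
    return text.encode("ascii", errors="surrogateescape")


try:
    name.encode("utf-8")  # strict encoding, as needed to store the text as UTF-8
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The escaped string round-trips back to the original bytes, so one possible workaround is to store `recover_bytes(name)` as a BLOB, or to decode those bytes as UTF-8 when you know that is what the filesystem holds.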

Having read up on surrogate escape codes, I was still baffled why there hadn't been some equivalent manifestation of the problem when I'd been using Python 2. I started running scandir tests on the problem directories from a Python 3 interpreter started in a shell. However, when I ran scandir this way, no surrogate escape codes were used.

I was even more baffled by the difference in behaviour between scandir running in my application and scandir running under Python in a shell.

Eventually, I realised the problem was that, in my application, scandir runs in a subprocess (it scans 1000s of directories, so a subprocess is used to keep the main app from blocking). That subprocess is launched with subprocess.Popen and a minimal env argument.

Once I ran subprocess.Popen with env=os.environ (so the subprocess has the same environment as the parent process, which is launched from a shell), scandir stopped using surrogate escape codes. I don't know which of the many variables in os.environ made the difference.
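A rough way to see the effect is to ask a child interpreter for its filesystem encoding under different environments. This is only a sketch: the `LOCALE_VARS` tuple and both helper functions are my own names, and stripping the locale variables from a copy of `os.environ` stands in for the minimal env used in the app.

```python
import os
import subprocess
import sys

# Variables that commonly influence the locale (an assumption, not an exhaustive list).
LOCALE_VARS = ("LANG", "LANGUAGE")

# Code run in the child: report the encoding used to decode filenames.
CHILD_CODE = "import sys; print(sys.getfilesystemencoding())"


def env_without_locale(env):
    """Return a copy of env with LANG, LANGUAGE and all LC_* variables removed."""
    return {
        key: value
        for key, value in env.items()
        if key not in LOCALE_VARS and not key.startswith("LC_")
    }


def child_fs_encoding(env):
    """Start a child Python with the given environment and return its filesystem encoding."""
    output = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env)
    return output.decode("ascii").strip()


if __name__ == "__main__":
    full_env = dict(os.environ)
    print("full env:      ", child_fs_encoding(full_env))
    print("without locale:", child_fs_encoding(env_without_locale(full_env)))
```

On Python 3.5 I would expect the second call to report an ASCII encoding when the first reports UTF-8. Newer Python versions (3.7 and later) coerce the C locale to UTF-8, so the two may well match there.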

avylove commented 6 years ago

Can you compare your LANG/LC_* environment variables between the parent process and a subprocess that isn't given env=os.environ? My guess is that the parent is running with a Unicode (UTF-8) locale and the child isn't, so the child is using surrogate escape codes for anything outside of standard ASCII.
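Something like the following could do that comparison. It is a sketch with made-up helper names; `child_env` stands for whatever dict the app actually passes to Popen, which isn't shown in this thread.

```python
import os


def locale_vars(env):
    """Return only the locale-related variables (LANG, LANGUAGE, LC_*) from env."""
    return {
        key: value
        for key, value in env.items()
        if key in ("LANG", "LANGUAGE") or key.startswith("LC_")
    }


def diff_locale_vars(parent_env, child_env):
    """Map each differing locale variable to a (parent_value, child_value) pair.

    A value of None means the variable is not set in that environment.
    """
    parent = locale_vars(parent_env)
    child = locale_vars(child_env)
    return {
        name: (parent.get(name), child.get(name))
        for name in sorted(set(parent) | set(child))
        if parent.get(name) != child.get(name)
    }


if __name__ == "__main__":
    # Hypothetical minimal environment, standing in for the one passed to Popen.
    child_env = {"PATH": os.environ.get("PATH", "")}
    for name, (parent_value, child_value) in diff_locale_vars(os.environ, child_env).items():
        print(name, "parent:", parent_value, "child:", child_value)
```

Any variable printed here is set differently (or not at all) in the child and is a candidate for the one that changes the encoding.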

benhoyt commented 5 years ago

@relevitt Did you ever figure out the root cause here?

benhoyt commented 3 years ago

Going to close this issue due to lack of a response. Feel free to re-open if there's still a reproducible issue.