I started making this Python 3 compatible since @lcombs was on a 3.x Anaconda distribution. Right now the changes are:
print statements->print function lolol
File opening works quite differently in the Python 3.x series. My first pass at this PR changed every open() call to use 'r' or 'w' mode instead of 'rb' or 'wb'. In Python 3.x, 'rb' and 'wb' explicitly open the file in binary mode, so reads return bytes, while 'r' (implicitly 'rt') and 'w' (implicitly 'wt') open it in text mode, which does the str conversion for you on read. In the Python 2.x series, the open() function instantiated a file object whose read method returned a str in either mode (and a 2.x str is a byte string, so nothing was actually decoded). So the text-mode behavior is more similar to what we used to do. If we want, we can read the files as bytes and decode them ourselves, but given that the Python 3.x default encoding is usually UTF-8 (it is taken from the locale when no encoding is passed), I don't think that's necessary.

So instead of that first pass, I have now converted all of the open() calls to explicitly use io.open() (which is what the built-in open() is in the Python 3.x series) so that both versions use the same interface. I explicitly load as binary or text at each call; most importantly, in the generate_folds script I load as binary and force a down-conversion to ASCII encoding, as we would normally do later during the clean_corpus steps.
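To make the io.open() approach concrete, here is a minimal sketch of loading a file as binary and down-converting it to ASCII. It is not the actual code from generate_folds: the helper names `read_as_ascii` and `write_text`, the `source_encoding` parameter, and the assumption that the input bytes are UTF-8 are all hypothetical and only there for illustration.

```python
import io
import os
import tempfile


def read_as_ascii(path, source_encoding="utf-8"):
    """Read a file as raw bytes and return ASCII-only text.

    Hypothetical helper: assumes the bytes are in `source_encoding`
    and silently drops anything that cannot be represented in ASCII.
    """
    # Binary mode: io.open() returns bytes on both Python 2 and Python 3.
    with io.open(path, "rb") as handle:
        raw = handle.read()
    # Decode explicitly instead of relying on the platform default encoding.
    text = raw.decode(source_encoding, "ignore")
    # Force the down-conversion to ASCII, discarding non-ASCII characters.
    return text.encode("ascii", "ignore").decode("ascii")


def write_text(path, text, encoding="utf-8"):
    """Write unicode text with an explicit encoding (hypothetical helper)."""
    # Text mode: io.open() expects unicode text on both Python 2 and 3.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Small demonstration on a temporary file.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
write_text(sample_path, u"caf\u00e9 au lait\n")
cleaned = read_as_ascii(sample_path)
print(cleaned)
```

Passing the mode and encoding explicitly at each call keeps the behavior the same on 2.x and 3.x, at the cost of having to use unicode strings for any text-mode writes on 2.x.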
Surprisingly, since the 'rb' and 'wb' modes in the Python 2.x series probably didn't do anything useful for us (we called the .read() method anyway), this may still be 2.x compatible.