to_bytes() or not to_bytes()

dineshbvadhia commented 9 years ago

import codecs
import os

from kitchen.text.converters import to_bytes, to_unicode

fout = codecs.open(os.path.join('data'), "w", "utf-8")

with codecs.open(fin, 'r') as f:
    for line in f:
        line = to_unicode(line, 'utf-8').strip()
        line = to_bytes(line)
        fout.write(line + "\n")

This occasionally raises a UnicodeDecodeError.

with codecs.open(fin, 'r') as f:
    for line in f:
        line = to_unicode(line, 'utf-8').strip()
        fout.write(line + "\n")

No errors without the to_bytes().

Not sure if this is a problem or not.

ralphbean commented 9 years ago

Can you share the input file?

dineshbvadhia commented 9 years ago

I'm using the TRC2 dataset (http://trec.nist.gov/data/reuters/reuters.html) and the agreement doesn't allow me to post parts of the data.

I've found many instances of the to_bytes() problem in this dataset, which makes me suspect that it mixes encodings, possibly even from one line to the next. It could be a good dataset to test kitchen against.

dineshbvadhia commented 9 years ago

This is the code that produces the traceback:

fout = codecs.open(os.path.join('data'), "w", "utf-8")

content = " ".join(" ".join(tokens[2:]).strip('"').split())
print(type(content))
print(type(to_bytes(content)))

fout.write(to_bytes(dataid + " " + item_date + " " + item_time + " " + heading + " " + content + "\n"))

Traceback (most recent call last):
  ...
  File "x_trc2.py", line 188, in start_part
    fout.write(to_bytes(dataid + " " + item_date + " " + item_time + " " + heading + " " + content + "\n"))
  File "C:\Anaconda\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Anaconda\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 692: ordinal not in range(128)

Calling to_bytes(content) on its own works, but writing to_bytes(content + other strings) to the file does not.

abadger commented 6 years ago

My apologies that no one was able to give you answers when you first posted this.

tl;dr: The code you posted opens the file with codecs.open() but passes bytes to the filehandle's write() method. That is incorrect usage of the codecs.open() API. Pass in the text type (Python 2's unicode type or Python 3's str type) and this code should work.

The filesystem stores bytes, so any time Python accesses the filesystem it has to read or write bytes. codecs.open() (and also the Python 3 open() call) essentially adds a layer on top of this that translates between bytes at the filesystem end and text at the caller's end. So you are supposed to feed the text type, not bytes, into a file that was opened with codecs.open(). This revised code should work:

fout = codecs.open(os.path.join('data'), "w", "utf-8")

content = " ".join(" ".join(tokens[2:]).strip('"').split())
print(type(content))
print(type(to_text(dataid + " " + item_date + " " + item_time + " " + heading + " " + content + "\n")))

fout.write(to_text(dataid + " " + item_date + " " + item_time + " " + heading + " " + content + "\n"))

Note that in this case, if content was known to always contain text type, then the call to to_text() would be superfluous.
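
To make the failure mode concrete, here is a minimal sketch that uses only the standard library. The scratch file and the sample string are made up for illustration and are not taken from the TRC2 data. On Python 2, a byte string handed to the writer returned by codecs.open() is first decoded with the default ascii codec before being re-encoded, which is where the UnicodeDecodeError comes from. The explicit decode("ascii") call below reproduces that step on either Python version.

```python
import codecs
import os
import tempfile

text = u"caf\u00e9\n"          # text type (unicode on Python 2, str on Python 3)
data = text.encode("utf-8")    # the same content as UTF-8 encoded bytes

# The implicit step Python 2 performs when bytes reach a codecs writer:
# non-ASCII bytes cannot be decoded with the ascii codec.
try:
    data.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Correct usage: hand the codecs writer text and let it do the encoding.
path = os.path.join(tempfile.mkdtemp(), "data")  # hypothetical scratch file
with codecs.open(path, "w", "utf-8") as fout:
    fout.write(text)

# Read the file back in binary mode to see what actually hit the disk.
with open(path, "rb") as f:
    raw = f.read()

print(decode_failed)
print(raw == data)
```

Both print calls should report True: the ascii decode fails, and the bytes on disk match the UTF-8 encoding of the text that was written.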

Instead of calling codecs.open(), you can also use kitchen's getwriter() (https://pythonhosted.org/kitchen/api-text-converters.html#kitchen.text.converters.getwriter) to wrap a filehandle with an encoder that accepts both text and bytes and will not traceback on data that is invalid in the encoding:


from kitchen.text.converters import getwriter

# wb so this works on both Python2 and Python3
fout = open('/var/tmp/data', "wb")

# getwriter() takes the name of the encoding, not the filehandle
UTF8Writer = getwriter('utf-8')
fout_writer = UTF8Writer(fout)

content = u'café\n'
print(type(content))
print(type(to_bytes(content)))
print(type(to_text(content)))

fout_writer.write(to_text(content))
fout_writer.write(to_bytes(content))
fout_writer.write(b'\xff\n')

# cat /var/tmp/data:
# café
# café
# �
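
For readers without kitchen installed, the idea of a writer that tolerates both types can be approximated with the standard library. This is only a rough sketch: TolerantWriter is a hypothetical name, not part of kitchen, and kitchen's actual implementation may well differ. It assumes the wrapped stream is opened in binary mode.

```python
import io


class TolerantWriter(object):
    """Hypothetical helper (not kitchen's code): accepts text or bytes."""

    def __init__(self, stream, encoding="utf-8"):
        self.stream = stream        # a binary file-like object
        self.encoding = encoding

    def write(self, obj):
        if isinstance(obj, bytes):
            # Bytes that are invalid in the encoding become U+FFFD
            # instead of raising UnicodeDecodeError.
            obj = obj.decode(self.encoding, "replace")
        self.stream.write(obj.encode(self.encoding, "replace"))


buf = io.BytesIO()                  # stands in for a file opened with "wb"
writer = TolerantWriter(buf)
writer.write(u"caf\u00e9\n")                  # text
writer.write(u"caf\u00e9\n".encode("utf-8"))  # valid UTF-8 bytes
writer.write(b"\xff\n")                       # invalid UTF-8 bytes
result = buf.getvalue()
print(repr(result))
```

The first two writes should produce identical bytes, and the invalid byte in the third is replaced by the UTF-8 encoding of U+FFFD rather than raising.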

fedora-infra / kitchen

to_bytes() or not to_bytes() #9