Turtle Files with a UTF-8 BOM fail to parse

Letractively / rdflib

Automatically exported from code.google.com/p/rdflib

Turtle Files with a UTF-8 BOM fail to parse #156

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Try to parse a Turtle file which has a UTF-8 BOM at the start

What is the expected output? What do you see instead?

The file should parse without issue, but parsing fails instead. I am reporting 
this on behalf of one of my users, so I have not reproduced it myself.

A UTF-8 BOM is not required, but it is permitted, so the parser is rejecting 
perfectly valid files.

This issue will likely affect all UTF-8 formats (Notation 3, TriG, SPARQL Query, 
etc.).
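
For anyone who wants to see the likely failure mode without the attachment, here is a minimal sketch. It is written for Python 3, which postdates this thread, and the sample Turtle content and variable names are made up for illustration. It shows that a plain UTF-8 decode keeps the BOM as a leading U+FEFF character, which is presumably what trips the parser, while Python's `utf-8-sig` codec drops it.

```python
import codecs

# Hypothetical Turtle content; any valid document would do.
turtle = '<http://example.org/s> <http://example.org/p> "o" .\n'

# What a BOM-prefixed file looks like on disk: EF BB BF, then the text.
data = codecs.BOM_UTF8 + turtle.encode('utf-8')

plain = data.decode('utf-8')      # the BOM survives as U+FEFF
sig = data.decode('utf-8-sig')    # the codec strips the BOM

print(repr(plain[0]))
print(sig == turtle)
```

Whether the parser should strip the character itself or rely on a BOM-aware codec is a design choice; the sketch only illustrates the difference between the two decodes.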

Original issue reported on code.google.com by rve...@gmail.com on 4 Jan 2011 at 1:31

Attachments:

ttl-with-bom.ttl

GoogleCodeExporter commented 9 years ago


Confirmed this bug, and have a fix:

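In plugins/parsers/notation3.py, line 910 and following:

    if not isinstance(octets, unicode):
        str = octets.decode('utf-8')
        # NB: already decoded, so the BOM is now the single character \ufeff.
        # startswith() also copes with empty input, where str[0] would raise.
        if str.startswith(codecs.BOM_UTF8.decode('utf-8')):
            str = str[1:]
    else:
        str = octets

Also add `import codecs` at the top of the module.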

After this change, the test file submitted by rvesse above works:

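    from StringIO import StringIO
    from rdflib import ConjunctiveGraph

    # Read the raw bytes, BOM included, and parse them as N3/Turtle.
    with open('ttl-with-bom.ttl', 'rb') as f:
        bom = StringIO(f.read())
    bomg = ConjunctiveGraph()
    bomg.parse(bom, format='n3')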
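
The decoding logic of the patch can also be tried in isolation. The following is only a sketch for Python 3, where `bytes` and `str` replace the `str` and `unicode` types of the original code; `decode_stripping_bom` and `BOM_CHAR` are hypothetical names chosen for illustration and are not part of rdflib.

```python
import codecs

# The BOM as it appears after decoding: the single character U+FEFF.
BOM_CHAR = codecs.BOM_UTF8.decode('utf-8')


def decode_stripping_bom(octets):
    """Return text for bytes or str input, without a leading BOM on bytes.

    Hypothetical helper for illustration; not part of rdflib.
    """
    if isinstance(octets, bytes):
        text = octets.decode('utf-8')
        # startswith() is safe on empty input, unlike indexing text[0].
        if text.startswith(BOM_CHAR):
            text = text[1:]
    else:
        # Already text: passed through unchanged, mirroring the else branch
        # of the patch.
        text = octets
    return text


print(decode_stripping_bom(codecs.BOM_UTF8 + b'@prefix ex: <http://example.org/> .'))
```

Like the patch, this strips the BOM only when the input arrives as bytes; text that already starts with U+FEFF is left untouched.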

Original comment by azarot...@gmail.com on 5 Jan 2011 at 4:10

GoogleCodeExporter commented 9 years ago

Original comment by ed.summers on 5 Jan 2011 at 5:27

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r1900.

Original comment by ed.summers on 5 Jan 2011 at 5:49

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

This issue was closed by revision a83de1008e93.

Original comment by ed.summers on 30 Mar 2011 at 9:07