OpenDataServices / cove

CoVE is a web application to Convert, Validate and Explore data following certain open data standards - including 360Giving, Open Contracting Data Standard, IATI and the Beneficial Ownership Data Standard.
http://cove.opendataservices.coop

UnicodeDecodeError 360Giving File #220

Closed ekoner closed 8 years ago

ekoner commented 8 years ago

CoVE throws a UnicodeDecodeError ('charmap' codec, 'character maps to <undefined>'): http://dev.cove.opendataservices.coop/360/data/2a84415d-eb17-42e5-a083-e36eefc02d32

Original file (85MB): https://drive.google.com/drive/folders/0ByPHIGlaBXq1U1lJTFgtSHpkbmc

Bjwebb commented 8 years ago

Running this locally I get this traceback:

Traceback (most recent call last):
  File "/home/bjwebb/opendataservices/cove/cove/views.py", line 384, in convert_spreadsheet
    encoding=encoding
  File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/__init__.py", line 125, in unflatten
    base[main_sheet_name] = list(spreadsheet_input.unflatten())
  File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 115, in unflatten
    for line in self.get_main_sheet_lines():
  File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 30, in convert_dict_titles
    for d in dicts:
  File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 204, in get_sheet_lines
    for line in dictreader:
  File "/usr/lib/python3.4/csv.py", line 110, in __next__
    row = next(self.reader)
  File "/home/bjwebb/opendataservices/cove/.ve/lib/python3.4/encodings/cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7906: character maps to <undefined>

Looks to me like CoVE is identifying the file as Windows-1252, but then it's tripping over the byte 0x8F, which is undefined in this encoding: https://en.wikipedia.org/wiki/Windows-1252

$ cat BigLottery360GivingData\ \(1\).csv | xxd | grep -P ' 8f|8f ' | head -n 10
00f05ee0: 6372 8f63 6865 2066 6163 696c 6974 7920  cr.che facility 
00f05ef0: 616e 6420 656d 706c 6f79 2061 2063 728f  and employ a cr.
00f05f90: 6865 2063 728f 6368 6520 616e 6420 6861  he cr.che and ha
00f05fd0: 6420 6f6e 2d63 6f73 7420 6f66 2063 728f  d on-cost of cr.
00f16300: 4f33 3930 3335 2c46 8f69 7320 476c 6561  O39035,F.is Glea
00f23230: 8f63 6865 2066 6163 696c 6974 6965 7320  .che facilities 
00f23310: 6c20 6869 7265 2061 6e64 2063 728f 6368  l hire and cr.ch
00f367d0: 2c20 6372 8f63 6865 2061 6e64 2074 6173  , cr.che and tas
00f8ca10: 6972 652c 2063 728f 6368 6520 7374 6166  ire, cr.che staf
00fa5030: 6f6d 2068 6972 652c 2063 728f 6368 6520  om hire, cr.che 
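
The failure can be reproduced on a few bytes taken from the hex dump above, without the 85MB file. The sketch below is illustrative only: `try_decode` is a hypothetical helper, not part of CoVE or flattentool. The `mac_roman` line is just an observation, not something established in this thread: that codec happens to map 0x8F to "è", which would make the word "crèche", but the file's real encoding is unconfirmed.

```python
# Bytes copied from the hex dump: "cr" + 0x8f + "che".
sample = b"cr\x8fche"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper for illustration; not part of CoVE or flattentool.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x8f is undefined in Python's cp1252 codec, so strict decoding fails.
as_cp1252 = try_decode(sample, "cp1252")

# 0x8f on its own is also not valid UTF-8 (it is a bare continuation byte).
as_utf8 = try_decode(sample, "utf-8")

# Assumption, for comparison only: Mac OS Roman maps 0x8f to "è".
as_mac_roman = try_decode(sample, "mac_roman")

print(as_cp1252, as_utf8, as_mac_roman)
```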

I would suggest the next step is to ask the publisher to fix the file so that it's either valid Windows-1252 or UTF-8. Alternatively, we could try to be more permissive in our decoding of Windows-1252.
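
As a rough sketch of what "more permissive" could mean, one option is to decode with `errors="replace"`, so undefined bytes become U+FFFD instead of raising. This is not how flattentool currently reads files: `read_rows_permissive` is a hypothetical name, and the sample header row is made up for illustration. The trade-off is that the original character is lost silently, so a warning to the user would probably be needed.

```python
import csv
import io


def read_rows_permissive(raw, encoding="cp1252"):
    """Parse CSV bytes, substituting U+FFFD for bytes the codec cannot decode.

    Hypothetical helper for illustration; not flattentool's API.
    """
    text = io.TextIOWrapper(
        io.BytesIO(raw),
        encoding=encoding,
        errors="replace",  # undefined bytes such as 0x8f become U+FFFD
        newline="",        # let the csv module handle line endings
    )
    return list(csv.DictReader(text))


# Made-up two-column sample containing the problem byte from this issue.
raw = b"Id,Description\r\n1,cr\x8fche facility\r\n"
rows = read_rows_permissive(raw)
print(rows[0]["Description"])
```

Counting the U+FFFD characters in the result would give a simple way to tell the user how many bytes could not be decoded.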