Closed by ekoner 8 years ago
Running this locally I get this traceback:
Traceback (most recent call last):
File "/home/bjwebb/opendataservices/cove/cove/views.py", line 384, in convert_spreadsheet
encoding=encoding
File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/__init__.py", line 125, in unflatten
base[main_sheet_name] = list(spreadsheet_input.unflatten())
File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 115, in unflatten
for line in self.get_main_sheet_lines():
File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 30, in convert_dict_titles
for d in dicts:
File "/home/bjwebb/opendataservices/cove/.ve/src/flattentool/flattentool/input.py", line 204, in get_sheet_lines
for line in dictreader:
File "/usr/lib/python3.4/csv.py", line 110, in __next__
row = next(self.reader)
File "/home/bjwebb/opendataservices/cove/.ve/lib/python3.4/encodings/cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7906: character maps to <undefined>
It looks to me like CoVE is identifying the file as Windows-1252, but then tripping over the byte 0x8F, which is not defined in that encoding: https://en.wikipedia.org/wiki/Windows-1252
$ cat BigLottery360GivingData\ \(1\).csv | xxd | grep -P ' 8f|8f ' | head -n 10
00f05ee0: 6372 8f63 6865 2066 6163 696c 6974 7920 cr.che facility
00f05ef0: 616e 6420 656d 706c 6f79 2061 2063 728f and employ a cr.
00f05f90: 6865 2063 728f 6368 6520 616e 6420 6861 he cr.che and ha
00f05fd0: 6420 6f6e 2d63 6f73 7420 6f66 2063 728f d on-cost of cr.
00f16300: 4f33 3930 3335 2c46 8f69 7320 476c 6561 O39035,F.is Glea
00f23230: 8f63 6865 2066 6163 696c 6974 6965 7320 .che facilities
00f23310: 6c20 6869 7265 2061 6e64 2063 728f 6368 l hire and cr.ch
00f367d0: 2c20 6372 8f63 6865 2061 6e64 2074 6173 , cr.che and tas
00f8ca10: 6972 652c 2063 728f 6368 6520 7374 6166 ire, cr.che staf
00fa5030: 6f6d 2068 6972 652c 2063 728f 6368 6520 om hire, cr.che
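The same check can be done from Python, which is handy when `xxd` and `grep -P` are not available. This is only a sketch: `undefined_cp1252_bytes` and `find_undefined` are hypothetical helper names, not part of CoVE or flattentool, and it relies on Python's own `cp1252` codec table, which is the one the traceback is using.

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Computed once; cp1252 is a single-byte encoding, so a per-byte check is enough.
UNDEFINED = frozenset(undefined_cp1252_bytes())


def find_undefined(data, limit=10):
    """Return up to `limit` (offset, byte) pairs for bytes undefined in cp1252."""
    hits = []
    for offset, value in enumerate(data):
        if value in UNDEFINED:
            hits.append((offset, value))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__":
    sample = b"cr\x8fche facility"  # bytes taken from the hexdump above
    print([hex(b) for b in undefined_cp1252_bytes()])
    print(find_undefined(sample))
```

For the real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `find_undefined`.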
I would suggest the next step is to ask the publisher to fix the file so that it's either valid Windows-1252 or UTF-8. Alternatively, we could try to be more permissive in our decoding of Windows-1252.
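As a rough illustration of what "more permissive" could mean, here is a minimal sketch that decodes with `errors="replace"`, so undefined bytes become U+FFFD instead of raising. `read_rows_permissively` is a hypothetical helper, not an existing CoVE or flattentool function, and it assumes the whole file fits in memory.

```python
import csv
import io


def read_rows_permissively(raw, encoding="cp1252"):
    """Decode raw CSV bytes, replacing undecodable bytes with U+FFFD."""
    # errors="replace" swaps each undefined byte for the replacement character
    # rather than raising UnicodeDecodeError.
    text = raw.decode(encoding, errors="replace")
    # newline="" lets the csv module handle line endings itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    sample = b"Id,Description\r\nO39035,cr\x8fche facility\r\n"
    for row in read_rows_permissively(sample):
        print(row)
```

The trade-off is that the original character is lost: the data converts, but "cr\ufffdche" is still wrong. It may also be worth checking with the publisher how the file was exported, since the byte 0x8F decodes to "è" under Python's `mac_roman` codec, which would give "crèche" here. That is only a guess from the hexdump, not something confirmed.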
CoVE throws a UnicodeDecodeError ('charmap' codec, 'character maps to <undefined>'): http://dev.cove.opendataservices.coop/360/data/2a84415d-eb17-42e5-a083-e36eefc02d32
Original file (85MB): https://drive.google.com/drive/folders/0ByPHIGlaBXq1U1lJTFgtSHpkbmc