UDST / urbanaccess

A tool for GTFS transit and OSM pedestrian network accessibility analysis by UrbanSim
https://udst.github.io/urbanaccess/index.html
GNU Affero General Public License v3.0

Trying to read utf-8 file on cp1252 system #79

Closed Ar-Kan closed 3 years ago

Ar-Kan commented 3 years ago

Description of the bug

Reading a UTF-8 encoded file fails on a system whose default encoding is cp1252. The failure occurs in _txt_header_whitespace_check.

As far as I know, Python falls back to the system's default encoding when a file is opened without an explicit encoding, which leads to errors like this one unless the correct encoding is passed as a parameter. It would be nice if we could pass it from UrbanAccess itself.
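
A minimal sketch of the behavior described above, assuming Python 3: when open() is called without encoding=, Python picks a platform-dependent default (often cp1252 on Windows), and some UTF-8 byte sequences cannot be decoded with it. The sample text and the helper decodes_as are illustrative only and are not part of UrbanAccess.

```python
import locale

# Encoding used as the default for text files when encoding= is omitted
# (platform dependent: often cp1252 on Windows, UTF-8 on Linux/macOS).
print(locale.getpreferredencoding(False))

# "Á" is stored in UTF-8 as the two bytes 0xC3 0x81.
# cp1252 has no character assigned to byte 0x81, so decoding fails.
sample = "stop_name: Á".encode("utf-8")


def decodes_as(data, encoding):
    """Illustrative helper: return True if `data` decodes with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(sample, "utf-8"))
print(decodes_as(sample, "cp1252"))
```

The same bytes decode cleanly as UTF-8 but raise UnicodeDecodeError under cp1252, which matches the 'charmap' error in the traceback below.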

Environment

Paste the code that reproduces the issue here:

import urbanaccess as ua
loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(
    validation=True,
    bbox=bbox,
    remove_stops_outsidebbox=True,
    append_definitions=True
)

Paste the error message (if applicable):

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
c:\Users\arqui\Documents\Repositorios\urbanaccess-poa\urban.py in <module>
     27 append_definitions = True
     28 
---> 29 loaded_feeds = ua.gtfs.load.gtfsfeed_to_df(
     30     gtfsfeed_path=gtfsfeeds,
     31     validation=validation,

~\anaconda3\envs\geo\lib\site-packages\urbanaccess\gtfs\load.py in gtfsfeed_to_df(gtfsfeed_path, validation, verbose, bbox, remove_stops_outsidebbox, append_definitions)
    220                 'must be specified for validation.')
    221 
--> 222     _standardize_txt(csv_rootpath=gtfsfeed_path)
    223 
    224     folderlist = [foldername for foldername in os.listdir(gtfsfeed_path) if

~\anaconda3\envs\geo\lib\site-packages\urbanaccess\gtfs\load.py in _standardize_txt(csv_rootpath)
     35     if six.PY2:
     36         _txt_encoder_check(gtfsfiles_to_use, csv_rootpath)
---> 37     _txt_header_whitespace_check(gtfsfiles_to_use, csv_rootpath)
     38 
     39 

~\anaconda3\envs\geo\lib\site-packages\urbanaccess\gtfs\load.py in _txt_header_whitespace_check(gtfsfiles_to_use, csv_rootpath)
    127                 # Read from file
    128                 with open(os.path.join(csv_rootpath, folder, textfile)) as f:
--> 129                     lines = f.readlines()
    130                 lines[0] = re.sub(r'\s+', '', lines[0]) + '\n'
    131                 # Write to file

~\anaconda3\envs\geo\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1570: character maps to <undefined>

sablanchard commented 3 years ago

Thanks for the suggestion, @Ar-Kan! Can you also provide a link to the GTFS data you were trying to use when you encountered this? Once we have that information, we can look at how best to add a parameter for this.

Ar-Kan commented 3 years ago

Of course, @sablanchard. I thought it was irrelevant, I'm sorry. Here it is: https://dadosabertos.poa.br/dataset/gtfs

I observed that the Python function readlines() worked after I manually added the parameter encoding='utf-8' to the open() call that precedes it.
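
The manual workaround above can be sketched as follows. This is an assumption-laden illustration modeled on the lines shown in the traceback, not the actual UrbanAccess code or the eventual fix: the function name strip_header_whitespace and its encoding parameter are hypothetical.

```python
import os
import re
import tempfile


def strip_header_whitespace(path, encoding="utf-8"):
    """Hypothetical helper: strip all whitespace from the header row of a
    text file, reading and writing with an explicit encoding instead of
    the platform default."""
    # Read from file with an explicit encoding
    with open(path, encoding=encoding) as f:
        lines = f.readlines()
    lines[0] = re.sub(r"\s+", "", lines[0]) + "\n"
    # Write back with the same encoding
    with open(path, "w", encoding=encoding) as f:
        f.writelines(lines)
    return lines[0]


# Usage on a throwaway file that contains a non-ASCII stop name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stops.txt")
    with open(path, "wb") as f:
        f.write("stop_id, stop_name \n1,Á\n".encode("utf-8"))
    header = strip_header_whitespace(path)
    with open(path, encoding="utf-8") as f:
        content = f.read()
```

Passing the encoding explicitly in both open() calls keeps the result independent of the machine's locale settings.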

sablanchard commented 3 years ago

No worries! Thank you, @Ar-Kan. We will take a closer look at how best to expose this and will get back to you once we do.

sablanchard commented 3 years ago

Hi @Ar-Kan, we have addressed this issue in this PR: https://github.com/UDST/urbanaccess/pull/80

Can you confirm that this solves the issue for you?

Ar-Kan commented 3 years ago

Hi @sablanchard, I just checked and it is working properly. Thank you for the quick response.