PEtab-dev / libpetab-python

Python package for working with PEtab files
https://libpetab-python.readthedocs.io
MIT License

encoding issue while writing out experiment #77

Closed fbergmann closed 2 years ago

fbergmann commented 2 years ago

If, after reading the Sneyd_PNAS2002 model from the benchmark collection, I write it out again using the petab library, I get the following exception:

self = <encodings.cp1252.IncrementalEncoder object at 0x000001DBE8E097B8>
input = 'Ca_dose_response__1\t10 μM IP_3,  0.1 μM Ca^{2+}\t10.0\t0.1\r\r\n'
final = False

    def encode(self, input, final=False):
>       return codecs.charmap_encode(input,self.errors,encoding_table)[0]
E       UnicodeEncodeError: 'charmap' codec can't encode character '\u03bc' in position 23: character maps to <undefined>

cp1252.py:19: UnicodeEncodeError

It turns out the unit contains special characters. Since PEtab had no problem reading it, maybe the writers should pass a UTF-8 encoding attribute when writing out the tables, rather than just:

conditions.py:55: in write_condition_df
    df.to_csv(fh, sep='\t', index=True)
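A minimal sketch of the suggested fix (assuming the writer keeps the file-handle approach; the table content below is a made-up stand-in for the Sneyd condition table): open the handle with an explicit encoding so the platform default never applies:

```python
import pandas as pd

# Hypothetical condition table containing a Greek mu (U+03BC), which
# Windows' default cp1252 codec cannot encode.
df = pd.DataFrame(
    {"conditionName": ["10 \u03bcM IP_3"]},
    index=pd.Index(["Ca_dose_response__1"], name="conditionId"),
)

# Opening the handle with an explicit encoding (and newline='' to avoid
# the doubled '\r\r\n' line endings seen in the traceback) makes the
# write independent of the platform's locale default.
with open("conditions.tsv", "w", encoding="utf-8", newline="") as fh:
    df.to_csv(fh, sep="\t", index=True)
```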
dweindl commented 2 years ago

Thanks for reporting, Frank. Will fix that.

dweindl commented 2 years ago

Hm, seems to work on my machine. According to the pandas docs, the default for writing is already utf-8 (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). I do get the same error if I explicitly request cp1252. Is the problem solved for you with df.to_csv(..., encoding="utf-8")? Maybe the pandas documentation is wrong there...

fbergmann commented 2 years ago

The issue is that different operating systems use different default encodings, and there is a certain OS in wide use that defaults to cp1252. I'll try running with a different encoding to see what happens.

fbergmann commented 2 years ago

Actually, just adding the encoding did not work ... it turns out the encoding argument is only honored if we specify a filename or write in binary mode, so:

    df.to_csv(filename, sep='\t', index=True, encoding="utf-8")

works

with open(filename, 'w') as fh:
    df.to_csv(fh, sep='\t', index=True, encoding="utf-8")

does not. But I don't see why we need to manually open the file rather than just letting to_csv do the work.
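One way to see why the argument has no effect here (a sketch, not the library's code): a text-mode handle carries its own codec, and to_csv merely writes str objects through it, so the handle's encoding (the platform default for a plain open()) decides the bytes. Simulating the Windows default with an explicit cp1252 wrapper reproduces the failure:

```python
import io
import pandas as pd

df = pd.DataFrame({"unit": ["10 \u03bcM"]})  # Greek mu, U+03BC

# Wrap an in-memory buffer in a cp1252 text handle, mimicking what a
# plain open(filename, 'w') produces on a cp1252-locale Windows box.
handle = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
raised = False
try:
    df.to_csv(handle, sep="\t", index=False)
    handle.flush()
except UnicodeEncodeError:
    raised = True  # same charmap failure as in the original traceback
```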

dweindl commented 2 years ago

> the issue is that different operating systems use different default encodings. and there is a certain os that is widely used that defaults to cp1252. I'll try running with the different encoding to see what happens.

Sure, we should fix that for Windows. I'm just not sure what exactly the problem is. At least for the most recent pandas, the code seems to agree with the docs in that the default encoding is utf-8, also on Windows. (Which pandas version did you use?) I can't really test Windows things here, so we'll go with whatever you confirm solves the problem.

fbergmann commented 2 years ago

I think the problem is that we manually open the file in 'w' mode, and the encoding argument (which defaults to utf-8) is only applied for binary output. So this would work too:

    df.to_csv(filename, sep='\t', index=True)

as would:

with open(filename, 'wb') as fh:
    df.to_csv(fh, sep='\t', index=True)
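The binary-handle variant can be checked without touching the filesystem; a minimal sketch (assuming pandas >= 1.2, which accepts binary buffers in to_csv — with a binary handle, pandas performs the encoding itself, so the encoding argument is honored):

```python
import io
import pandas as pd

df = pd.DataFrame({"unit": ["10 \u03bcM"]})  # Greek mu, U+03BC

# Binary buffer: pandas encodes the output itself, so encoding="utf-8"
# takes effect regardless of the platform's locale default.
buf = io.BytesIO()
df.to_csv(buf, sep="\t", index=False, encoding="utf-8")
data = buf.getvalue()
```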