EnzymeML / PyEnzyme

🧬 - Data management and modeling framework based on EnzymeML.
BSD 2-Clause "Simplified" License

writing documents fails because of encoding errors when writing the log file #32

Closed fbergmann closed 2 years ago

fbergmann commented 2 years ago

I currently have the issue that the code does not allow writing out documents because writing the history failed:

enzymemlwriter.py:179, in EnzymeMLWriter._createArchive(self, enzmldoc, listofPaths, name)
    177 history_path = f"{self.path}/history.log"
    178 with open(history_path, "w") as f:
--> 179     f.write(enzmldoc.log.getvalue())
    181 self.addFileToArchive(
    182     archive=archive,
    183     file_path=history_path,
   (...)
    186     description="History of the EnzymeML document",
    187 )
    189 # add metadata to the experiment file

File cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\udcb7' in position 4532: character maps to <undefined>

It is important that the encoding is defined explicitly when writing the history. If you don't, different operating systems will use different default encodings.

fbergmann commented 2 years ago

Actually, after trying out different encodings, only to have them all fail, I think it would be best to just write binary (using open mode "wb"). That way, no matter what was written into the StringIO buffer, you can write it out.

JR-1991 commented 2 years ago

Could you provide an example to reproduce the error? On which OS did it happen? Want to make sure it's not an error from somewhere else.

fbergmann commented 2 years ago

The file I sent earlier has the issue. The problem arose because, when I created that file by running Cephalexin_Synthesis_Model4.ipynb, the history.log was written out in the ANSI code page: the EnzymeML writer did not specify an encoding, so it used the native Windows encoding in this case.

If that file is then read again by the EnzymeML reader and subsequently written out by the EnzymeML writer, the issue occurs.
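The failure can be reproduced in isolation. The sketch below assumes one plausible origin of '\udcb7': a 0xB7 byte (the middle dot in cp1252) being decoded as UTF-8 with the `surrogateescape` error handler, which maps undecodable bytes to lone surrogates. Whether this is exactly what happens inside PyEnzyme is an assumption; the string content and the `can_encode` helper are made up for illustration.

```python
# Byte 0xB7 is the middle dot in cp1252, but on its own it is not valid UTF-8.
raw = "k\u00b7M".encode("cp1252")  # b'k\xb7M'

# Assumption: somewhere on the read path the bytes are decoded as UTF-8 with
# surrogateescape, which smuggles the bad byte through as the lone surrogate U+DCB7.
text = raw.decode("utf-8", errors="surrogateescape")


def can_encode(s: str, encoding: str) -> bool:
    """Hypothetical helper: report whether s encodes under strict error handling."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A lone surrogate cannot be encoded strictly by any of these codecs.
results = {enc: can_encode(text, enc) for enc in ("cp1252", "utf-8", "latin-1")}

# Only the matching error handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
```

If that is indeed the cause, it would also explain why switching the text encoding alone does not help with the old file: the string itself contains a character that no strict encoder accepts.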

fbergmann commented 2 years ago

so i recreated the file using the python notebook mentioned above with:

            with open(history_path, "w", encoding='utf-8') as f:
                f.write(enzmldoc.log.getvalue())

in the reader. Then I can round-trip the files (it will still fall over on the old file that I mailed around earlier). So the only alternative is to write the file as binary and manually encode the string returned from the StringIO object. That way you have the option of specifying what should happen in case of an error.
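A minimal sketch of that binary approach, not the actual PyEnzyme code: `write_history` is a hypothetical helper, and the log text and temporary path are invented for the demonstration. It encodes the string by hand with an explicit error handler and writes the resulting bytes.

```python
import io
import os
import tempfile


def write_history(log: io.StringIO, history_path: str, errors: str = "replace") -> None:
    """Hypothetical helper: encode the log manually, then write raw bytes."""
    # With errors="replace", characters that cannot be encoded (such as lone
    # surrogates) become "?" instead of raising UnicodeEncodeError.
    data = log.getvalue().encode("utf-8", errors=errors)
    with open(history_path, "wb") as f:
        f.write(data)


# Demo: a log that contains the problematic lone surrogate from the traceback.
log = io.StringIO("k = 1.0 mM\udcb7s\n")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.log")
    write_history(log, path)
    with open(path, "rb") as f:
        written = f.read()
```

Other standard error handlers could be passed instead: `"backslashreplace"` keeps the offending code point visible as an escape sequence, and `"surrogateescape"` writes back the original byte. Which one suits the history file is a design choice this sketch does not settle.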

JR-1991 commented 2 years ago

Thanks for fixing that! I would propose that, on failure, an empty history file is written. Otherwise the reader might throw an error if there is none.
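The fallback proposed here could look roughly like the following. This is a sketch under assumptions: `write_history_or_empty` is a hypothetical name, and strict UTF-8 is assumed as the target encoding.

```python
import io
import os
import tempfile


def write_history_or_empty(log: io.StringIO, history_path: str) -> bool:
    """Hypothetical helper: write the log, or an empty file if encoding fails."""
    try:
        data = log.getvalue().encode("utf-8")
        ok = True
    except UnicodeEncodeError:
        # Fall back to an empty file so that a reader expecting
        # history.log still finds one.
        data = b""
        ok = False
    with open(history_path, "wb") as f:
        f.write(data)
    return ok


# Demo: a log with a lone surrogate triggers the empty-file fallback.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "history.log")
    ok = write_history_or_empty(io.StringIO("bad \udcb7 char"), path)
    exists = os.path.exists(path)
    size = os.path.getsize(path)
```

The trade-off is that the whole history is discarded on a single bad character, whereas encoding with an error handler such as `"replace"` would keep the rest of the log.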