althonos / pronto

A Python frontend to (Open Biomedical) Ontologies.
https://pronto.readthedocs.io
MIT License

Pronto fails to load OPMI by inferring wrong encoding #221

Open ElDeveloper opened 8 months ago

ElDeveloper commented 8 months ago

To reproduce, download the OWL-formatted OPMI ontology from BioPortal, then run:

import pronto
pronto.Ontology('opmi-merged.owl')

Loading fails with the exception below. Note the warning stating that "Windows-1252" was assumed, with only 73% confidence. If I edit io.py to force "utf-8" as the encoding, the file loads just fine. Is there a supported way to set the encoding of the file I'm loading, without patching the library?

/var/folders/4b/gklb4t292nq0vyg08x59gjzc0000gn/T/ipykernel_29628/2923417831.py:1: UnicodeWarning: unsound encoding, assuming Windows-1252 (73% confidence)
  Ontology('/Users/yoshiki/Downloads/opmi-merged.owl')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[188], line 1
----> 1 Ontology('/Users/yoshiki/Downloads/opmi-merged.owl')

File ~/miniconda3/envs/db/lib/python3.11/site-packages/pronto/ontology.py:283, in Ontology.__init__(self, handle, import_depth, timeout, threads)
    281 for cls in BaseParser.__subclasses__():
    282     if cls.can_parse(typing.cast(str, self.path), buffer):
--> 283         cls(self).parse_from(_handle)  # type: ignore
    284         break
    285 else:

File ~/miniconda3/envs/db/lib/python3.11/site-packages/pronto/parsers/rdfxml.py:84, in RdfXMLParser.parse_from(self, handle, threads)
     82 def parse_from(self, handle, threads=None):
     83     # Load the XML document into an XML Element tree
---> 84     tree: etree.ElementTree = etree.parse(handle)
     86     # Load metadata from the `owl:Ontology` element
     87     owl_ontology = tree.find(_NS["owl"]["Ontology"])

File ~/miniconda3/envs/db/lib/python3.11/xml/etree/ElementTree.py:1218, in parse(source, parser)
   1209 """Parse XML document into element tree.
   1210 
   1211 *source* is a filename or file object containing XML data,
   (...)
   1215 
   1216 """
   1217 tree = ElementTree()
-> 1218 tree.parse(source, parser)
   1219 return tree

File ~/miniconda3/envs/db/lib/python3.11/xml/etree/ElementTree.py:580, in ElementTree.parse(self, source, parser)
    574     parser = XMLParser()
    575     if hasattr(parser, '_parse_whole'):
    576         # The default XMLParser, when it comes from an accelerator,
    577         # can define an internal _parse_whole API for efficiency.
    578         # It can be used to parse the whole source without feeding
    579         # it with chunks.
--> 580         self._root = parser._parse_whole(source)
    581         return self._root
    582 while True:

File ~/miniconda3/envs/db/lib/python3.11/site-packages/pronto/utils/io.py:24, in BufferedReader.read(self, size)
     22 def read(self, size: Optional[int] = -1) -> bytes:
     23     try:
---> 24         return super(BufferedReader, self).read(size)
     25     except ValueError:
     26         if typing.cast(io.BufferedReader, self.closed):

File ~/miniconda3/envs/db/lib/python3.11/site-packages/pronto/utils/io.py:60, in EncodedFile.readinto(self, buffer)
     59 def readinto(self, buffer: ByteString) -> int:
---> 60     chunk = self.read(len(buffer) // 2)
     61     typing.cast(bytearray, buffer)[: len(chunk)] = chunk
     62     return len(chunk)

File ~/miniconda3/envs/db/lib/python3.11/site-packages/pronto/utils/io.py:56, in EncodedFile.read(self, size)
     55 def read(self, size: Optional[int] = -1) -> bytes:
---> 56     chunk = super().read(-1 if size is None else size)
     57     return chunk.replace(b"\r\n", b"\n")

File <frozen codecs>:814, in read(self, size)

File <frozen codecs>:507, in read(self, size, chars, firstline)

File ~/miniconda3/envs/db/lib/python3.11/encodings/cp1252.py:15, in Codec.decode(self, input, errors)
     14 def decode(self,input,errors='strict'):
---> 15     return codecs.charmap_decode(input,errors,decoding_table)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29335: character maps to <undefined>
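
For anyone hitting the same error, here is a minimal sketch of why a Windows-1252 guess can fail on bytes that are perfectly valid UTF-8. Byte 0x9d has no mapping in Python's cp1252 codec, but it is a legal continuation byte in UTF-8. The helper name `first_undecodable` and the sample string are illustrative assumptions, not part of pronto, and I have not confirmed which character sits at position 29335 of the OPMI file.

```python
from typing import Optional


def first_undecodable(data: bytes, encoding: str) -> Optional[int]:
    """Return the offset of the first byte `encoding` cannot decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample: a right double quotation mark (U+201D) encoded as UTF-8.
# Its last byte is 0x9d, the same byte value reported in the traceback above.
sample = "\u201d".encode("utf-8")

print(first_undecodable(sample, "utf-8"))   # strict UTF-8 decoding
print(first_undecodable(sample, "cp1252"))  # the codec pronto assumed

# To check the real file (path is an assumption, adjust as needed):
# with open("opmi-merged.owl", "rb") as handle:
#     print(first_undecodable(handle.read(), "utf-8"))
```

If the check on the real file reports no undecodable byte for "utf-8", that would be consistent with the file being UTF-8 and the encoding detection guessing wrong, which matches what I see when forcing "utf-8" in io.py.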