PabloCastellano / bormeparser

A Python library for parsing BORME files (Boletín Oficial del Registro Mercantil in Spain).
GNU General Public License v3.0

Spanish administration finally decided to switch to UTF8 instead of using iso-8859-1 #23

Closed Markcial closed 5 years ago

Markcial commented 5 years ago

The library now fails with a parsing error on the current XML files from the BOE. It seems they have fixed the encoding of the XML files on their end. I don't know yet whether the change is permanent or whether they will keep switching back and forth between encodings.

Maybe we need an encoding check to determine whether a file is really UTF-8 or still ISO-8859-1. Alternatively, we could use some kind of transliteration process to ignore such errors and just import the entities.
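A minimal sketch of such an encoding check, assuming the raw bytes of the downloaded XML are available. The function name `decode_xml_bytes` is hypothetical and is not part of bormeparser, and this is not necessarily how the eventual fix works. It tries strict UTF-8 first and falls back to ISO-8859-1 only when that fails:

```python
from typing import Tuple


def decode_xml_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for a downloaded XML payload.

    Hypothetical helper for illustration; not part of bormeparser.
    """
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


# The same word encoded both ways decodes to the same text.
utf8_bytes = "Administración".encode("utf-8")
latin1_bytes = "Administración".encode("iso-8859-1")

print(decode_xml_bytes(utf8_bytes))
print(decode_xml_bytes(latin1_bytes))
```

This heuristic is not airtight: pure ASCII input is valid in both encodings, and an ISO-8859-1 byte sequence can occasionally happen to be valid UTF-8. For the "just ignore the errors" option, `raw.decode("utf-8", errors="replace")` substitutes undecodable bytes instead of raising, at the cost of losing the affected characters.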

The XML files that failed in our import process are:

https://www.boe.es/diario_borme/xml.php?id=BORME-S-20190121 and https://www.boe.es/diario_borme/xml.php?id=BORME-S-20190122

vaijira commented 5 years ago

I have created a pull request for this issue with a tentative and quick fix.

Markcial commented 5 years ago

really good solution

PabloCastellano commented 5 years ago

Thanks for the heads-up. I'll review the other PRs and make a new release.