HW-SWeL / BMUSE

Bioschemas Mark Up Scraper and Extractor
https://app.swaggerhub.com/apis-docs/swel/BMUSE/
Apache License 2.0

Bad management of (UTF8?) accented characters #51

Closed: frmichel closed this issue 3 years ago

frmichel commented 4 years ago

Hey guys, I'm trying out BMUSE to scrape the Taxon annotations on the pages of the Museum of Natural History of Paris. On this page: https://inpn.mnhn.fr/espece/cd_nom/60878, the resulting triples turn accented characters like 'à' or 'é' into '?'. Example:

<https://inpn.mnhn.fr/espece/cd_nom/60878> <http://purl.org/dc/terms/title> "Delphinus delphis Linnaeus, 1758 - Dauphin commun, Dauphin commun ? bec court, Dauphin commun ? bec long - Pr?sentation" .

This is usually a sign that UTF-8 characters are not being handled correctly. Have you noticed this already?

Franck.

AlasdairGray commented 4 years ago

Thanks @frmichel for reporting this issue. It explains what I had spotted with a sample crawl we had done over the pages as well. We were focusing on other issues so did not look into the problem.

frmichel commented 4 years ago

Ok cool, thx, I'll follow on that one.

frmichel commented 4 years ago

Hey guys, I've found the problem: I'm running on a Windows system where the JVM defaults to the Cp1252 charset. On a Linux platform, it works fine.

It is possible to force the default charset to UTF-8 at JVM startup with -Dfile.encoding=UTF-8. That fixes the problem. As simple as that... It actually took me quite some time to figure this out, after making lots of unsuccessful changes in the code itself ;)
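For anyone who wants to check this on their own machine, here is a minimal sketch (not BMUSE code; the class name `CharsetDemo` and the helper `writeWith` are made up for illustration). It prints the charset the JVM picked up and shows that passing an explicit charset to the writer makes the output independent of the platform default. The accented characters are written as Unicode escapes so the source file itself cannot be misread by the compiler. As far as I know, Java 18 and later already default to UTF-8, so the flag mostly matters on older JVMs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of BMUSE.
class CharsetDemo {

    // Encodes text with an explicitly named charset, so the result does not
    // depend on the platform default (e.g. Cp1252 on a Windows system).
    static byte[] writeWith(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The charset used whenever code does not name one explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "Dauphin commun à bec court - Présentation", written with escapes.
        String title = "Dauphin commun \u00e0 bec court - Pr\u00e9sentation";

        // Round trip through UTF-8 bytes with the charset named on both sides.
        byte[] utf8 = writeWith(title, StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("Round trip preserved the title: " + decoded.equals(title));
    }
}
```

Running it once without any option and once with `-Dfile.encoding=UTF-8` should show whether the default charset changes on your JVM; the round trip itself does not depend on it.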

I've updated the main README on my own fork, along with all my other changes to the configuration (https://github.com/frmichel/BMUSE/tree/dev_properties). We should probably discuss those before I submit a pull request.

AlasdairGray commented 4 years ago

Thanks for updating us. These property settings are always very frustrating.

petrospaps commented 4 years ago

Hi Franck,

Thank you for highlighting the problem and finding a solution. It would be good to have a chat when you are available.

Best wishes, Petros

petrospaps commented 3 years ago

Hi Franck,

I think this can be closed now.

frmichel commented 3 years ago

Hi @petrospaps, sure, the info is in the README. I'm closing it.

Franck.