Closed: frmichel closed this issue 3 years ago
Thanks @frmichel for reporting this issue. It explains what I had spotted with a sample crawl we had done over the pages as well. We were focusing on other issues so did not look into the problem.
Ok cool, thx, I'll follow up on that one.
Hey guys, I've found the problem: I'm running on a Windows system whose default charset is Cp1252. On a Linux platform, it works fine.
It is possible to force the default charset to UTF-8 at JVM startup with -Dfile.encoding=UTF-8. That fixes the problem. As simple as that... It actually took me quite some time to figure this out, after making lots of unsuccessful changes in the code itself ;)
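For anyone hitting the same symptom, here is a minimal sketch for checking which default charset the JVM picked up, and for converting text with an explicitly named charset so the result does not depend on that default. This is not BMUSE code: the class name `CharsetCheck`, its two helper methods, and the sample title string are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of BMUSE.
class CharsetCheck {

    // Encode with an explicitly named charset instead of the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shows the charset the JVM uses whenever none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Illustrative sample with accented characters, written as Unicode
        // escapes so the source file's own encoding does not matter.
        String title = "Dauphin commun \u00e0 bec court - Pr\u00e9sentation";

        // The round trip is lossless because both directions name UTF-8.
        String back = decodeUtf8(encodeUtf8(title));
        System.out.println("Round trip preserved text: " + back.equals(title));
    }
}
```

Running this with and without -Dfile.encoding=UTF-8 should show whether the flag changes the reported default on a given machine. The JVM flag remains the quickest fix. Naming the charset in code is the more portable option wherever the code is under your control.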
I've updated the main README on my own fork, along with all my other changes to the configuration (https://github.com/frmichel/BMUSE/tree/dev_properties). We should probably discuss those before I submit a pull request.
Thanks for updating us. These property settings are always very frustrating.
Hi Franck,
Thank you for highlighting the problem and finding a solution. It would be good to have a chat when you are available.
Best wishes, Petros
Hi Franck,
I think this can be closed now.
Hi @petrospaps, sure, the info is in the README. I'm closing it.
Franck.
Hey guys, I'm trying out BMUSE to scrape the Taxon annotations on the pages of the Museum of Natural History of Paris. On this page: https://inpn.mnhn.fr/espece/cd_nom/60878, the resulting triples turn accented characters like 'à' or 'é' into '?'. Example:
<https://inpn.mnhn.fr/espece/cd_nom/60878> <http://purl.org/dc/terms/title> "Delphinus delphis Linnaeus, 1758 - Dauphin commun, Dauphin commun ? bec court, Dauphin commun ? bec long - Pr?sentation" .
This is usually related to UTF-8 characters not being handled correctly. Have you noticed this already?
Franck.