LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

HTML5 charset declaration not detected #11

Open guillaumepitel opened 6 years ago

guillaumepitel commented 6 years ago

Hi, we've had some issues with charset detection on some websites. The current implementation in BUbiNG lacks a regex that detects HTML5 charset declarations.

We've implemented it, as well as a fallback using ICU probabilistic charset detection (with a dependency on ICU).

I think the HTML5 charset detection could easily be submitted to the main repo. Where do you stand on a potential dependency on ICU?
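For illustration, here is a minimal sketch of what such a detection could look like. It is not the code from the proposed patch, nor BUbiNG's actual parser. The class name `Html5CharsetSniffer`, the method `sniff`, and the 1024-byte scan window are assumptions made for this example. The regex looks for `charset=` inside a `<meta>` tag, so it covers the HTML5 form `<meta charset="utf-8">` and also happens to match the older `http-equiv` form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
final class Html5CharsetSniffer {
    // Matches charset=<name> anywhere inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?\\bcharset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
        Pattern.CASE_INSENSITIVE);

    // Assumption: only the first 1024 bytes of the page are scanned.
    private static final int SCAN_LIMIT = 1024;

    /** Returns the declared charset, or empty if none is declared or the name is unknown. */
    static Optional<Charset> sniff(byte[] content) {
        int length = Math.min(content.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        if (!matcher.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(matcher.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

When `sniff` returns an empty result, a caller could fall back to a statistical guess, for example with ICU4J's `com.ibm.icu.text.CharsetDetector`; that fallback is left out here to keep the sketch free of external dependencies.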

vigna commented 6 years ago

We're perfectly fine with that. Please submit a pull request (including adding icu to ivy.xml). Thanks!

ChuckNoxis commented 6 years ago

13 👍