Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
https://smartreader.inre.me
Apache License 2.0

System.Globalization.CultureNotFoundException #7

Closed by iixi 6 years ago

iixi commented 6 years ago

Hi,

We are crawling a lot of different sites and using your library to extract the article content. We noticed that some of the sites have an invalid culture name in the HTML lang attribute. When this happens, the reader throws System.Globalization.CultureNotFoundException in Article.GetWeightTimeToRead.

What do you think would be a nice solution? Maybe add the possibility to set a fallback language, or just fall back to CultureInfo.InvariantCulture if the provided culture name is invalid?

Great job by the way!

gabriele-tomassetti commented 6 years ago

Well, that's a bug. My code should catch the exception, but I evidently forgot to add the handler.

If the language is invalid, the code should already fall back to CultureInfo.InvariantCulture and then return the default weight of 960, which is the average of the values for the languages in the study I use as a reference.
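The fallback described above could look roughly like the sketch below. It is not SmartReader's actual code: the class CultureFallback, the methods Resolve and WeightFor, and the weights table passed in are all hypothetical names used for illustration. Only CultureInfo.GetCultureInfo, CultureNotFoundException, CultureInfo.InvariantCulture and the default value of 960 come from .NET or from this thread. Which lang values .NET actually rejects depends on the runtime and the operating system.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CultureFallback
{
    // Default weight mentioned in this thread for unknown languages.
    public const int DefaultWeight = 960;

    // Hypothetical helper: map an HTML lang value to a CultureInfo,
    // falling back to InvariantCulture when it is missing or rejected.
    public static CultureInfo Resolve(string lang)
    {
        if (string.IsNullOrWhiteSpace(lang))
            return CultureInfo.InvariantCulture;

        try
        {
            return CultureInfo.GetCultureInfo(lang.Trim());
        }
        catch (CultureNotFoundException)
        {
            // Invalid culture name in the page: do not let it propagate.
            return CultureInfo.InvariantCulture;
        }
    }

    // Hypothetical helper: look up a per-language weight by two-letter
    // language code, returning the default when there is no entry.
    public static int WeightFor(CultureInfo culture, IReadOnlyDictionary<string, int> weights)
    {
        return weights.TryGetValue(culture.TwoLetterISOLanguageName, out int weight)
            ? weight
            : DefaultWeight;
    }
}
```

With this shape, a call such as CultureFallback.WeightFor(CultureFallback.Resolve(lang), weights) never throws because of a bad lang attribute. Any culture without an entry in the table, including InvariantCulture, gets the default of 960.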

Thanks for alerting me to the issue. I will fix it over the weekend.