Strumenta / SmartReader

SmartReader is a library that extracts the main content of a web page, based on a port of Mozilla's Readability library.
https://smartreader.inre.me
Apache License 2.0

System.Text.RegularExpressions.RegexParseException: Invalid pattern #56

Status: Closed (iansmirlis closed this issue 1 year ago)

iansmirlis commented 1 year ago

Hello

Readability.CleanTitle() should escape the string variable siteName with Regex.Escape() before using it as part of a regex pattern. Otherwise, a site name that contains regex metacharacters can produce an invalid pattern and an exception such as the following:

System.Text.RegularExpressions.RegexParseException: Invalid pattern '(.*) [|-\/>»] «Τζιαι τούτοι σάλια, αρκοφωνές τζιαι ποταμούς μελάνιν»*.' at offset 73. Nested quantifier '*'.
   at System.Text.RegularExpressions.RegexParser.ScanRegex()
   at System.Text.RegularExpressions.RegexParser.Parse(String pattern, RegexOptions options, CultureInfo culture)
   at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, CultureInfo culture)
   at System.Text.RegularExpressions.RegexCache.GetOrAdd(String pattern, RegexOptions options, TimeSpan matchTimeout)
   at System.Text.RegularExpressions.Regex.Replace(String input, String pattern, String replacement, RegexOptions options)
   at SmartReader.Readability.CleanTitle(String title, String siteName)
   at SmartReader.Readability.GetArticleMetadata(IHtmlDocument doc, Uri uri, String language, Dictionary`2 jsonLD)
   at SmartReader.Reader.Parse()
   at SmartReader.Reader.GetArticle()

Example URL: http://simerini.sigmalive.com/article/2015/11/30/tziai-toutoi-salia-arkophones-tziai-potamous-melanin/

index.html.gz
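
The suggested fix can be sketched roughly as follows. This is a minimal illustration, not SmartReader's actual code: the class `TitleCleaner`, the method `StripSiteName`, and the exact separator pattern are assumptions made for the example. Only `Regex.Escape` and `Regex.Replace` are real .NET APIs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of SmartReader.
public static class TitleCleaner
{
    // Removes a trailing " <separator> <siteName>..." suffix from a title.
    public static string StripSiteName(string title, string siteName)
    {
        if (string.IsNullOrEmpty(siteName))
        {
            return title;
        }

        // Regex.Escape turns metacharacters such as * + ( ) [ into literals,
        // so the site name is matched as plain text and cannot break the pattern.
        string pattern = @"(.*) [\|\-\\/>»] " + Regex.Escape(siteName) + ".*";

        // Keep only the part captured before the separator.
        return Regex.Replace(title, pattern, "$1", RegexOptions.IgnoreCase);
    }
}
```

With this sketch, a site name such as `My Site**` is matched literally. Concatenating the same string into the pattern without `Regex.Escape` would make the regex parser reject it with a nested quantifier error, as in the trace above.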

gabriele-tomassetti commented 1 year ago

Thanks for the excellent bug report, and for even suggesting a fix.