asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Patch for /src/main/java/edu/uci/ics/crawler4j/parser/HtmlContentHandler.java #252

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
In some cases, e.g. in a bullet list, the last word of the first line is 
accidentally concatenated with the first word of the second line. 
Adding a whitespace after each word avoids this.
We also want to get 'can't' rather than 'can' and ''t', so the patch adds a 
check: if the first char is a single quote, i.e. '\'', we delete the last char 
of bodyText (the whitespace just added) so that strings such as 'can' and ''t' 
are concatenated.
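
The patch file itself is not reproduced in this export, so the following is only a minimal sketch of the logic described above, not the actual change to HtmlContentHandler. The class name `BodyTextSketch` and the helpers `append` and `getBodyText` are hypothetical; the sketch assumes the handler accumulates text in a `StringBuilder` named `bodyText` from a SAX-style `characters(char[], int, int)` callback.

```java
// Hypothetical stand-in for the text-accumulating part of a content handler.
// Assumption: text arrives in chunks, as in a SAX characters() callback.
final class BodyTextSketch {
    private final StringBuilder bodyText = new StringBuilder();

    // Mirrors the shape of ContentHandler.characters(char[], int, int).
    void characters(char[] ch, int start, int length) {
        if (length <= 0) {
            return;
        }
        // If the chunk starts with a single quote, drop the separator that
        // was appended after the previous chunk, so "can" + "'t" -> "can't".
        int last = bodyText.length() - 1;
        if (ch[start] == '\'' && last >= 0 && bodyText.charAt(last) == ' ') {
            bodyText.deleteCharAt(last);
        }
        bodyText.append(ch, start, length);
        // Always follow a chunk with a whitespace so adjacent chunks
        // (e.g. consecutive list items) are not glued together.
        bodyText.append(' ');
    }

    // Convenience helper for feeding whole strings (hypothetical).
    void append(String chunk) {
        characters(chunk.toCharArray(), 0, chunk.length());
    }

    // Returns the accumulated text without the trailing separator.
    String getBodyText() {
        return bodyText.toString().trim();
    }
}
```

One limitation of this rule: any chunk that begins with a single quote is joined to the previous text, so a quoted word such as 'hello' arriving as its own chunk would also lose its leading space.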

Original issue reported on code.google.com by Kun.Hu7...@gmail.com on 29 Jan 2014 at 7:33

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:49