commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

[WET] Missing spaces in parsed content #13

Closed: pipldev closed this issue 7 years ago

pipldev commented 7 years ago

For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:

February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.

It is parsed from the attached HTML fragment (I could not find a good way to embed the HTML here): fragment.html.txt

The problem for me is when multiple words become one, e.g. "Heath9", but not having a newline before "One thing" is also strange.
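To illustrate the symptom: this kind of output is what you get when text nodes are concatenated without marking element boundaries. Below is a minimal Python sketch of that behaviour and of one possible remedy. It is not the ia-web-commons code (which is Java) and not necessarily the fix that was applied. The HTML is a hypothetical stand-in for the attached fragment.html.txt, which is not reproduced here, and `TextExtractor`, `extract_text`, `BLOCK_TAGS` and the `separate_elements` flag are names made up for this illustration.

```python
from html.parser import HTMLParser

# Tags treated as block-level in this sketch (an assumed, incomplete list).
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "tr", "table",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "section", "article", "header", "footer",
}

# Hypothetical markup, guessed from the parsed line quoted in this issue.
FRAGMENT = (
    '<div class="meta"><span>February 25, 2017</span>'
    '<span>by Catherine Heath</span><span>9 min read</span>'
    '<a href="#">Add Comment</a></div> '
    '<p>One thing I’m surprised by in my career is the haters.</p>'
)


class TextExtractor(HTMLParser):
    """Collects text nodes, optionally marking element boundaries."""

    def __init__(self, separate_elements):
        super().__init__(convert_charrefs=True)
        self.separate_elements = separate_elements
        self.parts = []

    def _boundary(self, tag):
        if not self.separate_elements:
            return  # naive mode: adjacent text nodes are glued together
        # Newline around block elements, single space around inline ones.
        self.parts.append("\n" if tag in BLOCK_TAGS else " ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html, separate_elements=True):
    parser = TextExtractor(separate_elements)
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.parts).split("\n"):
        collapsed = " ".join(line.split())  # squeeze runs of whitespace
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)


if __name__ == "__main__":
    print(extract_text(FRAGMENT, separate_elements=False))
    print("---")
    print(extract_text(FRAGMENT, separate_elements=True))
```

With `separate_elements=False` the sketch glues "Heath" and "9" together just like the WET record; with `separate_elements=True` the metadata and the paragraph end up on separate lines. One trade-off of this simple approach: putting a space around every inline element would also split a word that is only partly wrapped in a tag (for example a bolded first letter), so a real extractor would need a more careful rule.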

Original Google Groups discussion: https://groups.google.com/forum/#!topic/common-crawl/heyZMsBT4YY

sebastian-nagel commented 7 years ago

Hi @pipldev, this is fixed for the August crawl (CC-MAIN-2017-34). Thanks!