[WET] Missing spaces in parsed content

For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:

February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.

It is parsed from the attached HTML fragment (I could not find a good way to embed the HTML here): fragment.html.txt

The main problem for me is that multiple words get merged into one, e.g. "Heath9"; the missing newline before "One thing" is also strange.
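
To make the symptom easier to reproduce, here is a minimal sketch in Python using only the standard library. It is not the ia-web-commons extraction code, and the markup in `FRAGMENT` is a hypothetical stand-in for the attached fragment (the real HTML is in fragment.html.txt). The names `TextExtractor`, `extract`, and `BLOCK_TAGS` are made up for this illustration. The sketch assumes the words are fused because adjacent text nodes are concatenated with no separator, and shows one possible remedy: emit a space at inline tag boundaries and a newline at block tag boundaries.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; not the real page source.
FRAGMENT = (
    '<div class="meta"><span>February 25, 2017</span>'
    '<span>by Catherine Heath</span><span>9 min read</span>'
    '<a href="#">Add Comment</a></div>'
    "<p>One thing I'm surprised by is the haters.</p>"
)

# Assumed (incomplete) set of tags that should start a new line.
BLOCK_TAGS = {"div", "p", "br", "li", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collects text nodes, optionally separating them at tag boundaries."""

    def __init__(self, separate):
        super().__init__()
        self.separate = separate
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def _boundary(self, tag):
        # Newline for block-level tags, single space for everything else.
        if self.separate:
            self.parts.append("\n" if tag in BLOCK_TAGS else " ")

    def handle_data(self, data):
        self.parts.append(data)


def extract(html, separate):
    parser = TextExtractor(separate)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    if separate:
        # Collapse runs of whitespace per line and drop empty lines.
        lines = (" ".join(line.split()) for line in text.split("\n"))
        text = "\n".join(line for line in lines if line)
    return text


naive = extract(FRAGMENT, separate=False)  # words fused, as in the WET line
fixed = extract(FRAGMENT, separate=True)   # words and paragraphs separated
print(naive)
print(fixed)
```

A space at every inline tag boundary is a blunt rule: it would also split a word that is only partly wrapped in a tag, such as `<b>H</b>ello`. A real fix would probably need a more careful policy for inline elements.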

Original Google Groups discussion: https://groups.google.com/forum/#!topic/common-crawl/heyZMsBT4YY

Repository: commoncrawl / ia-web-commons, issue #13