For example, in CC-MAIN-20170629154125-20170629174125-00719.warc.wet, the parsed text for the URL http://awaywithwords.co/category/general/ contains the line:
February 25, 2017by Catherine Heath9 min readAdd Comment One thing I’m surprised by in my career (in less than a year at professional blogging) is the haters.
This line was parsed from the attached HTML fragment (I could not find a good way to embed the HTML here):
fragment.html.txt
The main problem for me is that separate words get merged into one, e.g. "Heath9" (and likewise "2017by" and "readAdd"). The missing newline before "One thing" is also strange.
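To illustrate how this kind of merging can happen, here is a minimal sketch using only Python's standard `html.parser`. It is not the extractor that produces the WET files, and `SAMPLE` is hypothetical markup made up to reproduce the symptom, not the attached fragment. The names `TextExtractor`, `extract`, and `BLOCK_TAGS` are likewise assumptions for this example. With `separate=False` the text nodes are concatenated as-is, which gives the merged words; with `separate=True` a space is emitted at every tag boundary and a newline at block-level ones.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup that reproduces the symptom (NOT the attached fragment).
SAMPLE = (
    '<div class="meta"><span>February 25, 2017</span>'
    '<span>by Catherine Heath</span><span>9 min read</span>'
    '<a href="#">Add Comment</a></div> '
    "<p>One thing I'm surprised by</p>"
)

# Assumed (incomplete) set of elements that should start a new line.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "tr", "table"}


class TextExtractor(HTMLParser):
    """Collects text nodes, optionally separating them at tag boundaries."""

    def __init__(self, separate):
        super().__init__()
        self.separate = separate
        self.parts = []

    def _boundary(self, tag):
        if self.separate:
            # Newline for block-level elements, plain space for inline ones.
            self.parts.append("\n" if tag in BLOCK_TAGS else " ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        # Whitespace inside the source (including newlines) becomes one space,
        # so the only newlines left are the ones added at block boundaries.
        self.parts.append(re.sub(r"\s+", " ", data))


def extract(html, separate):
    parser = TextExtractor(separate)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = (" ".join(line.split()) for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


print(extract(SAMPLE, separate=False))
print(extract(SAMPLE, separate=True))
```

Inserting a space at every inline tag boundary is a blunt choice: it would also split a word such as `<b>H</b>ello`, so a real extractor would probably need a more careful rule for inline elements.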
Original Google Groups discussion: https://groups.google.com/forum/#!topic/common-crawl/heyZMsBT4YY