Albert-Ma closed this issue 3 years ago
I think this is because BBC has updated its HTML structure. In the current structure, the news text is wrapped in nested tags: a <p> tag inside two <div> tags, all inside an <article> tag, i.e.,
<article>
<div><div><p>text1</p></div></div>
<div><div><p>text2</p></div></div>
</article>
Goose3's extraction algorithms do not recognize this structure. Therefore, I overrode two functions in Goose3's ContentExtractor.
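To make the nested structure above concrete, here is a minimal sketch that pulls the text out of every <p> nested at any depth inside an <article>. It uses only Python's standard-library html.parser and is not the Goose3 override itself; the names ArticleParagraphCollector and extract_article_text are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class ArticleParagraphCollector(HTMLParser):
    """Collect the text of each <p> found anywhere inside an <article>.

    Hypothetical helper for illustration; not part of Goose3.
    """

    def __init__(self):
        super().__init__()
        self.article_depth = 0   # > 0 while inside an <article>
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            # The wrapping <div> levels are ignored; only <p> matters.
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_article_text(html):
    """Return the article paragraphs joined by blank lines."""
    collector = ArticleParagraphCollector()
    collector.feed(html)
    collector.close()
    return "\n\n".join(collector.paragraphs)


sample = """
<article>
<div><div><p>text1</p></div></div>
<div><div><p>text2</p></div></div>
</article>
"""
print(extract_article_text(sample))
```

Because the collector ignores how many <div> wrappers sit between <article> and <p>, it would keep working if the nesting depth changed again. A real page would likely also need filtering of captions, links, and other non-body paragraphs, which this sketch does not attempt.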
Why is the article.text extracted by the crawler from BBC HTML too short?