LuChang-CS / news-crawler

A news crawler for BBC News, Reuters and New York Times.

108 stars, 40 forks

Extract too short text #5

Closed: Albert-Ma closed this issue 3 years ago

Albert-Ma commented 3 years ago

Why is the article.text extracted by the crawler from BBC HTML too short?

LuChang-CS commented 3 years ago

I think it is because BBC has updated its HTML structure. In the current structure, the news text is nested several levels deep: each <p> tag sits inside two <div> tags, which in turn sit inside an <article> tag, i.e.,

<article>
    <div><div><p>text1</p></div></div>
    <div><div><p>text2</p></div></div>
</article>

The algorithms of Goose3 cannot recognize this structure, so they return only part of the text. Therefore, I overrode two functions in the ContentExtractor of Goose3.
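
To illustrate the nesting described above, here is a minimal sketch that collects every <p> found anywhere inside an <article>, however many <div> wrappers surround it. It is not the override used in this repository and does not call Goose3 at all; it uses only Python's standard html.parser module, and the names ArticleParagraphParser and extract_article_text are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class ArticleParagraphParser(HTMLParser):
    """Collect the text of each <p> nested at any depth inside an <article>.

    Illustrative only: this is not Goose3's ContentExtractor.
    """

    def __init__(self):
        super().__init__()
        self.article_depth = 0      # > 0 while inside an <article>
        self.in_paragraph = False   # True while inside a <p> within an <article>
        self.paragraphs = []        # finished paragraph texts
        self._buffer = []           # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text outside a paragraph (e.g. whitespace between tags) is ignored.
        if self.in_paragraph:
            self._buffer.append(data)


def extract_article_text(html):
    """Return the article's paragraphs joined by newlines."""
    parser = ArticleParagraphParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


sample = """<article>
    <div><div><p>text1</p></div></div>
    <div><div><p>text2</p></div></div>
</article>"""

print(extract_article_text(sample))
```

Run on the sample structure, this prints text1 and text2 on separate lines. Because the parser ignores how deeply a paragraph is wrapped, it keeps working if the number of <div> levels changes, but it makes no attempt to score or filter content the way a full extractor does.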