cdimascio / essence

Automatically extract the main text content (and more) from an HTML document
Apache License 2.0

DocumentScorer.kt stopwords.size > 2 seems to be wrong #9

Open zaixiaguozhen opened 2 years ago

zaixiaguozhen commented 2 years ago

I'm not sure whether I understand this correctly, but in DocumentScorer.kt I think the condition used here is wrong:

class DocumentScorer(private val stopWords: StopWords) : Scorer {

    override fun score(doc: Document): ScoredElement? {
        val nodesWithText = mutableListOf<Element>()
        val nodesToCheck = doc.select("p, pre, td")
        nodesToCheck.forEach { node ->
            val text = node.text()
            val wordStats = stopWords.statistics(text)
            val hasHighLinkDensity = NodeHeuristics.hasHighLinkDensity(node)
            // If wordStats.stopWords.size is bigger than 2, shouldn't this node be ignored rather than added to nodesWithText?
            // I think this should be changed to: wordStats.stopWords.size <= 2
            if (wordStats.stopWords.size > 2 && !hasHighLinkDensity) {
                nodesWithText.add(node)
            }
        }
        ......
    }
}

I think the intent is to find the nodes with good text, that is, nodes that do not contain a lot of stop words, right?
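
To make the question concrete, here is a minimal, self-contained sketch that compares the two conditions on sample strings. Everything in it is an assumption for illustration: `sampleStopWords`, `stopWordsIn`, `keptByCurrentCheck`, and `keptByProposedCheck` are hypothetical names, the stop-word list is a tiny made-up one rather than the list essence ships, and the link-density check is left out for brevity. It does not use the library's `StopWords` class.

```kotlin
// Hypothetical, tiny stop-word list for illustration only; not the list used by essence.
val sampleStopWords = setOf("the", "a", "is", "of", "and", "to", "in", "it", "on")

// Returns the stop words found in the text, one entry per occurrence.
fun stopWordsIn(text: String): List<String> =
    text.lowercase()
        .split(Regex("\\W+"))
        .filter { it in sampleStopWords }

// The condition as it appears in the snippet above (without the link-density part).
fun keptByCurrentCheck(text: String): Boolean = stopWordsIn(text).size > 2

// The condition proposed in this issue.
fun keptByProposedCheck(text: String): Boolean = stopWordsIn(text).size <= 2

fun main() {
    val prose = "The cat is on the roof of the house and it is happy"
    val menu = "Home | Products | Contact"
    for (text in listOf(prose, menu)) {
        println("current=${keptByCurrentCheck(text)} proposed=${keptByProposedCheck(text)} :: $text")
    }
}
```

With this sample list, the current condition keeps the sentence-like string and drops the menu-like string, while the proposed condition does the opposite. That is the behavioral difference I would like to have confirmed as intended or not.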