I'm not sure I understand this correctly, but in DocumentScorer.kt I think the condition below is the wrong way around:
class DocumentScorer(private val stopWords: StopWords) : Scorer {
    override fun score(doc: Document): ScoredElement? {
        val nodesWithText = mutableListOf<Element>()
        val nodesToCheck = doc.select("p, pre, td")
        nodesToCheck.forEach { node ->
            val text = node.text()
            val wordStats = stopWords.statistics(text)
            val hasHighLinkDensity = NodeHeuristics.hasHighLinkDensity(node)
            // If stopWords.size is greater than 2, shouldn't this node be ignored rather than added to nodesWithText?
            // I think the condition should be: wordStats.stopWords.size <= 2
            if (wordStats.stopWords.size > 2 && !hasHighLinkDensity) {
                nodesWithText.add(node)
            }
        }
        ......
    }
}
I think the intent is to find nodes with good text that do not contain a lot of stopwords, right?
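To make the question concrete, here is a minimal standalone sketch of the two variants of the condition. It is not the library's code: the stop word list, the sample strings, and the names `sampleStopWords`, `countStopWords`, `keptByCurrentCheck`, and `keptByProposedCheck` are all made up for illustration, and the link-density check is left out.

```kotlin
// Hypothetical stand-in for the real StopWords class; this word list is invented for the example.
val sampleStopWords = setOf("the", "a", "is", "of", "and", "to", "in", "it", "that")

// Counts how many tokens of the text appear in the sample stop word list.
fun countStopWords(text: String): Int =
    text.lowercase()
        .split(Regex("\\W+"))
        .count { it in sampleStopWords }

// Condition as it is in the snippet above: keep text with more than 2 stop words.
fun keptByCurrentCheck(text: String): Boolean = countStopWords(text) > 2

// Condition as I proposed it: keep text with at most 2 stop words.
fun keptByProposedCheck(text: String): Boolean = countStopWords(text) <= 2

fun main() {
    val samples = listOf(
        "The quick fox jumped over the fence and ran to the river.",
        "Home | Products | Contact"
    )
    for (s in samples) {
        val n = countStopWords(s)
        println("$n stop words, current=${keptByCurrentCheck(s)}, proposed=${keptByProposedCheck(s)}: $s")
    }
}
```

Running this prints, for each sample, the stop word count and whether each variant would keep it, which should make it easier to say which behaviour is the intended one.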