chatnoir-eu / web-content-extraction-benchmark

Web Content Extraction Benchmark
Apache License 2.0

COMMENTS in Dragnet ground-truth #4

Open guillaumepitelbabbartech opened 5 months ago

guillaumepitelbabbartech commented 5 months ago

Hi, in the Dragnet post-processed dataset, there are 399 files containing the string "!@#$%^&*() COMMENTS" followed by the comment part of the page. It's the only dataset with this kind of information, and since most extractors ignore comments (with the notable exception of trafilatura, which produces an optional comments body), I think the benchmark could be slightly improved in this regard.
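
To illustrate, here is a minimal sketch of how the comment part could be split off a ground-truth text before scoring. It assumes the marker appears at most once per file and that everything after it is comments; `split_comments` is a hypothetical helper of mine, not something from the benchmark code.

```python
from typing import Tuple

# Marker string found in the Dragnet post-processed ground-truth files.
COMMENT_MARKER = "!@#$%^&*() COMMENTS"


def split_comments(text: str, marker: str = COMMENT_MARKER) -> Tuple[str, str]:
    """Split a ground-truth text into (main_content, comments).

    Hypothetical helper: assumes everything after the first occurrence
    of the marker is the comment section of the page.
    """
    main, sep, comments = text.partition(marker)
    if not sep:
        # No marker in this file: the whole text is main content.
        return text, ""
    return main.rstrip(), comments.strip()


if __name__ == "__main__":
    sample = "Article body.\n!@#$%^&*() COMMENTS\nFirst comment.\n"
    main, comments = split_comments(sample)
    print(repr(main))
    print(repr(comments))
```

With something like this, extractors that ignore comments could be scored against the main part only, and the comment part could be kept separately for extractors that output it.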