Hi, in the Dragnet post-processed dataset, there are 399 files containing the string "!@#$%^&*() COMMENTS" followed by the comment section of the page. It's the only dataset with this kind of information, and since most extractors ignore comments (with the notable exception of trafilatura, which produces an optional comments body), I think the benchmark could be slightly improved in this regard.
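To make the idea concrete, here is a rough sketch of how a gold-standard file could be split at that marker, so the main text and the comments can be scored separately. The function name `split_gold_standard` is hypothetical, and I'm assuming the marker appears at most once per file and that everything after it is comment text.

```python
from typing import Tuple

# Marker string found in the Dragnet post-processed files.
COMMENTS_MARKER = "!@#$%^&*() COMMENTS"


def split_gold_standard(text: str) -> Tuple[str, str]:
    """Split a gold-standard text into (main_text, comments).

    Assumption: the marker occurs at most once; if it is absent,
    the comments part is returned as an empty string.
    """
    # partition() splits at the first occurrence of the marker only.
    main, _marker, comments = text.partition(COMMENTS_MARKER)
    return main.strip(), comments.strip()


if __name__ == "__main__":
    sample = "Article body.\n!@#$%^&*() COMMENTS\nFirst comment."
    main_text, comments = split_gold_standard(sample)
    print(repr(main_text))
    print(repr(comments))
```

With something like this, extractors that ignore comments could be compared against the main text only, while the comments part stays available for tools that extract it.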