Fix and add html extractor

grodino commented 2 years ago

Hi !

Lately, I tried using the HTML extractor wrapper for clueweb documents. When wrapping directly the corpus docstore, everything works fine but when composing vrappers (dataset -> html extractor -> cachedocstore), I had an error because the html docstore wrapper did not super().__init__ itself.

This PR thus does two things :

Init the HtmlDocExtractorDocStoreWrapper
Add a new html extractor based on inscriptis (it seem that empirically, there is less junk chars than the vanilla bs4 extractor)

Cheers, Augustin

seanmacavaney commented 2 years ago

Thanks! The HtmlDocExtractorDocStoreWrapper doesn't get much attention, so thanks for the improvements and bug fix.

Mind adding a unit test for it?

grodino commented 2 years ago

Mind adding a unit test for it?

Done but I still couldn't run the tests myself.

I tried python -m test.integration.clueweb12 TestClueWeb12.test_clueweb12_docs_html and got an error

======================================================================
ERROR: test_clueweb12_docs_html (__main__.TestClueWeb12) [docs_iter split] (dataset=<ir_datasets.wrappers.html_extractor.HtmlDocExtractor object at 0x7f42cb444130>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/agodinot/experiments/ir_datasets/test/integration/base.py", line 38, in _test_docs
    self._assert_namedtuple(next(it[idx:idx+1]), doc)
  File "/home/agodinot/experiments/ir_datasets/ir_datasets/wrappers/html_extractor.py", line 33, in __next__
    return next(self.mapped_it)
StopIteration

As if the test case could not retrieve the documents :/ (I tried just using docstore.get_many() in the test case and it worked)

seanmacavaney commented 2 years ago

Thanks! I didn't have time to go over this today, but I'll look into it tomorrow.

janheinrichmerker commented 1 year ago

The code looks good to me :+1: The added dependency inscriptis is not too big (~40kb) and seems to be well-maintained. Tests are passing. Let's get this merged :wink:

seanmacavaney commented 1 year ago

Thanks for bumping this PR @heinrichreimer, and thanks @grodino for the contribution!

allenai / ir_datasets

Fix and add html extractor #201