Closed grodino closed 1 year ago
Thanks! The HtmlDocExtractorDocStoreWrapper
doesn't get much attention, so thanks for the improvements and bug fix.
Mind adding a unit test for it?
Mind adding a unit test for it?
Done but I still couldn't run the tests myself.
I tried python -m test.integration.clueweb12 TestClueWeb12.test_clueweb12_docs_html
and got an error
======================================================================
ERROR: test_clueweb12_docs_html (__main__.TestClueWeb12) [docs_iter split] (dataset=<ir_datasets.wrappers.html_extractor.HtmlDocExtractor object at 0x7f42cb444130>)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/agodinot/experiments/ir_datasets/test/integration/base.py", line 38, in _test_docs
self._assert_namedtuple(next(it[idx:idx+1]), doc)
File "/home/agodinot/experiments/ir_datasets/ir_datasets/wrappers/html_extractor.py", line 33, in __next__
return next(self.mapped_it)
StopIteration
As if the test case could not retrieve the documents :/ (I tried just using docstore.get_many()
in the test case and it worked)
Thanks! I didn't have time to go over this today, but I'll look into it tomorrow.
The code looks good to me :+1:
The added dependency inscriptis
is not too big (~40kb) and seems to be well-maintained.
Tests are passing. Let's get this merged :wink:
Thanks for bumping this PR @heinrichreimer, and thanks @grodino for the contribution!
Hi !
Lately, I tried using the HTML extractor wrapper for clueweb documents. When wrapping directly the corpus docstore, everything works fine but when composing vrappers (dataset -> html extractor -> cachedocstore), I had an error because the html docstore wrapper did not
super().__init__
itself.This PR thus does two things :
HtmlDocExtractorDocStoreWrapper
inscriptis
(it seem that empirically, there is less junk chars than the vanillabs4
extractor)Cheers, Augustin