attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Help! Do dump files contain the wikitables in Wikipedia? #247

Open · HamLaertes opened this issue 3 years ago

HamLaertes commented 3 years ago

Hello everyone. I downloaded the first file, enwiki-20210220-pages-articles1.xml-p1p41242.bz2, from the Wikipedia dump server. After running the script I successfully got the extracted text. However, the text seems to ignore the table information in the wiki pages, i.e. the wikitables. Am I missing something, or do the dump files not contain the table information at all? Thanks!
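For anyone who wants to check this before running the extractor: in the dump's wikitext, wikitables open with `{|` and close with `|}`, so a quick scan of the compressed file shows whether the markup is there at all. A minimal sketch, assuming only the Python standard library and the exact filename from above:

```python
# Quick check (separate from wikiextractor) that the raw dump contains
# wikitable markup: in wikitext, tables open with "{|" and close with "|}".
import bz2

DUMP = "enwiki-20210220-pages-articles1.xml-p1p41242.bz2"

openings = 0
with bz2.open(DUMP, "rt", encoding="utf-8") as dump:
    for line in dump:
        if line.lstrip().startswith("{|"):  # start of a wikitable block
            openings += 1

print(f"found {openings} wikitable openings in {DUMP}")
```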

HamLaertes commented 3 years ago

I think I've got the answer myself. The dump files do contain the wikitable information, just in a different form. Adding the --html argument may help extract the wikitables more directly. However, the code seems to have a bug when converting the wikitext to HTML. It raises a KeyError as follows:

File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
    text = compact(text, mark_headers=mark_headers)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
    page.append(listItem[n] % line)
KeyError: '&'

I am using the XML files dumped on 20 Feb 2021 and wikiextractor version 3.0.5.
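Looking at the traceback, the KeyError can only come from the `listItem[n]` lookup itself: the right-hand operand of `%` is a plain string, which would raise TypeError or ValueError rather than KeyError, so `n` must be a marker character (here '&') that the template dict does not know. Below is a self-contained sketch of the failure mode with a defensive fallback; the `listItem` dict is my assumption of the HTML templates compact() uses for the usual *, #, ;, : markers, not a verbatim copy of extract.py:

```python
# Sketch of the failure mode and a defensive fallback.
# listItem below is an assumed stand-in for the template dict in extract.py.
listItem = {'*': '<li>%s</li>', '#': '<li>%s</li>',
            ';': '<dt>%s</dt>', ':': '<dd>%s</dd>'}

def render_item(n, line):
    """Render one list line; extract.py effectively does listItem[n] % line."""
    template = listItem.get(n)
    if template is None:
        # Unexpected marker such as '&' (e.g. from an HTML entity next to
        # the list prefix): keep the raw text instead of raising KeyError.
        return line
    return template % line

print(render_item('*', 'a bullet item'))         # <li>a bullet item</li>
print(render_item('&', 'line with odd marker'))  # falls back to plain text
```

Patching the `page.append(listItem[n] % line)` call in extract.py along these lines, falling back to the raw line for unknown markers, should at least let the --html extraction complete instead of crashing.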