HamLaertes opened 3 years ago
I think I've got the answer myself. The dump files do contain the wikitable information, just in a different form.
Adding the argument --html
may help extract the wikitables more directly. However, the code seems to have a bug when converting the wikitext to HTML.
It reports a KeyError as follows:
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
    text = compact(text, mark_headers=mark_headers)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
    page.append(listItem[n] % line)
KeyError: '&'
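The traceback ends at page.append(listItem[n] % line), so the KeyError comes from the listItem dictionary lookup rather than from the % formatting: the marker character n is '&', which has no template in the dict. A minimal sketch of the mechanism and a defensive workaround (the dict contents and the render helper here are illustrative assumptions, not the upstream wikiextractor code):

```python
# Sketch of the failure mode in compact(): the names listItem, n, and line
# follow the traceback, but the dict contents below are an assumption based
# on how HTML list rendering typically works, not the exact upstream code.
listItem = {'*': '<li>%s</li>', '#': '<li>%s</li>',
            ';': '<dt>%s</dt>', ':': '<dd>%s</dd>'}

def render(n, line):
    # Upstream does listItem[n] % line directly, so an unexpected marker
    # such as '&' (e.g. an HTML entity at the start of a list line) raises
    # KeyError. Falling back to the raw text avoids crashing the worker.
    template = listItem.get(n)
    return template % line if template else line

print(render('*', 'first item'))  # -> <li>first item</li>
print(render('&', 'nbsp;'))       # falls back to plain text, no KeyError
```

Such a .get() fallback only papers over the symptom, of course; the real fix would be to make the list-marker parsing tolerate (or skip) entity-bearing lines.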
I am using the XML dump from 20 February 2021 and wikiextractor version 3.0.5.
Hello everyone. I downloaded the first file
enwiki-20210220-pages-articles1.xml-p1p41242.bz2
from the Wikipedia dump server. I successfully extracted the text by running the script. However, the extracted text seems to omit the table information in the wiki pages, i.e. the wikitables. Am I missing something, or do the dump files not contain table information at all? Thanks!
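For what it's worth, the dumps do store tables, but as raw wikitext rather than rendered HTML: a wikitable opens with {| and closes with |}. A quick way to confirm this against a page's wikitext (the sample markup below is illustrative):

```python
import re

# Raw wikitext as stored in the XML dump: tables use {| ... |} syntax.
sample = """{| class="wikitable"
|-
! Header 1 !! Header 2
|-
| cell A || cell B
|}"""

# Wikitables open with '{|' and close with '|}' at the start of a line.
tables = re.findall(r'^\{\|.*?^\|\}', sample, flags=re.DOTALL | re.MULTILINE)
print(len(tables))  # -> 1
```

Plain-text extraction simply drops this markup, which is why the tables appear to be missing from the output.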