Closed: wayneworkman closed this issue 1 year ago
I had this same issue. It was caused by using the multistream file (wikidatawiki-20230320-pages-articles-multistream.xml.bz2) rather than the normal pages-articles file. Switching to enwiki-latest-pages-articles.xml.bz2 fixed it for me.
@kevincstowe that fixed the issue. Thank you.
I am using the file wikidatawiki-20231120-pages-articles.xml.bz2, but I still face this issue: there is no extracted text, even though it is not a multistream file.
@wayneworkman @kevincstowe could you please help with how this can be resolved? I need the articles of Wikidata entities (not Wikipedia articles).
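A note on the Wikidata case specifically: Wikidata pages store entities as JSON rather than wikitext, which would explain empty output from a wikitext extractor regardless of which dump file is used. If the goal is the entities themselves, one option is Wikimedia's dedicated Wikidata JSON dump. Below is a minimal sketch of streaming it; the filename is illustrative, and it assumes the dump's usual layout of one entity object per line inside a top-level JSON array (with trailing commas):

```python
import bz2
import json

def iter_entities(path):
    """Stream entity dicts from a Wikidata JSON dump (.json.bz2).

    Assumes the usual dump layout: a top-level "[" line, then one
    JSON entity per line ending in a comma, then a closing "]".
    """
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip array brackets and blank lines
            yield json.loads(line)

# Illustrative usage (hypothetical filename):
# for i, entity in enumerate(iter_entities("wikidata-latest-all.json.bz2")):
#     label = entity.get("labels", {}).get("en", {}).get("value")
#     print(entity["id"], label)
#     if i >= 4:
#         break
```

Streaming line by line this way avoids loading the multi-hundred-GB decompressed array into memory.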
Hello,
I'm running wikiextractor 3.0.6 on Debian 11. I'm trying to extract this big file:
wikidatawiki-20220820-pages-articles-multistream.xml.bz2
(128 GB in size). I'm using this command:
nohup wikiextractor --json --templates template_file wikidatawiki-20220820-pages-articles-multistream.xml.bz2 &
I see output as it runs:
However, when I look in the text directory to see the extracted text, no file has article text. All the files produced look like this:
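For anyone hitting this, one way to confirm whether any text was actually extracted: with --json, wikiextractor writes one JSON object per line, each with a "text" field. A small sketch that walks the output directory and counts empty versus non-empty records (directory name and field names assumed from that output format):

```python
import json
import os

def count_texts(output_dir):
    """Walk a wikiextractor --json output tree (e.g. text/AA/wiki_00)
    and count records with empty vs. non-empty "text" fields."""
    empty = non_empty = 0
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    if record.get("text", "").strip():
                        non_empty += 1
                    else:
                        empty += 1
    return empty, non_empty

# e.g. empty, non_empty = count_texts("text")
```

If non_empty comes back as 0 across the whole tree, the extraction genuinely produced no article text, rather than the text being hidden in some other shard.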
I don't know what I'm doing wrong or what the issue is. I would appreciate any help.