attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

wikiextractor 3.0.6 not extracting #306

Closed: wayneworkman closed this issue 1 year ago

wayneworkman commented 1 year ago

Hello,

I'm running wikiextractor 3.0.6 on Debian 11, trying to extract this large file: wikidatawiki-20220820-pages-articles-multistream.xml.bz2 (128 GB).

I'm using this command: nohup wikiextractor --json --templates template_file wikidatawiki-20220820-pages-articles-multistream.xml.bz2 &
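
For completeness, this is the equivalent invocation with an explicit output directory and process count, and the log captured to a file (a sketch; the -o and --processes options are from the wikiextractor 3.x CLI, adjust as needed):

nohup wikiextractor --json --templates template_file \
    -o text \
    --processes 79 \
    wikidatawiki-20220820-pages-articles-multistream.xml.bz2 > extract.log 2>&1 &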

I see output as it runs:

INFO: Preprocessed 102600000 pages
INFO: Preprocessed 102700000 pages
INFO: Preprocessed 102800000 pages
INFO: Preprocessed 102900000 pages
INFO: Preprocessed 103000000 pages
INFO: Preprocessed 103100000 pages
INFO: Saved 9924 templates to 'template_file'
INFO: Loaded 9924 templates in 100923.3s
INFO: Starting page extraction from wikidatawiki-20220820-pages-articles-multistream.xml.bz2.
INFO: Using 79 extract processes.
INFO: Extracted 100000 articles (378.1 art/s)
INFO: Extracted 200000 articles (487.5 art/s)
INFO: Extracted 300000 articles (549.8 art/s)
INFO: Extracted 400000 articles (622.6 art/s)
INFO: Extracted 500000 articles (592.0 art/s)
INFO: Extracted 600000 articles (687.6 art/s)

However, when I look in the text output directory, none of the files contain any article text. They all look like this:

{"id": "2467937", "revid": "21728", "url": "https://www.wikidata.org/wiki?curid=2467937", "title": "Q2557445", "text": ""}
{"id": "2467938", "revid": "150965", "url": "https://www.wikidata.org/wiki?curid=2467938", "title": "Q2557446", "text": ""}
{"id": "2467939", "revid": "1554155", "url": "https://www.wikidata.org/wiki?curid=2467939", "title": "Q2557447", "text": ""}
{"id": "2467940", "revid": "150965", "url": "https://www.wikidata.org/wiki?curid=2467940", "title": "Q2557448", "text": ""}
{"id": "2467941", "revid": "150965", "url": "https://www.wikidata.org/wiki?curid=2467941", "title": "Q2557449", "text": ""}
{"id": "2467942", "revid": "119076", "url": "https://www.wikidata.org/wiki?curid=2467942", "title": "Q2557450", "text": ""}
{"id": "2467943", "revid": "2242783", "url": "https://www.wikidata.org/wiki?curid=2467943", "title": "Q2557451", "text": ""}
{"id": "2467944", "revid": "2709538", "url": "https://www.wikidata.org/wiki?curid=2467944", "title": "Q2557452", "text": ""}
{"id": "2467945", "revid": "5161409", "url": "https://www.wikidata.org/wiki?curid=2467945", "title": "Q2557453", "text": ""}
{"id": "2467946", "revid": "53290", "url": "https://www.wikidata.org/wiki?curid=2467946", "title": "Q2557454", "text": ""}
{"id": "2467947", "revid": "2883061", "url": "https://www.wikidata.org/wiki?curid=2467947", "title": "Q2557456", "text": ""}

I don't know what I'm doing wrong, and I'm not sure what the issue is. I would appreciate any help.
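
For reference, this is how I checked that every record is empty (a quick sketch, assuming jq is installed and wikiextractor wrote to the default text/ directory):

cat text/*/wiki_* | jq -r 'if .text == "" then "empty" else "nonempty" end' | sort | uniq -c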

kevincstowe commented 1 year ago

I had this same issue. It was caused by using the multistream file (wikidatawiki-20230320-pages-articles-multistream.xml.bz2) rather than the regular pages-articles file. Switching to enwiki-latest-pages-articles.xml.bz2 fixed the issue.
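
If it helps anyone else, switching dumps amounts to something like this (a sketch; the URL follows the standard dumps.wikimedia.org layout, substitute your wiki and dump date):

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
nohup wikiextractor --json --templates template_file enwiki-latest-pages-articles.xml.bz2 &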

wayneworkman commented 1 year ago

@kevincstowe that fixed the issue. Thank you.

vishwa27yvs commented 7 months ago

I am using the file wikidatawiki-20231120-pages-articles.xml.bz2, but I still face this issue: there is no extracted text, even though it is not a multistream file.

@wayneworkman, @kevincstowe could you please advise on how this can be resolved? I need the articles for Wikidata entities (not Wikipedia articles).