attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Wikidata Extraction #325

Open vishwa27yvs opened 7 months ago

vishwa27yvs commented 7 months ago

Is it possible to parse Wikidata dumps using wikiextractor? I have been trying to run

python -m wikiextractor.WikiExtractor wikidatawiki-20231120-pages-articles.xml.bz2 --json --templates wikidata_template_file

but every record in the extracted JSON files has an empty "text" field, so no text has been extracted:

{"id": "111", "revid": "2876004", "url": "https://www.wikidata.org/wiki?curid=111", "title": "Q15", "text": ""}
{"id": "114", "revid": "192818", "url": "https://www.wikidata.org/wiki?curid=114", "title": "Q17", "text": ""}
{"id": "115", "revid": "1433337", "url": "https://www.wikidata.org/wiki?curid=115", "title": "Q18", "text": ""}
{"id": "118", "revid": "734320", "url": "https://www.wikidata.org/wiki?curid=118", "title": "Q20", "text": ""}
{"id": "119", "revid": "306431", "url": "https://www.wikidata.org/wiki?curid=119", "title": "Q21", "text": ""}
{"id": "120", "revid": "734320", "url": "https://www.wikidata.org/wiki?curid=120", "title": "Q22", "text": ""}
{"id": "121", "revid": "734320", "url": "https://www.wikidata.org/wiki?curid=121", "title": "Q25", "text": ""}
{"id": "122", "revid": "734320", "url": "https://www.wikidata.org/wiki?curid=122", "title": "Q26", "text": ""}
{"id": "123", "revid": "123021", "url": "https://www.wikidata.org/wiki?curid=123", "title": "Q27", "text": ""}
{"id": "124", "revid": "5799034", "url": "https://www.wikidata.org/wiki?curid=124", "title": "Q28", "text": ""}
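
To quantify the problem across a whole output file, here is a minimal sketch that counts how many records have an empty "text" field. It assumes the extractor writes one JSON object per line, as in the records above; the helper name `count_empty_text` is hypothetical and not part of wikiextractor.

```python
import json


def count_empty_text(lines):
    """Return (total, empty) counts for JSON-lines records like those above.

    `lines` is any iterable of strings, e.g. an open output file.
    This helper is illustrative only; it is not part of wikiextractor.
    """
    total = 0
    empty = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        total += 1
        # Treat a missing or whitespace-only "text" field as empty.
        if not record.get("text", "").strip():
            empty += 1
    return total, empty


# Two of the records shown above, used as sample input.
sample = [
    '{"id": "111", "revid": "2876004", "url": "https://www.wikidata.org/wiki?curid=111", "title": "Q15", "text": ""}',
    '{"id": "114", "revid": "192818", "url": "https://www.wikidata.org/wiki?curid=114", "title": "Q17", "text": ""}',
]

total, empty = count_empty_text(sample)
print(f"{empty} of {total} records have empty text")
```

The same function can be given an open file handle for one of the extracted output files, since it only iterates over lines.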