TheScienceMuseum / elastic-wikidata

CLI for loading Wikidata subsets (or all of it) into Elasticsearch
https://www.sciencemuseumgroup.org.uk/project/heritage-connector/
MIT License

I have some problem while trying to load #14

Closed xzhaoyooo closed 3 years ago

xzhaoyooo commented 3 years ago

Hello guys,

I was trying to load a Wikidata dump generated by wikibase-dump-filter, but it threw the error messages below. I have no idea how to fix it and hope you can give me some help.

ew dump -p f:\ProjectWDS\dp.ndjson --cluster 127.0.0.1:9200 --user elastic --password changeme
c:\programdata\anaconda3\lib\site-packages\requests\__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.2) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Elasticsearch index: wds
Connecting to Elasticsearch at 127.0.0.1:9200
Temporary disabling refresh for the index. Will reset refresh interval to the default (1s) after load is complete.
Indexing documents...
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\ew.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda3\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\cli.py", line 126, in main
    load_from_dump(path, es_credentials, index, limit, **kwargs)
  File "c:\programdata\anaconda3\lib\site-packages\cli.py", line 144, in load_from_dump
    d.dump_to_es()
  File "c:\programdata\anaconda3\lib\site-packages\elastic_wikidata\dump_to_es.py", line 131, in dump_to_es
    queue_size=self.config["queue_size"],
  File "c:\programdata\anaconda3\lib\site-packages\tqdm\std.py", line 1167, in __iter__
    for obj in iterable:
  File "c:\programdata\anaconda3\lib\site-packages\elasticsearch\helpers\actions.py", line 425, in parallel_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "c:\programdata\anaconda3\lib\multiprocessing\pool.py", line 748, in next
    raise value
  File "c:\programdata\anaconda3\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "c:\programdata\anaconda3\lib\multiprocessing\pool.py", line 140, in _helper_reraises_exception
    raise ex
  File "c:\programdata\anaconda3\lib\multiprocessing\pool.py", line 292, in _guarded_task_generation
    for i, x in enumerate(iterable):
  File "c:\programdata\anaconda3\lib\site-packages\elasticsearch\helpers\actions.py", line 128, in _chunk_actions
    for action, data in actions:
  File "c:\programdata\anaconda3\lib\site-packages\elastic_wikidata\dump_to_es.py", line 167, in generate_actions_from_dump
    doc = self.process_doc(item)
  File "c:\programdata\anaconda3\lib\site-packages\elastic_wikidata\dump_to_es.py", line 152, in process_doc
    return simplify_wbgetentities_result(doc, lang, properties)
  File "c:\programdata\anaconda3\lib\site-packages\elastic_wikidata\wd_entities.py", line 155, in simplify_wbgetentities_result
    newdoc["labels"] = doc["labels"][lang]["value"]
TypeError: string indices must be integers
kdutia commented 3 years ago

It looks like there might be multiple labels in ‘doc’ which I haven’t seen before. Do you have a way of seeing what the doc object looks like?

xzhaoyooo commented 3 years ago

Hello Dutia,

Thank you for your quick reply! I added a try-except statement to print the doc, and this is what it looks like:

Indexing documents...
0it [00:00, ?it/s]something wrong
{'id': 'Q31'}
kdutia commented 3 years ago

Where are you printing doc from? That doesn't look like it's correct, as line 154 means that elastic-wikidata will only look for labels if the 'labels' key exists in the index.

Are you printing newdoc instead of doc in your try-except?

xzhaoyooo commented 3 years ago

Yes, sorry for my carelessness. Here is the correct doc:

{'id': 'Q31', 'type': 'item', 'labels': {'en': 'Belgium', 'de': 'Belgien', 'en-gb': 'Belgium', 'it': 'Belgio', 'nb': 'Belgia', 'eo': 'Belgio', 'pl': 'Belgia', 'ru': 'Бельгия', 'es': 'Bélgica', 'be-tarask': 'Бэльгія', 'sgs': 'Belgėjė', 'rup': 'Belghia', 'nan': 'Belgien', 'vro': 'Belgiä', 'roa-tara': 'Bèlge', 'yue': '比利時', 'nds-nl': 'België', 'nl': 'België', 'br': 'Belgia', 'fr': 'Belgique', 'ja': 'ベルギー', 'zh-hant': '比利時', 'en-ca': 'Belgium', 'wa': 'Beldjike', 'pt': 'Bélgica', 'mk': 'Белгија', 'la': 'Belgica', 'hsb': 'Belgiska', 'dsb': 'Belgiska', 'ace': 'Bèlgia', 'af': 'België', 'am': 'ቤልጅግ', 'an': 'Belchica', 'ang': 'Belgice', 'ar': 'بلجيكا', 'arc': 'ܒܠܓܝܩܐ', 'arz': 'بلجيكا', 'ast': 'Bélxica', 'ay': 'Bilkiya', 'az': 'Belçika', 'ba': 'Бельгия', 'bar': 'Bäigien', 'bcl': 'Belhika', 'be': 'Бельгія', 'bg': 'Белгия', 'bi': 'Belgium', 'bn': 'বেলজিয়াম', 'bo': 'པེར་ཅིན།', 'bpy': 'বেলজিয়াম', 'bs': 'Belgija', 'bug': 'Belgia', 'bxr': 'Бельги', 'ca': 'Bèlgica', 'cdo': 'Bī-lé-sì', 'ce': 'Бельги', 'ceb': 'Belhika', 'chr': 'ᏇᎵᏥᎥᎻ', 'ckb': 'بەلجیکا', 'co': 'Belgica', 'cs': 'Belgie', 'csb': 'Belgijskô', 'cu': 'Бєлгїѥ', 'cv': 'Бельги', 'cy': 'Gwlad Belg', 'da': 'Belgien', 'diq': 'Belçıka', 'dv': 'ބެލްޖިއަމް', 'dz': 'བེལ་ཇིཡམ', 'ee': 'Belgium', 'el': 'Βέλγιο', 'eml': 'Bélgi', 'et': 'Belgia', 'eu': 'Belgika', 'ext': 'Bélgica', 'fa': 'بلژیک', 'fi': 'Belgia', 'fo': 'Belgia', 'frp': 'Bèlg·ique', 'frr': 'Bälgien', 'fur': 'Belgjo', 'fy': 'Belgje', 'ga': 'An Bheilg', 'gag': 'Belgiya', 'gd': "A' Bheilg", 'gl': 'Bélxica', 'gn': 'Véyhika', 'gv': 'Yn Velg', 'hak': 'Pí-li-sṳ̀', 'haw': 'Pelekiuma', 'he': 'בלגיה', 'hi': 'बेल्जियम', 'hif': 'Belgium', 'hr': 'Belgija', 'ht': 'Bèljik', 'hu': 'Belgium', 'hy': 'Բելգիա', 'ia': 'Belgica', 'id': 'Belgia', 'ie': 'Belgia', 'ilo': 'Belhika', 'io': 'Belgia', 'is': 'Belgía', 'jbo': 'beldjym', 'jv': 'Bélgié', 'ka': 'ბელგია', 'kaa': 'Belgiya', 'kab': 'Biljik', 'kbd': 'Белгэ', 'kg': 'Belezi', 'kk': 'Бельгия', 'kl': 'Belgia', 'ko': '벨기에', 'koi': 'Белгия', 'krc': 
'Бельгия', 'ksh': 'Belgien', 'ku': 'Belgiya', 'kv': 'Бельгия', 'kw': 'Pow Belg', 'ky': 'Бельгия', 'lad': 'Beljika', 'lb': 'Belsch', 'lez': 'Бельгия', 'li': 'Belsj', 'lij': 'Belgio', 'lmo': 'Belgi', 'ln': 'Bɛ́ljika', 'lt': 'Belgija', 'ltg': 'Beļgeja', 'lv': 'Beļģija', 'mdf': 'Бельгие', 'mg': 'Belzika', 'mhr': 'Бельгий', 'mi': 'Pehiamu', 'ml': 'ബെൽജിയം', 'mn': 'Бельги', 'mr': 'बेल्जियम', 'ms': 'Belgium', 'mt': 'Belġju', 'my': 'ဘယ်လ်ဂျီယမ်နိုင်ငံ', 'na': 'Berdjiyum', 'nah': 'Belgica', 'nap': 'Belge', 'nds': 'Belgien', 'ne': 'बेल्जियम', 'new': 'बेल्जियम', 'nn': 'Belgia', 'nov': 'Belgia', 'nrm': 'Belgique', 'nv': 'Bélgii Bikéyah', 'oc': 'Belgica', 'or': 'ବେଲଜିଅମ', 'os': 'Бельги', 'pa': 'ਬੈਲਜੀਅਮ', 'pam': 'Belhika', 'pap': 'Bélgika', 'pcd': 'Bergike', 'pdc': 'Belgien', 'pfl': 'Belgje', 'pih': 'Beljum', 'pms': 'Belgi', 'pnb': 'بیلجیم', 'pnt': 'Βέλγιον', 'ps': 'بلجیم', 'qu': 'Bilhika', 'rm': 'Belgia', 'rmy': 'Beljiya', 'ro': 'Belgia', 'rue': 'Белґія', 'rw': 'Ububiligi', 'sa': 'बेल्जियम्', 'sah': 'Бельгия', 'scn': 'Belgiu', 'sco': 'Belgium', 'se': 'Belgia', 'sg': 'Bêleze', 'sh': 'Belgija', 'sk': 'Belgicko', 'sl': 'Belgija', 'so': 'Beljim', 'sq': 'Belgjika', 'sr': 'Белгија', 'srn': 'Belgikondre', 'ss': 'IBhelijiyamu', 'stq': 'Belgien', 'su': 'Bélgia', 'sv': 'Belgien', 'sw': 'Ubelgiji', 'szl': 'Belgijo', 'ta': 'பெல்ஜியம்', 'te': 'బెల్జియం', 'tet': 'Béljika', 'tg': 'Белгия', 'th': 'ประเทศเบลเยียม', 'tk': 'Belgiýa', 'tl': 'Belgium', 'tpi': 'Beljiam', 'tr': 'Belçika', 'tt': 'Бельгия', 'tum': 'Belgium', 'udm': 'Бельгия', 'ug': 'بېلگىيە', 'uk': 'Бельгія', 'ur': 'بلجئیم', 'uz': 'Belgiya', 'vec': 'Belgio', 'vep': "Bel'gii", 'vi': 'Bỉ', 'vls': 'België', 'vo': 'Belgän', 'war': 'Belhika', 'wo': 'Belsik', 'wuu': '比利时', 'xal': 'Бельҗмудин Нутг', 'xmf': 'ბელგია', 'yi': 'בעלגיע', 'yo': 'Bẹ́ljíọ̀m', 'zea': 'Belhië', 'zh': '比利时', 'zh-cn': '比利时', 'zh-hans': '比利 时', 'zh-sg': '比利时', 'zh-my': '比利时', 'zh-hk': '比利時', 'zh-tw': '比利時', 'zh-mo': '比利時', 'de-ch': 'Belgien', 'pt-br': 'Bélgica', 'sma': 
'Belgia', 'liv': 'Beļgij', 'gu': 'બેલ્જિયમ', 'tokipona': 'ma Pesije', 'sr-ec': 'Белгија', 'sr-el': 'Belgija', 'lo': 'ປະເທດແບນຊິກ', 'crh-latn': 'Belçika', 'gsw': 'Belgie', 'ha': 'Beljik', 'si': 'බෙල්ජියම', 'rn': 'Ububirigi', 'sn': 'Belgium', 'ff': 'Beljik', 'lzh': '比利時', 'tw': 'Belgium', 'sc': 'Bèlgiu', 'cbk-zam': 'Bélgica', 'om': 'Beeljiyeem', 'mzn': 'بلژیک', 'ty': 'Peretita', 'bm': 'Bɛliziki', 'myv': 'Бельгия Мастор', 'pag': 'Belhika', 'lrc': 'بلجیک', 'azb': 'بلژیک', 'av': 'Бельгия', 'gom': 'बेल्जियम', 'pi': 'बेल्जियम', 'ady': 'Белгие', 'sm': 'Peleseuma', 'ts': 'Belgium', 'ig': 'Belgium', 'jam': 'Beljiom', 'bgn': 'بلجیم', 'kn': 'ಬೆಲ್ಜಿಯಂ', 'sd': 'بيلجيم', 'bho': 'बेल्जियम', 'olo': "Bel'gii"}, 'descriptions': {'en': 'constitutional monarchy in Western Europe', 'en-gb': 'country in Europe', 'fr': "pays d'Europe", 'it': "Stato dell'Europa occidentale, membro dell'Unione europea", 'nb': 'land i Europa', 'ru': 'страна в Западной Европе', 'es': 'país de Europa', 'nl': 'federale staat in West-Europa', 'de': 'Staat in Westeuropa', 'zh-hant': '西歐國家', 'zh-hans': '西欧国家', 'zh-cn': '西欧国家', 'zh-sg': '西欧国家', 'zh-my': '西欧国家', 'zh': '西 欧国家', 'zh-hk': '西歐國家', 'zh-tw': '西歐國家', 'zh-mo': '西歐國家', 'ca': "país d'Europa", 'fi': 'valtio Länsi-Euroopassa', 'ilo': 'pederal nga estado idiay Lumaud nga Europa', 'cs': 'stát v Evropě', 'la': 'civitas Europae', 'pt-br': 'país da Europa', 'ta': 'மேற்கு ஐரோப்பிய நாடு', 'pt': 'país da Europa', 'sk': 'štát v Europe', 'ja': '西ヨーロッパに位置する国家', 'eo': 'federacio en Eŭropo', 'br': 'stad Europa', 'da': 'Et land i Europa', 'tet': 'rai iha Europa', 'sr': 'држава у западној Европи', 'sr-ec': 'држава у западној Европи', 'sr-el': 'država u zapadnoj Evropi', 'cy': 'wlad yn Ewrop', 'uk': 'держава у Західній Європі', 'hu': 'állam Nyugat-Európában', 'ro': 'stat în Europa de Vest', 'el': 'χώρα της δυτικής Ευρώπης', 'ne': 'युरोपको देश', 'mk': 'земја во Европа', 'scn': "paìsi di l'Europa di punenti", 'hy': 'դաշնային թագավորություն Եվրոպայում', 'jv': 'nagara ing Éropah', 
'ko': '유럽에 있는 나라', 'pa': "ਯੂਰਪ 'ਚ ਦੇਸ਼", 'sv': 'konstitutionell monarki i Västeuropa', 'kn': 'ಪಶ್ಚಿಮ ಯುರೋಪಿನ ಸಾಂವಿಧಾನಿಕ ರಾಜಾಡಳಿತ ಹೊಂದಿರುವ ದೇಶ', 'vls': "pays d'europe", 'hsb': 'stat w zapadnej Europje'}, 'aliases': {'en': ['Kingdom of Belgium', 'be'], 'be-tarask': ['Бэльґія'], 'nan': ['Belgie', 'Pí-lī-sî', 'België', 'Belgium'], 'roa-tara': ['Belge', 'Belgio'], 'yue': ['Koninkrijk België', 'Belgien', 'België', 'Royaume de Belgique', 'Belgium', 'Königreich Belgien', 'Belgique'], 'nds-nl': ['Belgie'], 'fr': ['Royaume de Belgique', 'belgium'], 'nl': ['Koninkrijk België', 'Belgie', 'BE'], 'en-gb': ['Southern Netherlands'], 'zh-hant': ['比利時王國'], 'pam': ['Belgium', 'Belika'], 'lij': ['Belgiò'], 'zh-hans': ['比利时王国'], 'zh-cn': ['比利时王国'], 'zh-sg': ['比利时王国'], 'zh-my': ['比利时王国'], 'zh': ['比利时王国'], 'zh-hk': ['比利時王國'], 'zh-tw': ['比利時王國'], 'zh-mo': ['比利時王國'], 'ca': ['Estat belga', 'België', 'Regne de Bèlgica', 'Bélgica'], 'nb': ['Kongeriket Belgia'], 'th': ['เบลเยียม', 'Belgium', 'เบลเยี่ยม', 'ประเทศเบลเยี่ยม', 'ราชอาณาจักรเบลเยียม', 'ราชอาณาจักรเบลเยี่ยม'], 'pt-br': ['Reino da Bélgica'], 'ta': ['பெல்சியம்', 'பெல்ஜிய பேரரசு'], 'fi': ['Belgian kuningaskunta'], 'ru': ['Belgique', 'Королевство Бельгия'], 'es': ['Reino de Bélgica'], 'el': ['Βασίλειο του Βελγίου'], 'mk': ['Кралство Белгија'], 'hy': ['Բելգիայի Թագավորություն'], 'jv': ['Kraton Bélgié'], 'ko': ['벨기에 왕국', '벨기에왕국'], 'tl': ['Belhika']}, 'claims': {'P1464': ['Q7463296'], 'P1036': ['2--493'], 'P138': ['Q206443'], 'P31': ['Q3624078', 'Q43702', 'Q185441', 'Q6256', 'Q160016', 'Q6505795'], 'P30': ['Q46'], 'P36': ['Q239'], 'P47': ['Q183', 'Q32', 'Q142', 'Q29999'], 'P37': ['Q7411', 'Q150', 'Q188'], 'P38': ['Q4916'], 'P78': ['Q39773'], 'P41': ['Flag of Belgium (civil).svg'], 'P85': ['Q161539'], 'P94': ['Great coat of arms of Belgium.svg'], 'P163': ['Q12990'], 'P297': ['BE'], 'P298': ['BEL'], 'P299': ['056'], 'P242': ['Europe location BEL.png'], 'P373': ['Belgium'], 'P227': ['4005406-8'], 'P402': ['52411'], 'P150': ['Q9337', 'Q240', 
'Q9331', 'Q89959', 'Q90027', 'Q231'], 'P474': ['+32'], 'P122': ['Q41614', 'Q3330103'], 'P209': ['Q1755321', 'Q1230309', 'Q1761425'], 'P625': [[51, 5]], 'P35': ['Q155004'], 'P6': ['Q950958'], 'P214': ['144248059'], 'P208': ['Q390947'], 'P92': ['Q633629'], 'P610': ['Q322824'], 'P463': ['Q458', 'Q1065', 'Q7184', 'Q41550', 'Q8908', 'Q13116', 'Q42262', 'Q7825', 'Q141720', 'Q1542735', 'Q152299', 'Q151991', 'Q81299', 'Q191384', 'Q827525', 'Q656801', 'Q1043527', 'Q899770', 'Q340195', 'Q188822', 'Q782942', 'Q161549', 'Q1377612'], 'P856': ['http://www.belgium.be'], 'P910': ['Q4366768'], 'P948': ['Belgium Banner.jpg'], 'P237': ['Q199614'], 'P349': ['00560624'], 'P998': ['Regional/Europe/Belgium/'], 'P984': ['BEL'], 'P421': ['Q25989', 'Q6655', 'Q6723', 'Q207020'], 'P982': ['5b8a5ee5-0bb3-34cf-9a75-c27c44e341fc'], 'P17': ['Q31'], 'P194': ['Q1137059'], 'P646': ['/m/0154j'], 'P901': ['BE'], 'P268': ['15238382r'], 'P269': ['172396506'], 'P1151': ['Q3247091'], 'P571': ['1830-10-04T00:00:00.000Z'], 'P1566': ['2802361'], 'P1465': ['Q6334541'], 'P1082': [11150516], 'P1740': ['Q7522716'], 'P1791': ['Q7974978'], 'P1792': ['Q7021332'], 'P1313': ['Q213107'], 'P1549': ['Belg', 'Belgian', 'belga', 'Belgische', 'Belge', 'Belgier', 'belgier', 'Belgiano', 'belgo', 'Belgänan', 'பெல்ஜியர்'], 'P1842': ['Belgium'], 'P605': ['BE'], 'P1343': ['Q302556', 'Q2657718', 'Q4114391'], 'P530': ['Q32', 'Q38', 'Q183', 'Q347', 'Q408'], 'P244': ['n80126041'], 'P2184': ['Q205317'], 'P898': ['ˈbɛlgɪɑ', 'ˈbʲelʲɡʲɪjə'], 'P2258': ['206'], 'P935': ['België - Belgique'], 'P1589': [], 'P2131': [531546586178], 'P2163': ['249848'], 'P1417': ['place/Belgium'], 'P1622': ['Q14565199'], 'P2633': ['Q1115035'], 'P949': ['000981380'], 'P1198': [8], 'P2134': [25444420303], 'P2299': [43435], 'P2884': [230], 'P2852': ['Q1061257', 'Q25648793', 'Q25648794'], 'P2853': ['Q1378312', 'Q2335536'], 'P2927': [0.8], 'P1332': [[51.5, 4.77]], 'P3221': ['destination/belgium'], 'P3106': ['world/belgium'], 'P2959': ['Q25929919'], 'P3270': [6], 
'P3271': [18], 'P2997': [18], 'P3238': ['0'], 'P3000': [18], 'P2046': [30528], 'P3348': ['2020'], 'P1081': [0.755, 0.774, 0.806, 0.851, 0.874, 0.866, 0.883, 0.886, 0.889, 0.888, 0.89], 'P3417': ['Belgium']}, 'sitelinks': {'enwikivoyage': 'Belgium', 'eswikivoyage': 'Bélgica', 'elwikivoyage': 'Βέλγιο', 'frwikivoyage': 'Belgique', 'itwikivoyage': 'Belgio', 'plwikivoyage': 'Belgia', 'ptwikivoyage': 'Bélgica', 'rowikivoyage': 'Belgia', 'ruwikivoyage': 'Бельгия', 'svwikivoyage': 'Belgien', 'ukwikivoyage': 'Бельгія', 'viwikivoyage': 'Bỉ', 'commonswiki': 'België - Belgique', 'zhwikivoyage': '比利时', 'enwikiquote': 'Belgium', 'frwikiquote': 'Belgique', 'hewikiquote': 'בלגיה', 'nlwikiquote': 'België', 'eswikiquote': 'Bélgica', 'itwikiquote': 'Belgio', 'plwikiquote': 'Belgia', 'enwiki': 'Belgium', 'dewiki': 'Belgien', 'frwiki': 'Belgique', 'eswiki': 'Bélgica', 'ruwiki': 'Бельгия', 'itwiki': 'Belgio', 'jawiki': 'ベルギー', 'nlwiki': 'België', 'plwiki': 'Belgia', 'ptwiki': 'Bélgica', 'zhwiki': '比利时', 'svwiki': 'Belgien', 'fawiki': 'بلژیک', 'hewiki': 'בלגיה', 'trwiki': 'Belçika', 'huwiki': 'Belgium', 'arwiki': 'بلجيكا', 'viwiki': 'Bỉ', 'nowiki': 'Belgia', 'ukwiki': 'Бельгія', 'kowiki': '벨기에', 'cawiki': 'Bèlgica', 'cswiki': 'Belgie', 'srwiki': 'Белгија', 'rowiki': 'Belgia', 'idwiki': 'Belgia', 'dawiki': 'Belgien', 'simplewiki': 'Belgium', 'bgwiki': 'Белгия', 'acewiki': 'Bèlgia', 'afwiki': 'België', 'alswiki': 'Belgien', 'amwiki': 'ቤልጅግ', 'anwiki': 'Belchica', 'angwiki': 'Belgice', 'arcwiki': 'ܒܠܓܝܩܐ', 'arzwiki': 'بلجيكا', 'astwiki': 'Bélxica', 'aywiki': 'Bilkiya', 'azwiki': 'Belçika', 'bawiki': 'Бельгия', 'bat_smgwiki': 'Belgėjė', 'bclwiki': 'Belhika', 'bewiki': 'Бельгія', 'be_x_oldwiki': 'Бэльгія', 'bhwiki': 'बेल्जियम', 'bnwiki': 'বেলজিয়াম', 'bowiki': 'པེར་ཅིན།', 'bpywiki': 'বেলজিয়াম', 'brwiki': 'Belgia', 'bswiki': 'Belgija', 'bugwiki': 'Belgia', 'bxrwiki': 'Бельги', 'cbk_zamwiki': 'Bélgica', 'cdowiki': 'Bī-lé-sì', 'cewiki': 'Бельги', 'cebwiki': 'Belhika', 'chrwiki': 'ᏇᎵᏥᎥᎻ', 
'ckbwiki': 'بەلجیکا', 'cowiki': 'Belgica', 'crhwiki': 'Belçika', 'csbwiki': 'Belgijskô', 'cuwiki': 'Бєлгїѥ', 'cvwiki': 'Бельги', 'cywiki': 'Gwlad Belg', 'diqwiki': 'Belçıka', 'dsbwiki': 'Belgiska', 'dvwiki': 'ބެލްޖިއަމް', 'dzwiki': 'བེལ་ཇིཡམ', 'eewiki': 'Belgium', 'elwiki': 'Βέλγιο', 'emlwiki': 'Bélgi', 'eowiki': 'Belgio', 'etwiki': 'Belgia', 'euwiki': 'Belgika', 'extwiki': 'Bélgica', 'ffwiki': 'Beljik', 'fiu_vrowiki': 'Belgiä', 'fowiki': 'Belgia', 'frpwiki': 'Bèlg·ique', 'furwiki': 'Belgjo', 'fywiki': 'Belgje', 'gawiki': 'An Bheilg', 'gagwiki': 'Belgiya', 'gdwiki': "A' Bheilg", 'glwiki': 'Bélxica', 'gnwiki': 'Véyhika', 'guwiki': 'બેલ્જિયમ', 'gvwiki': 'Yn Velg', 'hawiki': 'Beljik', 'hakwiki': 'Pí-li-sṳ̀', 'hawwiki': 'Pelekiuma', 'hiwiki': 'बेल्जियम', 'hifwiki': 'Belgium', 'hrwiki': 'Belgija', 'hsbwiki': 'Belgiska', 'htwiki': 'Bèljik', 'hywiki': 'Բելգիա', 'iawiki': 'Belgica', 'iewiki': 'Belgia', 'ilowiki': 'Belhika', 'iowiki': 'Belgia', 'iswiki': 'Belgía', 'jbowiki': 'beldjym', 'kawiki': 'ბელგია', 'kaawiki': 'Belgiya', 'kabwiki': 'Biljik', 'kbdwiki': 'Белгиэ', 'kgwiki': 'Belezi', 'kkwiki': 'Бельгия', 'klwiki': 'Belgia', 'koiwiki': 'Белгия', 'krcwiki': 'Бельгия', 'kshwiki': 'Belgien', 'kuwiki': 'Belçîka', 'kvwiki': 'Бельгия', 'kwwiki': 'Pow Belg', 'kywiki': 'Бельгия', 'lawiki': 'Belgica', 'ladwiki': 'Beljika', 'lbwiki': 'Belsch', 'lezwiki': 'Бельгия', 'liwiki': 'Belsj', 'lijwiki': 'Belgio', 'lmowiki': 'Belgi', 'lnwiki': 'Bɛ́ljika', 'lowiki': 'ປະເທດແບນຊິກ', 'ltwiki': 'Belgija', 'ltgwiki': 'Beļgeja', 'lvwiki': 'Beļģija', 'mdfwiki': 'Бельгие', 'mgwiki': 'Belzika', 'mhrwiki': 'Бельгий', 'miwiki': 'Pehiamu', 'mkwiki': 'Белгија', 'mlwiki': 'ബെൽജിയം', 'mnwiki': 'Бельги', 'mrwiki': 'बेल्जियम', 'mswiki': 'Belgium', 'mtwiki': 'Belġju', 'mywiki': 'ဘယ်လ်ဂျီယမ်နိုင်ငံ', 'mznwiki': 'بلژیک', 'nawiki': 'Berdjiyum', 'nahwiki': 'Belgica', 'napwiki': 'Belge', 'ndswiki': 'Belgien', 'nds_nlwiki': 'België', 'newiki': 'बेल्जियम', 'newwiki': 'बेल्जियम', 'nnwiki': 'Belgia', 'novwiki': 
'Belgia', 'nrmwiki': 'Belgique', 'nvwiki': 'Bélgii Bikéyah', 'ocwiki': 'Belgica', 'omwiki': 'Beeljiyeem', 'orwiki': 'ବେଲଜିଅମ', 'oswiki': 'Бельги', 'pawiki': 'ਬੈਲਜੀਅਮ', 'pamwiki': 'Belgika', 'papwiki': 'Bélgika', 'pcdwiki': 'Bergike', 'pdcwiki': 'Belgien', 'pihwiki': 'Beljum', 'pmswiki': 'Belgi', 'pnbwiki': 'بیلجیم', 'pntwiki': 'Βέλγιον', 'quwiki': 'Bilhika', 'rmwiki': 'Belgia', 'rmywiki': 'Beljiya', 'rnwiki': 'Ububirigi', 'roa_tarawiki': 'Bèlge', 'ruewiki': 'Белґія', 'rwwiki': 'Ububiligi', 'sawiki': 'बेल्जियम्', 'sahwiki': 'Бельгия', 'scwiki': 'Bèlgiu', 'scnwiki': 'Belgiu', 'scowiki': 'Belgium', 'sewiki': 'Belgia', 'sgwiki': 'Bêleze', 'shwiki': 'Belgija', 'siwiki': 'බෙල්ජියම', 'skwiki': 'Belgicko', 'slwiki': 'Belgija', 'snwiki': 'Belgium', 'sowiki': 'Beljim', 'sqwiki': 'Belgjika', 'srnwiki': 'Belgikondre', 'sswiki': 'IBhelijiyamu', 'stqwiki': 'Belgien', 'suwiki': 'Bélgia', 'swwiki': 'Ubelgiji', 'szlwiki': 'Belgijo', 'tawiki': 'பெல்ஜியம்', 'tewiki': 'బెల్జియం', 'tetwiki': 'Béljika', 'tgwiki': 'Белгия', 'thwiki': 'ประเทศเบลเยียม', 'tkwiki': 'Belgiýa', 'tlwiki': 'Belhika', 'tpiwiki': 'Beljiam', 'ttwiki': 'Бельгия', 'tumwiki': 'Belgium', 'twwiki': 'Belgium', 'udmwiki': 'Бельгия', 'ugwiki': 'بېلگىيە', 'urwiki': 'بلجئیم', 'uzwiki': 'Belgiya', 'vecwiki': 'Belgio', 'vepwiki': "Bel'gii", 'vlswiki': 'België', 'vowiki': 'Belgän', 'wawiki': 'Beldjike', 'warwiki': 'Belhika', 'wowiki': 'Belsik', 'wuuwiki': '比利时', 'xalwiki': 'Бельҗмудин Нутг', 'xmfwiki': 'ბელგია', 'yiwiki': 'בעלגיע', 'yowiki': 'Bẹ́ljíọ̀m', 'zeawiki': 'Belhië', 'zh_classicalwiki': '比利時', 'zh_yuewiki': '比利時', 'nlwikinews': 'België', 'svwikinews': 'Belgien', 'fawikivoyage': 'بلژیک', 'fiwiki': 'Belgia', 'dewikivoyage': 'Belgien', 'nlwikivoyage': 'België', 'bmwiki': 'Bɛliziki', 'myvwiki': 'Бельгия Мастор', 'ptwikibooks': 'Bélgica', 'biwiki': 'Beljiom', 'pagwiki': 'Belhika', 'azbwiki': 'بلژیک', 'gomwiki': 'बेल्जियम', 'avwiki': 'Бельгия', 'pflwiki': 'Belgien', 'piwiki': 'बेल्जियम', 'hywikiquote': 'Բելգիա', 'frrwiki': 
'Belgien', 'barwiki': 'Bejgien', 'pswiki': 'بېلجیم', 'adywiki': 'Белгие', 'smwiki': 'Peleseuma', 'tswiki': 'Belgium', 'igwiki': 'Belgium', 'lrcwiki': 'بلجیک', 'ruwikisource': 'Бельгия', 'jamwiki': 'Beljiom', 'frwikinews': 'Catégorie:Belgique', 'hewikivoyage': 'בלגיה', 'sdwiki': 'بيلجيم', 'knwiki': 'ಬೆಲ್ಜಿಯಂ', 'zh_min_nanwiki': 'Pe̍k-ní-gī', 'jvwiki': 'Bèlgi', 'olowiki': "Bel'gii", 'fiwikivoyage': 'Belgia', 'enwikibooks': 'Pinyin/Belgium'}}

BTW, at first it raised UnicodeDecodeError: 'gbk' codec can't decode byte xxx..., so I modified line 159 in dump_to_es.py. It now reads: with open(self.dump_path, 'r', encoding='utf-8') as f:
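
For context on the encoding error: when no encoding is passed, Python's open() falls back to the platform's locale encoding, which is presumably GBK on this Windows machine, while the dump is UTF-8. Below is a minimal sketch of reading an ndjson dump with the encoding set explicitly. The helper name iter_ndjson is hypothetical and not part of elastic-wikidata.

```python
import json
from typing import Iterator


def iter_ndjson(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an ndjson file.

    Hypothetical helper for illustration; not part of elastic-wikidata.
    """
    # Passing encoding explicitly avoids depending on the locale default
    # (e.g. GBK), which cannot decode arbitrary UTF-8 bytes.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Setting the encoding at the call site keeps the behaviour the same across machines, regardless of the system locale.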

kdutia commented 3 years ago

I've figured out what's wrong: the code looks for item[lang]['value'] for each language in doc['labels'], meaning it expects each value of doc['labels'] to be a dict with key 'value'. In your example each value is just a string.

I'm hesitant to fix this in the codebase until I've worked out why your data samples are different from what I extracted from wikibase-dump-filter, but you should be able to fix it by making this replacement as follows:

# old lines
if lang in doc.get("labels", {}):
    newdoc["labels"] = doc["labels"][lang]["value"]

# new lines
if lang in doc.get("labels", {}):
    newdoc["labels"] = doc["labels"][lang]

Thanks for the utf-8 fix, I'll add it to the next release 🙂
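
Since a label may arrive either as a plain string (as in the doc above) or as a dict with a 'value' key (as the current code expects), a tolerant lookup could accept both. This is only a sketch: get_label is a hypothetical helper, not a function in elastic-wikidata.

```python
from typing import Optional


def get_label(doc: dict, lang: str) -> Optional[str]:
    """Return the label for `lang`, or None if the doc has no label in that language.

    Hypothetical helper for illustration; accepts both label shapes.
    """
    label = doc.get("labels", {}).get(lang)
    if isinstance(label, dict):
        # Nested shape: {"en": {"value": "Belgium", ...}}
        return label.get("value")
    # Flat shape: {"en": "Belgium"}, or None when the language is missing
    return label
```

With this, newdoc["labels"] = get_label(doc, lang) would work for either input, at the cost of one isinstance check per document.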

xzhaoyooo commented 3 years ago

Anytime. 😄 I wonder whether it's because I enabled the --simplify option in wikibase-dump-filter. I modified the code as you suggested, but there were still more statements that needed changing. I'll run the filter again without that option and see.

kdutia commented 3 years ago

The --simplify flag may well be it! Let me know what you find.

xzhaoyooo commented 3 years ago

Hi Dutia,

I think I've figured out the reason. Yes, it was exactly the --simplify flag. I filtered a Wikidata dump again with only the --keep flag set, and it loaded fine.

That's because the --simplify flag flattens each nested dict into a single string, like this:

{
    'labels': {
        'en': 'Belgium'
    }
}

The original format, by contrast, looks like this:

{
    'labels': {
        'en': {
            'value': '...',
            '...': '...'
        }
    }
}
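
A loader could tell the two shapes apart up front by inspecting one label value. The sketch below assumes that every label in a given entity has the same shape; is_simplified is a hypothetical helper, not something elastic-wikidata provides.

```python
def is_simplified(entity: dict) -> bool:
    """Guess whether an entity came from a --simplify dump, judging by its labels.

    Hypothetical helper for illustration only.
    """
    labels = entity.get("labels") or {}
    for value in labels.values():
        # Simplified dumps store each label as a plain string;
        # the original format stores a dict with a 'value' key.
        return isinstance(value, str)
    # No labels to inspect: cannot tell, so assume the original format.
    return False
```

Checking the first entity of a dump this way would let the tool fail early with a clear message instead of a TypeError deep inside the bulk load.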

Maybe you could also add support for simplified dumps in the next release. XD

Thanks a lot for your help, and have a nice day!

kdutia commented 3 years ago

Makes sense, thanks for helping me sort this out. I'll add a note to the readme for now and maybe add the ability to run from a simplified dump in a future release.