Wikidata / Wikidata-Toolkit-Examples

Examples showing how to use Wikidata Toolkit as a Maven library in your project
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0

JSON dump parsing error #5

Closed GeorgeSigoiu closed 4 months ago

GeorgeSigoiu commented 1 year ago

Hello,

I cloned this project and ran the 'EntityStatisticsProcessor' class, and this is the error that occurs:

********************************************************************
*** Wikidata Toolkit: EntityStatisticsProcessor
*** 
*** This program will download and process dumps from Wikidata.
*** It will print progress information and some simple statistics.
*** Results about property usage will be stored in a CSV file.
*** See source code for further details.
********************************************************************
2022-11-23 11:53:46 INFO  - Using download directory C:\_optimaize\wikidata\Wikidata-Toolkit-Examples\dumpfiles\wikidatawiki
2022-11-23 11:53:46 INFO  - Found 0 local dumps of type JSON: []
2022-11-23 11:53:49 INFO  - Found 360 online dumps of type JSON: [wikidatawiki-json-20221123, wikidatawiki-json-20221116, wikidatawiki-json-20221114, ..., wikidatawiki-json-20170925]
2022-11-23 11:53:49 INFO  - Downloading JSON dump file 20221123.json.gz from https://dumps.wikimedia.org/other/wikidata/20221123.json.gz ...

2022-11-23 11:55:35 ERROR - Error when reading JSON for entity: Missing type id when trying to resolve subtype of [simple type, class org.wikidata.wdtk.datamodel.implementation.FormDocumentImpl]: missing type id property 'type' (for POJO property 'forms')
 at [Source: (GZIPInputStream); line: 2, column: 1492] (through reference chain: org.wikidata.wdtk.datamodel.implementation.LexemeDocumentImpl["forms"]->java.util.ArrayList[0])
2022-11-23 11:55:35 WARN  - Entering recovery mode to parse rest of file. This might be slightly slower.
2022-11-23 11:55:35 WARN  - Skipping rest of current line: BA2239BB8","rank":"normal"}]}}],"pageid":54387040,"ns":146,"title":"Lexeme:L4","lastrevid":171059607[...]id":1710596079,"modified":"2022-08-22T19:28:34Z"},
2022-11-23 11:55:35 ERROR - Error when reading JSON for entity: Missing type id when trying to resolve subtype of [simple type, class org.wikidata.wdtk.datamodel.implementation.FormDocumentImpl]: missing type id property 'type' (for POJO property 'forms')
 at [Source: (String)"{"type":"lexeme","id":"L314","lemmas":{"ca":{"language":"ca","value":"pi"}},"lexicalCategory":"Q1084","language":"Q7026","claims":{"P5185":[{"mainsnak":{"snaktype":"value","property":"P5185","datavalue":{"value":{"entity-type":"item","numeric-id":1775415,"id":"Q1775415"},"type":"wikibase-entityid"},"datatype":"wikibase-item"},"type":"statement","id":"L314$45650151-4ed8-025d-2442-e36ef22e6a2a","rank":"normal"}]},"forms":[{"id":"L314-F1","representations":{"ca":{"language":"ca","value":"pis"}},"gr"[truncated 281 chars]; line: 1, column: 543] (through reference chain: org.wikidata.wdtk.datamodel.implementation.LexemeDocumentImpl["forms"]->java.util.ArrayList[0])
2022-11-23 11:55:35 ERROR - Problematic line was: {"type":"lexeme","id":"L314","lemmas":{"ca":{"lang...

The project uses Wikidata Toolkit version 0.11.0.

I tried newer versions (https://mvnrepository.com/artifact/org.wikidata.wdtk/wdtk-datamodel) and got a different error:

2022-11-23 12:05:48 ERROR - Error when reading JSON for entity: Cannot deserialize value of type `java.util.ArrayList<org.wikidata.wdtk.datamodel.implementation.SenseDocumentImpl>` from Object value (token `JsonToken.START_OBJECT`)
 at [Source: (GZIPInputStream); line: 3, column: 674] (through reference chain: org.wikidata.wdtk.datamodel.implementation.LexemeDocumentImpl["senses"])
2022-11-23 12:05:48 WARN  - Entering recovery mode to parse rest of file. This might be slightly slower.
2022-11-23 12:05:48 WARN  - Skipping rest of current line: id":"L117$2bc66535-41a6-a4ca-3748-060ad3bbe56c","rank":"normal"}],"P1343":[{"mainsnak":{"snaktype":"[...]id":1742289833,"modified":"2022-10-03T18:45:08Z"},
2022-11-23 12:05:48 ERROR - Error when reading JSON for entity: Cannot deserialize value of type `java.util.ArrayList<org.wikidata.wdtk.datamodel.implementation.FormDocumentImpl>` from Object value (token `JsonToken.START_OBJECT`)
 at [Source: (String)"{"type":"lexeme","id":"L68","lemmas":{"fa":{"language":"fa","value":"\u062c\u0627\u0646\u0627\u0646"}},"lexicalCategory":"Q1084","language":"Q9168","claims":{},"forms":{},"senses":{},"pageid":54387656,"ns":146,"title":"Lexeme:L68","lastrevid":683797031,"modified":"2018-05-23T11:27:17Z"}"; line: 1, column: 169] (through reference chain: org.wikidata.wdtk.datamodel.implementation.LexemeDocumentImpl["forms"])
2022-11-23 12:05:48 ERROR - Problematic line was: {"type":"lexeme","id":"L68","lemmas":{"fa":{"langu...

I guess the data is downloaded from https://dumps.wikimedia.org/other/wikidata/ (this link appeared in the console).

What can I do to make this work?
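
For anyone reading the second log: the reported line contains "forms":{} and "senses":{}, that is, JSON objects in places where the deserializer expects a list. The sketch below is plain Java, not part of the Wikidata Toolkit API; the class and method names are made up for illustration. It only shows a quick way to check whether a dump line has that shape before handing it to a parser.

```java
import java.util.List;

public class EmptyCollectionCheck {

    /**
     * Returns true if the line contains the given field serialized as an
     * empty JSON object, e.g. "forms":{}. This is a plain substring test,
     * useful for eyeballing dump lines, not a substitute for a JSON parser.
     */
    public static boolean isEmptyObject(String jsonLine, String field) {
        return jsonLine.contains("\"" + field + "\":{}");
    }

    public static void main(String[] args) {
        // Shortened from the "Problematic line" reported in the log above.
        String line = "{\"type\":\"lexeme\",\"id\":\"L68\",\"claims\":{},\"forms\":{},\"senses\":{}}";
        for (String field : List.of("forms", "senses")) {
            System.out.println(field + " is an empty object: " + isEmptyObject(line, field));
        }
    }
}
```

Lines for which this returns true are the ones the second error message complains about; lines with a populated list such as "forms":[{...}] are not matched.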

TheEaterr commented 4 months ago

Putting this here for anyone who might also run into this issue (as I have).

If I understood correctly, only the latest incremental dump is downloaded automatically, and the tool fails when parsing it because some reference is missing from it. So the full dump needs to be downloaded, either manually or by setting a value in ExampleHelper.java.

Also, this repository is outdated (it hasn't been updated in 5 years). The examples are still in the upstream repository (https://github.com/Wikidata/Wikidata-Toolkit) and are maintained there, so the main repository is probably the one to use.
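
The manual download mentioned above could be sketched roughly as follows. This is a minimal sketch using only the JDK's HTTP client, not the toolkit's own downloader. The URL pattern is copied from the console log in the original report; the class name, method names, and the target path are hypothetical, and where the file has to be placed for the toolkit to pick it up is not covered here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ManualDumpDownload {

    // Base URL as printed in the console log of the original report.
    static final String BASE_URL = "https://dumps.wikimedia.org/other/wikidata/";

    /** Builds the URI of the JSON dump for a date stamp such as "20221123". */
    public static URI dumpUri(String dateStamp) {
        if (!dateStamp.matches("\\d{8}")) {
            throw new IllegalArgumentException("Expected yyyyMMdd, got: " + dateStamp);
        }
        return URI.create(BASE_URL + dateStamp + ".json.gz");
    }

    /** Streams the dump straight to targetFile, so the large body is never held in memory. */
    public static Path download(String dateStamp, Path targetFile)
            throws IOException, InterruptedException {
        Path parent = targetFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(dumpUri(dateStamp)).GET().build();
        HttpResponse<Path> response =
                client.send(request, HttpResponse.BodyHandlers.ofFile(targetFile));
        if (response.statusCode() != 200) {
            // Do not leave an error page behind under the dump's file name.
            Files.deleteIfExists(targetFile);
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java ManualDumpDownload 20221123 dumps/20221123.json.gz
        Path saved = download(args[0], Path.of(args[1]));
        System.out.println("Saved " + saved);
    }
}
```

Writing the response directly to a file keeps memory use flat, which matters because these dumps are very large. Whether a given date stamp points at the dump you actually need is something to check against the listing at the base URL.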

wetneb commented 4 months ago

I have updated the examples to use the latest released version of WDTK, which should solve this issue. @TheEaterr can you confirm?

TheEaterr commented 4 months ago

Yes, there is no error now (though the script does "nothing", as it only downloads a recent diff; there's not much that can reasonably be done about that, other than maybe adding a logging message).

Thank you!