diegoceccarelli / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON/Avro dump
Apache License 2.0

how to convert dumps other than EN or IT? #60

Open fabriziorizzo opened 2 years ago

fabriziorizzo commented 2 years ago

Can you add documentation to the README on how to extend this solution to other languages? FR and ES did not work. For example:

$ ./scripts/convert-xml-dump-to-json.sh fr /u01/wikip/dumps.wikipedia/frwiki/frwiki-latest-pages-articles.xml.bz2 ./frwiki-latest-pages-articles.json

Converting mediawiki xml dump to json dump (./frwiki-latest-pages-articles.json)
2021-12-15 00:25:50,990 1086 [main] ERROR it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI - Parsing the mediawiki
java.lang.IllegalArgumentException: No enum constant it.cnr.isti.hpc.wikipedia.article.Language.FR
    at java.base/java.lang.Enum.valueOf(Enum.java:240) ~[na:na]
    at it.cnr.isti.hpc.wikipedia.article.Language.valueOf(Language.java:8) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at it.cnr.isti.hpc.wikipedia.parser.ArticleParser.<init>(ArticleParser.java:66) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at it.cnr.isti.hpc.wikipedia.reader.WikipediaArticleReader.<init>(WikipediaArticleReader.java:95) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.call(MediawikiToJsonCLI.java:55) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.call(MediawikiToJsonCLI.java:27) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine.access$1300(CommandLine.java:145) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at picocli.CommandLine.execute(CommandLine.java:2078) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
    at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.main(MediawikiToJsonCLI.java:65) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]

diegoceccarelli commented 2 years ago

@fabriziorizzo thanks for reporting the issue.

There are two issues actually:

Spanish (ES) is supported, but some time ago I introduced a regression (#32) that I only noticed thanks to your comment. Could you please check out this PR branch, https://github.com/diegoceccarelli/json-wikipedia/tree/language, compile it, and check whether it fixes the problem?

French is not supported, and you are right, I should add documentation on how to add a new language! I'll do that. To support a new language you have to:

  1. Provide the mapping for the XML Wikipedia dump in that particular language (e.g., which keyword is used to indicate a disambiguation page in French, which keyword indicates a redirect, etc.). You provide the mapping by writing a properties file called locale-fr.properties and putting it in the lang folder, as in this example: https://github.com/diegoceccarelli/json-wikipedia/blob/language/src/main/resources/lang/locale-es.properties

  2. Once you have added the properties file to the folder, open src/main/avro/article.avsc and add FR to the list of languages, as I did for ES in #61.
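
To illustrate why both steps are needed, here is a minimal, self-contained sketch; it is not the project's actual code. The nested `Language` enum is only a stand-in for `it.cnr.isti.hpc.wikipedia.article.Language`, the class name `LanguageSupportSketch` and both helper methods are hypothetical, and the property keys and values (`redirect`, `disambiguation`) are assumptions for illustration: the real key names should be copied from locale-es.properties and the French keywords checked against the French dump.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

public class LanguageSupportSketch {

    // Stand-in for the project's Language enum. FR is deliberately missing,
    // which mirrors the situation reported in the stack trace above.
    public enum Language { EN, IT, ES }

    // Returns true if the language code maps to an enum constant.
    // Enum.valueOf throws IllegalArgumentException for unknown names,
    // which is the "No enum constant ...Language.FR" error in the log.
    public static boolean isSupported(String code) {
        try {
            Language.valueOf(code.toUpperCase(Locale.ROOT));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Parses the text of a locale properties file (step 1).
    public static Properties parseLocale(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(isSupported("es")); // true: ES is in the enum
        System.out.println(isSupported("fr")); // false until FR is added (step 2)

        // Hypothetical content of a locale-fr.properties file; the key names
        // and values are illustrative assumptions, not the project's real ones.
        Properties fr = parseLocale("redirect=#REDIRECTION\ndisambiguation=Homonymie\n");
        System.out.println(fr.getProperty("redirect"));
    }
}
```

In this sketch, adding `FR` to the enum corresponds to step 2, and the string passed to `parseLocale` corresponds to the file written in step 1.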

Please let me know if it works and, if you write the French mapping, it would be great if you could contribute it. Cheers