Open fabriziorizzo opened 2 years ago
@fabriziorizzo thanks for reporting the issue.
There are actually two issues:
Spanish (ES) is supported, but some time ago I introduced a regression (#32) that I only noticed thanks to your comment. Could you please check out this PR branch, https://github.com/diegoceccarelli/json-wikipedia/tree/language, compile it, and check whether it fixes the problem?
French is not supported, and you are right, I should add documentation on how to add a new language! I'll do that. To support a new language you have to:
Provide the mapping for the Wikipedia XML dump in that particular language (e.g., which keyword is used to indicate a disambiguation page in French? Which keyword indicates a redirect? etc.). You provide the mapping by writing a property file called locale-fr.properties and putting it in the lang folder; see for example https://github.com/diegoceccarelli/json-wikipedia/blob/language/src/main/resources/lang/locale-es.properties
Once you have added the property file to the folder, open src/main/avro/article.avsc and add FR to the list of languages, as I did for ES in #61.
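To make the first step more concrete, here is a minimal sketch of how a per-language property file can drive keyword detection. It is not the project's parser: the key names `redirect` and `disambiguation`, the sample values, and the class `LocaleSketch` are assumptions made for illustration, so copy the real key names from locale-es.properties when writing locale-fr.properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Properties;

// Sketch only: key names and values below are hypothetical placeholders,
// not the keys json-wikipedia actually reads.
class LocaleSketch {

    // Stand-in for the contents of a locale-fr.properties file.
    static final String SAMPLE_FR =
        "redirect=#REDIRECTION\n" +
        "disambiguation=Homonymie\n";

    // Parse property-file text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // True if the wikitext starts with the locale's redirect keyword
    // (compared case-insensitively).
    static boolean isRedirect(Properties locale, String wikitext) {
        String keyword = locale.getProperty("redirect");
        if (keyword == null) {
            return false;
        }
        String text = wikitext.trim().toUpperCase(Locale.ROOT);
        return text.startsWith(keyword.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Properties fr = parse(SAMPLE_FR);
        System.out.println(isRedirect(fr, "#REDIRECTION [[Paris]]")); // true
        System.out.println(isRedirect(fr, "Paris est la capitale"));  // false
    }
}
```

The point of the design is that the parsing code stays the same for every language and only the keyword file changes.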
Please let me know if it works. If you do write it, it would be great if you could contribute French support. Cheers
Can you add documentation to the README on how to extend this solution to other languages? FR and ES did not work. For example:
$ ./scripts/convert-xml-dump-to-json.sh fr /u01/wikip/dumps.wikipedia/frwiki/frwiki-latest-pages-articles.xml.bz2 ./frwiki-latest-pages-articles.json
Converting mediawiki xml dump to json dump (./frwiki-latest-pages-articles.json)
2021-12-15 00:25:50,990 1086 [main] ERROR it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI - Parsing the mediawiki
java.lang.IllegalArgumentException: No enum constant it.cnr.isti.hpc.wikipedia.article.Language.FR
at java.base/java.lang.Enum.valueOf(Enum.java:240) ~[na:na]
at it.cnr.isti.hpc.wikipedia.article.Language.valueOf(Language.java:8) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.parser.ArticleParser.<init>(ArticleParser.java:66) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.reader.WikipediaArticleReader.<init>(WikipediaArticleReader.java:95) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.call(MediawikiToJsonCLI.java:55) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.call(MediawikiToJsonCLI.java:27) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine.executeUserObject(CommandLine.java:1953) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine.access$1300(CommandLine.java:145) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine$RunLast.handle(CommandLine.java:2346) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine$RunLast.handle(CommandLine.java:2311) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at picocli.CommandLine.execute(CommandLine.java:2078) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI.main(MediawikiToJsonCLI.java:65) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
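The top frames of the trace show the failure coming from `Enum.valueOf`: the language code is resolved as a constant of the `Language` enum, and no `FR` constant exists. The sketch below reproduces that behaviour with a stand-in enum. The class `LanguageLookupSketch`, its constant list, and the upper-casing step are assumptions for illustration; the project's real enum appears to come from src/main/avro/article.avsc and may be resolved differently.

```java
import java.util.Locale;

// Sketch only: a stand-in enum, not the project's Language class.
class LanguageLookupSketch {

    // Hypothetical constant list; FR is deliberately absent.
    enum Language { EN, IT, ES }

    // Assumed lookup: upper-case the CLI code, then resolve it by name.
    // Enum.valueOf throws IllegalArgumentException for unknown names.
    static Language parse(String code) {
        return Language.valueOf(code.toUpperCase(Locale.ROOT));
    }

    // Friendlier check that reports support instead of throwing.
    static boolean isSupported(String code) {
        try {
            parse(code);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupported("es")); // true
        System.out.println(isSupported("fr")); // false
    }
}
```

Under these assumptions, adding `FR` to the enum's constant list is what makes the lookup succeed, which matches the maintainer's instruction to add FR to the list of languages.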