Wikidata / Wikidata-Toolkit

Java library to interact with Wikibase
https://www.mediawiki.org/wiki/Wikidata_Toolkit
Apache License 2.0

Negative dates with low precision lack a leading zero in RDF export #128

Open mkroetzsch opened 9 years ago

mkroetzsch commented 9 years ago

Exporting a date like "4th century BC" (see https://www.wikidata.org/wiki/Q4318) leads to a wrongly formatted XSD literal: "-400" rather than "-0400". The reason is that the formatting method for negative years in TimeValueConverter.java is only invoked for dates with at least year precision (this is because the formatting code is currently coupled to the year-zero correction code, which is only meaningful for year-level precision). The fix will be to separate year-zero correction from negative-year formatting and use two if-statements instead.
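A minimal sketch of the proposed split, not the actual TimeValueConverter.java code. The class, method, and flag names are hypothetical, the precision constants are assumed to be ordered from coarse to fine (7 for century, 9 for year), and the direction of the year-zero correction (shifting years at or below zero down by one) is likewise an assumption made for illustration:

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Wikidata Toolkit.
final class YearFormatSketch {
    // Assumed precision codes, ordered coarse to fine.
    static final int PREC_CENTURY = 7;
    static final int PREC_YEAR = 9;

    /** Returns the year part of an XSD date literal, padded to at least four digits. */
    static String formatYear(long year, int precision, boolean applyYearZeroCorrection) {
        // First if-statement: year-zero correction, which is only meaningful
        // when the value is precise to the year or finer.
        // Assumption: the correction moves years <= 0 down by one.
        if (applyYearZeroCorrection && precision >= PREC_YEAR && year <= 0) {
            year = year - 1;
        }
        // Second if-statement: negative-year formatting, applied at every
        // precision so that coarse dates are padded as well.
        if (year < 0) {
            return "-" + String.format(Locale.ROOT, "%04d", -year);
        }
        return String.format(Locale.ROOT, "%04d", year);
    }
}
```

With this separation, `formatYear(-400, PREC_CENTURY, true)` yields "-0400": the correction is skipped because the precision is coarser than a year, but the padding still applies.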

congwang-ai commented 9 years ago

I can help fix this bug (actually, I reported this one). It would be great if you could point me to the Java class that needs to be modified.

There are also other bugs: some triples contain "\110" or "\", etc., which cause encoding issues.

mkroetzsch commented 9 years ago

Great, thanks. The issue is caused by this code.

I don't know what could cause the encoding issues, since we use OpenRDF for character encoding, and it should make sure that all string content is valid. It would be best to open another bug for this if you have more information.

congwang-ai commented 9 years ago

OK. I'll open another issue about character encoding later this evening.

mkroetzsch commented 9 years ago

The year-zero correction should be removed from our export, since it was converting XML Schema 1.1 to XML Schema 1.0 format. First of all, we want our exports to conform to XSD 1.1 (and thus to RDF 1.1). Secondly, it is currently unclear if the date encoding in the JSON is in XSD 1.1 or XSD 1.0 or in a mix of the two :-(. WMDE is working on clarifying how dates can be restored to follow a standard, but right now historic Wikidata dates should not be considered to be exact to the year, esp. in the BCE range.

Moreover, note that there is now issue #133 for tracking the character encoding issue.