dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

invalid hexadecimal characters in short_abstracts_en #273

Open jbenton-adc opened 9 years ago

jbenton-adc commented 9 years ago

This concerns the short_abstracts_en file from the DBpedia 2014 dataset, downloaded from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 on 9/29/2014.

wget http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2
bunzip2 short_abstracts_en.nt.bz2
head -n 1263475 short_abstracts_en.nt | tail > parse_error.nt
arq --strict --data parse_error.nt --query query.rq
08:53:18 ERROR riot :: [line: 8, col: 122] Not a hexadecimal character:
Failed to load data

This seems to be the triple that is causing the problem:

http://dbpedia.org/resource/Taiwanese_kana http://www.w3.org/2000/01/rdf-schema#comment "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."@en .

"\u30A " is not a valid Unicode escape: \u must be followed by exactly four hexadecimal digits, but here only three ("30A") precede the space.
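
For anyone who wants to locate such lines without running a full parser, here is a minimal Python sketch that flags backslash escapes that are not well-formed. It assumes the N-Triples escape rules (\u plus four hex digits, \U plus eight hex digits, and the single-character escapes for t, b, n, r, f, double quote, single quote, and backslash) and, as a simplification, applies them to the whole line instead of treating IRIs and literals separately. The names find_invalid_escapes and scan_lines are made up for this example and are not part of any DBpedia or Jena tooling.

```python
import re

# A well-formed escape, assuming the N-Triples rules:
# \uXXXX, \UXXXXXXXX, or one of \t \b \n \r \f \" \' \\
_VALID_ESCAPE = re.compile(r'\\(?:u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|[tbnrf"\'\\])')


def find_invalid_escapes(line):
    """Return the 1-based columns of backslashes that do not start a valid escape."""
    bad_columns = []
    i = 0
    while True:
        i = line.find("\\", i)
        if i == -1:
            return bad_columns
        match = _VALID_ESCAPE.match(line, i)
        if match:
            # Skip the whole escape so that "\\\\" is not read as two escapes.
            i = match.end()
        else:
            bad_columns.append(i + 1)
            i += 1


def scan_lines(lines):
    """Yield (line_number, bad_columns) for every line with a malformed escape."""
    for number, line in enumerate(lines, start=1):
        bad_columns = find_invalid_escapes(line)
        if bad_columns:
            yield number, bad_columns
```

Passing an open file handle for short_abstracts_en.nt to scan_lines should list candidate lines to inspect. The columns are character offsets within the line and may not match the column numbers a parser reports.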

Hronom commented 9 years ago

+1

mgns commented 9 years ago

Same reported here: http://stackoverflow.com/questions/26415922/why-do-i-get-not-a-hexadecimal-character-when-using-tdbloader2

pasky commented 8 years ago

Other syntax errors in this file are on lines 1947033, 2245904, 2305615, 4391674. A quick workaround is to comment out each offending line, e.g. with variations on sed -i -e '4391674s/^/#/' short_abstracts_en.nt.
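
The same workaround can be sketched in Python for anyone who prefers to handle all the lines in one pass. This is only an illustration: comment_out_lines and comment_out_file are hypothetical helper names, and the approach relies on N-Triples treating lines that start with # as comments.

```python
def comment_out_lines(lines, bad_line_numbers):
    """Yield each line unchanged, except that the listed 1-based lines get a '#' prefix."""
    bad = set(bad_line_numbers)
    for number, line in enumerate(lines, start=1):
        yield ("#" + line) if number in bad else line


def comment_out_file(src_path, dst_path, bad_line_numbers):
    """Stream src_path to dst_path, commenting out the listed lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(comment_out_lines(src, bad_line_numbers))
```

For example, comment_out_file("short_abstracts_en.nt", "short_abstracts_en.fixed.nt", [1947033, 2245904, 2305615, 4391674]) would cover the lines listed above. The line from the original report would have to be added to the list as well. Unlike sed -i, this writes a new file instead of editing in place.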

jimkont commented 8 years ago

Starting with the next release, we will switch to the ttl files, which do not have this problem.

m1ci commented 4 years ago

@Vehnem is this fixed/validated in the recent releases?