Open jbenton-adc opened 9 years ago
+1
Other syntax errors in this file are on lines 1947033, 2245904, 2305615, and 4391674. An easy workaround is to comment out each offending line, e.g. with variations on sed -i -e '4391674s/^/#/' short_abstracts_en.nt.
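For anyone who prefers not to run sed once per line, the same workaround can be sketched in Python. This is only an illustration: the function name comment_out_lines and the output file name in the usage comment are made up here, and it relies on N-Triples treating lines that start with # as comments.

```python
import io


def comment_out_lines(src, dst, line_numbers):
    """Copy src to dst, prefixing each listed 1-based line with '#'.

    src and dst are text file objects. Returns the number of lines
    that were commented out.
    """
    targets = set(line_numbers)
    changed = 0
    for number, line in enumerate(src, start=1):
        if number in targets:
            # Same effect as sed 'Ns/^/#/' for this line.
            dst.write("#" + line)
            changed += 1
        else:
            dst.write(line)
    return changed


# Hypothetical usage with the line numbers reported in this thread
# (the output file name is an assumption):
#
# bad = [1947033, 2245904, 2305615, 4391674]
# with open("short_abstracts_en.nt", encoding="utf-8") as src, \
#         open("short_abstracts_en.fixed.nt", "w", encoding="utf-8") as dst:
#     comment_out_lines(src, dst, bad)
```

Unlike sed -i, this writes a separate file and leaves the original untouched, which makes it easier to compare the two afterwards.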
Starting with the next release, we will switch to the ttl files, which do not have this problem.
@Vehnem is this fixed/validated in the recent releases?
DBpedia 2014 dataset, short_abstracts_en file, downloaded from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 on 9/29/2014
wget http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2
bunzip2 short_abstracts_en.nt.bz2
head -n 1263475 short_abstracts_en.nt | tail > parse_error.nt
arq --strict --data parse_error.nt --query query.rq
08:53:18 ERROR riot :: [line: 8, col: 122] Not a hexadecimal character:
Failed to load data
This seems to be the triple that is causing the problem:
<http://dbpedia.org/resource/Taiwanese_kana> <http://www.w3.org/2000/01/rdf-schema#comment> "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."@en .
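"\u30A " is not a valid Unicode escape: \u must be followed by exactly four hexadecimal digits, but here there are only three, followed by a space. The same applies to "\u30F " later in the string.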
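To locate other lines with the same defect without running a full parser, a rough Python sketch like the following could be used. It checks only one rule, that \u is followed by four hex digits and \U by eight, and is not an N-Triples validator. The names find_bad_unicode_escapes and scan are made up for this example, and the columns it reports need not match the ones riot prints.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def find_bad_unicode_escapes(line):
    """Return 0-based offsets of malformed backslash-u / backslash-U escapes."""
    bad = []
    i = 0
    while i < len(line):
        if line[i] != "\\":
            i += 1
            continue
        kind = line[i + 1] if i + 1 < len(line) else ""
        if kind in ("u", "U"):
            width = 4 if kind == "u" else 8
            digits = line[i + 2 : i + 2 + width]
            if len(digits) == width and all(c in HEX_DIGITS for c in digits):
                i += 2 + width
                continue
            bad.append(i)
        # Skip the backslash and the character after it, so an escaped
        # backslash is not mistaken for the start of a unicode escape.
        i += 2
    return bad


def scan(lines):
    """Yield (line_number, column) pairs, both 1-based, for each bad escape."""
    for number, line in enumerate(lines, start=1):
        for offset in find_bad_unicode_escapes(line):
            yield number, offset + 1


# Hypothetical usage on the extracted sample from this report:
#
# with open("parse_error.nt", encoding="utf-8") as handle:
#     for line_number, column in scan(handle):
#         print(line_number, column)
```

The scanner walks the line character by character instead of using a single regular expression so that a literal backslash written as two backslashes, followed by the letter u, is not reported as a broken escape.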