dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

problems with unicode characters #342

Closed: chile12 closed this issue 8 years ago

chile12 commented 9 years ago

Extracting the de dump produces a lot of errors like this: BAD URI: Illegal character in path at index 43: http://de.dbpedia.org/resource/Friedel_Tiek\\u00F6tter... Please have a look into this.

jcsahnwaldt commented 9 years ago

Could you provide a few more details? Or maybe post your extraction.xyz.properties file. Which extractors did you run? An .nt or .ttl file (or a part of one) would also be very helpful. Do these errors occur only in the .nt files or also in the .ttl files?

The double backslash in the error message is rather strange. Looks like the URI has been backslash-escaped twice.
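To make the double-escaping hypothesis concrete, here is a minimal Python sketch. The helper `nt_escape` is hypothetical and is not the framework's actual escaping code; it only assumes a naive escaper that rewrites backslashes and non-ASCII characters. Running it twice over the visible part of the URI from the error message yields a doubled backslash, and the first backslash lands at index 43, the position named in the error.

```python
def nt_escape(iri: str) -> str:
    """Hypothetical naive escaper: escapes backslashes and non-ASCII characters."""
    out = []
    for ch in iri:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")          # a backslash becomes two backslashes
        elif cp < 0x80:
            out.append(ch)              # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)  # BMP character: \uXXXX
        else:
            out.append("\\U%08X" % cp)  # beyond the BMP: \UXXXXXXXX
    return "".join(out)


# Visible part of the URI from the error message, written unescaped.
iri = "http://de.dbpedia.org/resource/Friedel_Tiek\u00F6tter"

once = nt_escape(iri)    # escaped once: a single backslash before u00F6
twice = nt_escape(once)  # escaped twice: the backslash itself gets escaped

print(once)
print(twice)
print(twice.index("\\"))  # position of the first backslash
```

If the URI was in fact escaped twice, a strict URI parser would reject it at that backslash, since a backslash is not a legal character in a URI path.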

chile12 commented 9 years ago

My properties file:

base-dir=C:/Users/Chile/Desktop/testDumps
source=pages-articles.xml.bz2
languages=de
extractors=.MappingExtractor, .InfoboxExtractor
format.nt.bz2=n-triples;uri-policy.default
ontology=../ontology.xml
mappings=../mappings

I can only report on .nt files, where URIs look like this:

http://de.dbpedia.org/resource/\u03A9-Bromacetophenon ...

jcsahnwaldt commented 9 years ago

There's nothing wrong with http://de.dbpedia.org/resource/\u03A9-Bromacetophenon - it's the NT-escaped version of the IRI http://de.dbpedia.org/resource/Ω-Bromacetophenon .

Until recently, NT didn't allow non-ASCII characters; they had to be escaped. See #291 for details.
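As a rough illustration of that relationship, the following Python sketch decodes `\uXXXX` and `\UXXXXXXXX` sequences back into the characters they stand for. The function `nt_unescape` is a hypothetical helper written for this example, not part of the extraction framework, and it handles only these two numeric escape forms.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UCHAR = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def nt_unescape(text: str) -> str:
    """Hypothetical helper: replace numeric escapes with the real characters."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _UCHAR.sub(repl, text)


escaped = "http://de.dbpedia.org/resource/\\u03A9-Bromacetophenon"
print(nt_unescape(escaped))
# prints http://de.dbpedia.org/resource/Ω-Bromacetophenon
```

Applied to a string with a doubled backslash, this decoder would leave one stray backslash in front of the decoded character, which is one way such a URI could end up invalid.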

Please post excerpts from your NT files, especially a few full lines where "BAD URI" occurs. For example, I'd like to know whether they occur in subject or object position.
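One way to gather that information is a small script that scans the .nt output for doubled escape sequences and reports whether they sit in subject, predicate, or object position. The sketch below is only an assumption about how one might do this: `find_double_escapes` is a hypothetical helper, the regular expressions cover only simple one-triple-per-line input with an IRI subject and predicate, and the two sample lines are made up for illustration rather than taken from a real dump.

```python
import re

# Simplified N-Triples line: <subject> <predicate> object .
_TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$")
# Two literal backslashes followed by a \u-style escape body.
_DOUBLE = re.compile(r"\\\\[uU][0-9A-Fa-f]{4}")


def find_double_escapes(lines):
    """Return (line_number, position) pairs for terms with doubled escapes."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = _TRIPLE.match(line)
        if not match:
            continue  # skip comments, blank lines, and anything unparsed
        for position, term in zip(("subject", "predicate", "object"), match.groups()):
            if _DOUBLE.search(term):
                hits.append((number, position))
    return hits


# Made-up sample lines, not real extraction output.
sample = [
    '<http://de.dbpedia.org/resource/Friedel_Tiek\\\\u00F6tter> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "x" .',
    '<http://de.dbpedia.org/resource/\\u03A9-Bromacetophenon> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "y" .',
]
print(find_double_escapes(sample))
```

For a real file, the same function could be fed the lines of the decompressed .nt output instead of the sample list.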

jcsahnwaldt commented 9 years ago

Also, please post your full properties file. The definition of uri-policy.default is missing in the excerpt above...

jcsahnwaldt commented 9 years ago

Which version of the code are you using? The latest master branch from GitHub?

jimkont commented 8 years ago

Fixed in the latest version.