dice-group / LIMES

Link Discovery Framework for Metric Spaces.
https://limes.demos.dice-research.org/
GNU Affero General Public License v3.0

"47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC" with DBpedia labels #234

Closed: KonradHoeffner closed this issue 4 years ago

KonradHoeffner commented 4 years ago

I get thousands of warnings like the ones below with DBpedia. Is that correct?

Target Configuration

<TARGET>
    <ID>dbpedia</ID>
    <ENDPOINT>en.ttl</ENDPOINT>
    <VAR>?dbpedia</VAR>
    <PAGESIZE>-1</PAGESIZE>
    <RESTRICTION>?dbpedia a owl:Thing</RESTRICTION>
    <PROPERTY>rdfs:label AS nolang RENAME label</PROPERTY>
    <TYPE>TURTLE</TYPE>
</TARGET>

File: https://downloads.dbpedia.org/repo/dbpedia/generic/labels/2020.06.01/labels_lang=en.ttl.bz2

Warnings

15:23:54.658 [main] [] WARN  org.apache.jena.riot:95 - [line: 772507, col: 1 ] Bad IRI: <http://dbpedia.org/resource/ACR_Alvorense_1º_Dezembro> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.658 [main] [] WARN  org.apache.jena.riot:95 - [line: 772507, col: 1 ] Bad IRI: <http://dbpedia.org/resource/ACR_Alvorense_1º_Dezembro> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.704 [main] [] WARN  org.apache.jena.riot:95 - [line: 779858, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Cheers> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.704 [main] [] WARN  org.apache.jena.riot:95 - [line: 779858, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Cheers> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779859, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Heroes_&_Villains> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779859, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Heroes_&_Villains> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779860, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Laughs> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779860, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Laughs> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779861, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movie_Quotes> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779861, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movie_Quotes> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779862, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movies> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779862, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movies> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779863, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movies_(10th_Anniversary_Edition)> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.705 [main] [] WARN  org.apache.jena.riot:95 - [line: 779863, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Movies_(10th_Anniversary_Edition)> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779864, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Passions> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779864, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Passions> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779865, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Songs> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779865, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Songs> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779866, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Stars> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779866, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Stars> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.706 [main] [] WARN  org.apache.jena.riot:95 - [line: 779867, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Thrills> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779867, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years_…_100_Thrills> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779868, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Cheers> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779868, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Cheers> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779869, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Heroes_and_Villains> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779869, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Heroes_and_Villains> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779870, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Laughs> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779870, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Laughs> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779871, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movie_Quotes> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779871, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movie_Quotes> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779872, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movies> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779872, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movies> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779873, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movies_(10th_Anniversary_Edition)> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779873, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Movies_(10th_Anniversary_Edition)> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
15:23:54.707 [main] [] WARN  org.apache.jena.riot:95 - [line: 779874, col: 1 ] Bad IRI: <http://dbpedia.org/resource/AFI's_100_Years…100_Passions> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
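The IRIs in these warnings all contain º (U+00BA) or … (U+2026), which are Unicode compatibility characters, so strings containing them are not in Normal Form KC. The following is a minimal Python sketch, independent of Jena, that shows which characters NFKC would rewrite. The helper name `nfkc_report` is hypothetical, and the per-character check is a simplification that ignores combining sequences.

```python
import unicodedata


def nfkc_report(iri: str) -> list[tuple[str, str]]:
    """Return (character, NFKC form) pairs for each character NFKC would change."""
    return [
        (ch, unicodedata.normalize("NFKC", ch))
        for ch in iri
        if unicodedata.normalize("NFKC", ch) != ch
    ]


# Two IRIs taken from the warnings above.
iris = [
    "http://dbpedia.org/resource/ACR_Alvorense_1º_Dezembro",
    "http://dbpedia.org/resource/AFI's_100_Years…100_Movies",
]

for iri in iris:
    print(iri, nfkc_report(iri))
```

NFKC maps º to a plain "o" and … to three full stops, which is consistent with Jena reporting both NOT_NFKC and COMPATIBILITY_CHARACTER for the same IRI.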

Workaround: I could get rid of the warnings by keeping only the lines that consist purely of ASCII characters, using grep -P '^[[:ascii:]]+$'.
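The same filter can be sketched in Python for cases where grep -P is not available. This is only an illustration of the workaround, not part of LIMES; the function names and file paths are hypothetical. Like the grep pattern, it drops empty lines as well as any line containing a non-ASCII character.

```python
import bz2
from typing import Iterable, Iterator


def ascii_only(lines: Iterable[str]) -> Iterator[str]:
    """Yield only non-empty lines made up entirely of ASCII characters."""
    for line in lines:
        body = line.rstrip("\n")
        if body and body.isascii():
            yield line


def filter_file(src: str, dst: str) -> None:
    """Copy the ASCII-only lines of a bz2-compressed text file to dst.

    Both paths are placeholders, e.g. filter_file("labels_lang=en.ttl.bz2", "en.ttl").
    """
    with bz2.open(src, "rt", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(ascii_only(fin))


# Small in-memory demonstration with two label triples.
label = "<http://www.w3.org/2000/01/rdf-schema#label>"
sample = [
    f'<http://dbpedia.org/resource/Berlin> {label} "Berlin"@en .\n',
    f'<http://dbpedia.org/resource/ACR_Alvorense_1º_Dezembro> {label} "ACR Alvorense 1º Dezembro"@en .\n',
]
kept = list(ascii_only(sample))
print(len(kept))
```

Note that this removes every resource whose IRI or label contains a non-ASCII character, so those resources will not take part in link discovery.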

Version: LIMES started via java -Xmx11G -jar ~/opt/limes/limes-core/target/limes-core-1.7.4-SNAPSHOT.jar, master branch, version 1.7.4-SNAPSHOT, commit ae81ba402c67e89ceb23f8cb872b01f5a5e25419. OpenJDK 14 on Arch Linux.

kvndrsslr commented 4 years ago

Yes, this is correct. If you feel that it should be otherwise, please report it to Apache Jena.

KonradHoeffner commented 4 years ago

Then it could also be an issue with DBpedia itself.