ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Predicates of the form `@de@<http://rdaregistry.info/Elements/u/P60470>` in query result #1513

Closed epoz closed 1 month ago

epoz commented 1 month ago

After indexing data with QLever, queries sometimes return strange values for predicates. To reproduce:

curl https://qlever.cs.uni-freiburg.de/api/dnb --data 'query=SELECT%20%3Fp%20%3Fo%20WHERE%0A%20%7B%0A%20%20%3Chttps%3A//d-nb.info/454818084%3E%20%3Fp%20%3Fo%20.%0A%20%7D' -X POST
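For readability, the percent-encoded `query` parameter in the curl command above can be decoded back into plain SPARQL. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and the encoded string is copied unchanged from the command.

```python
from urllib.parse import unquote

# Percent-encoded query string, copied verbatim from the curl command.
encoded = (
    "SELECT%20%3Fp%20%3Fo%20WHERE%0A%20%7B%0A%20%20"
    "%3Chttps%3A//d-nb.info/454818084%3E%20%3Fp%20%3Fo%20.%0A%20%7D"
)

# unquote() turns %XX escapes back into the characters they stand for.
decoded = unquote(encoded)
print(decoded)
```

The decoded query selects every predicate and object of the single subject `<https://d-nb.info/454818084>`, so the predicate position is the variable `?p`.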

This produces values that look like:

{
    "head": {
        "vars": [
            "p",
            "o"
        ]
    },
    "results": {
        "bindings": [
            {
                "o": {
                    "type": "bnode",
                    "value": "bn27771275"
                },
                "p": {
                    "type": "uri",
                    "value": "http://id.loc.gov/vocabulary/relators/aut"
                }
            },
---8<----   etc.  ---8<---- 
            {
                "o": {
                    "type": "literal",
                    "value": "mit Abb.",
                    "xml:lang": "de"
                },
                "p": {
                    "type": "literal",
                    "value": "@de@<http://rdaregistry.info/Elements/u/P60470>"
                }
            }
        ]
    }
}

Notice the odd literal shown as a predicate, as if there were a line-break issue when parsing the source data. This example uses the DNB dataset, one of the example datasets provided by the authors.

It has also happened with triples that I created myself, where I checked the input and the triples did not appear to be malformed.

joka921 commented 1 month ago

Just to make sure that I understand your question correctly: Are you wondering about the strange predicate @de@<http:....>, which is (wrongly) marked as a literal? Those predicates and their triples are artifacts of QLever's implementation of language filters. They currently leak into the results of queries in which not every triple has a fixed predicate (in your query, the predicate is the variable ?p). It is on our list to fully hide these triples (with respect to exports, query results, and statistics) so that this can't happen.
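Until these internal triples are hidden, a client can drop the affected rows after the fact. The following is a minimal sketch, not part of QLever: the helper names are hypothetical, and the pattern `@<language tag>@<IRI>` is an assumption inferred from the single value shown in this thread. The sample data is taken from the result posted above.

```python
import re

# Assumed shape of a leaked internal predicate, inferred from the one
# example in this thread: "@de@<http://rdaregistry.info/Elements/u/P60470>".
INTERNAL_PREDICATE = re.compile(r"^@[A-Za-z0-9-]+@<[^>]*>$")


def is_internal_predicate(term):
    """Return True if a SPARQL JSON term looks like a leaked internal predicate."""
    return bool(INTERNAL_PREDICATE.match(term.get("value", "")))


def drop_internal_bindings(result, predicate_var="p"):
    """Return a copy of a SPARQL JSON result without the leaked rows."""
    kept = [
        row
        for row in result["results"]["bindings"]
        if not is_internal_predicate(row.get(predicate_var, {}))
    ]
    return {"head": result["head"], "results": {"bindings": kept}}


# Two of the bindings from the result posted in this thread.
sample = {
    "head": {"vars": ["p", "o"]},
    "results": {
        "bindings": [
            {
                "o": {"type": "bnode", "value": "bn27771275"},
                "p": {
                    "type": "uri",
                    "value": "http://id.loc.gov/vocabulary/relators/aut",
                },
            },
            {
                "o": {"type": "literal", "value": "mit Abb.", "xml:lang": "de"},
                "p": {
                    "type": "literal",
                    "value": "@de@<http://rdaregistry.info/Elements/u/P60470>",
                },
            },
        ]
    },
}

cleaned = drop_internal_bindings(sample)
print(len(cleaned["results"]["bindings"]))
```

Matching on the value pattern rather than on `"type": "literal"` is a deliberate choice here, since the type label is described above as wrong and may change.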

Thanks for reporting this; we now have an issue to track it. It is also interesting to see how these triples currently interact with the JSON exporter. Please confirm that I have correctly identified your issue.

epoz commented 1 month ago

Thanks for the answer. Yes, I was wondering about the predicates being marked as literals in the SPARQL query results. The first time I saw something like this was a year ago, in the Linked Data published by a CMS vendor. I filed a bug report there, but never heard back from them. Now we know that they are using QLever as a triplestore "under the covers" ;-)

I am new to QLever. I ran some tests ingesting data last week and noticed the strange output. At first I thought something was wrong in my own data, but then I noticed it in the provided sample datasets as well (and I recalled seeing it elsewhere before, as mentioned).

Look forward to learning more about the internals later, good to know that this is an implementation artefact and will eventually disappear.

hannahbast commented 1 month ago

@epoz Interesting story, which CMS vendor was this? Is there an issue on their GitHub or was this some internal bug tracker?

epoz commented 1 month ago

It is a Dutch CMS for cultural heritage content (https://kleksi.com/nl/home). Their service is not open source and does not have a bug tracker. I submitted a request to their support@ mail address and later received an email reply from the owner saying that they would pass my query on to their tech department. I never heard back from them, and I did not consider it important enough to follow up.

This was the event where we were spelunking for interesting data sources.