Add DBPedia Direct Lookup

sfolsom commented 8 months ago

DBPedia's SPARQL Endpoint: https://dbpedia.org/sparql.

This might be too complicated: https://github.com/dbpedia/ontology-driven-api.

Likely want to be searching on a combination of rdfs:label and foaf:name. The rest of the modeling is too uneven across the different entity types to do anything general. I was considering dbo:abstract as a possibility for a display value, but the abstracts are too long to present to a cataloger in a lookup.

chrisrlc commented 8 months ago

Possible query: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. } WHERE { { ?uri rdfs:label ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } UNION { ?uri foaf:name ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } }

Lang can be set by the user in an optional lang= parameter in the QA endpoint, but will be set to "en" by default.
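As a rough illustration of how the search term and the optional lang value could be substituted into the query above, here is a minimal Python sketch. It is not the QA implementation: the function name `build_search_query` and the sanitizing rules are assumptions made for this example, and like the query above it relies on the endpoint predefining the rdfs:, foaf: and bif: prefixes.

```python
def build_search_query(term: str, lang: str = "en") -> str:
    """Build a CONSTRUCT query that matches `term` in rdfs:label or foaf:name."""
    # Crude sanitizing (an assumption of this sketch): drop quotes and
    # backslashes so the term cannot escape the quoted phrase.
    phrase = "".join(ch for ch in term if ch not in "'\"\\").strip()
    if not phrase:
        raise ValueError("search term is empty")
    if not lang.replace("-", "").isalnum():
        raise ValueError(f"unexpected language tag: {lang!r}")
    contains = "'\"" + phrase + "*\"'"  # e.g. '"Cristiano Ronaldo*"'

    def branch(predicate: str) -> str:
        # One side of the UNION: match the phrase on a single predicate.
        return (
            f"{{ ?uri {predicate} ?label "
            f"FILTER bif:contains(?label, {contains}) "
            f'FILTER (langMatches(lang(?label),"{lang}")) }}'
        )

    return (
        "CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. } "
        f"WHERE {{ {branch('rdfs:label')} UNION {branch('foaf:name')} }}"
    )
```

Calling `build_search_query("Cristiano Ronaldo")` uses the "en" default, while `build_search_query("Cristiano Ronaldo", lang="fr")` swaps the language tag in both branches.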

@sfolsom Does this sparql query return the entities you expect? Are there additional fields you'd like to display with the context=true parameter set in the QA endpoint?

sfolsom commented 8 months ago

This is starting to get beyond my SPARQL experience, but I'm wondering about how to have a single response for each URI. That might be as simple as constructing only the rdfs:label.

Does the query above require both the foaf:name and the rdfs:label to be present and identical? (I don't have a lot of experience with UNION, and usually if you use the same variable, like ?label, in different statements, the value has to be the same.)

re: context, maybe we should add dbo:abstract. It's the only property I've found that's used consistently over all entity types. If Sinopia or other apps think the abstracts are too long they don't have to use them, or could set the app up to display up to a character limit with an ellipsis.
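For the character-limit idea, a consuming app could do something along these lines. This is only a sketch of one possible approach: the helper name `truncate_abstract` and the 200-character default are assumptions for illustration, not anything Sinopia or QA defines.

```python
def truncate_abstract(abstract: str, limit: int = 200) -> str:
    """Shorten an abstract for display, ending with an ellipsis if it was cut."""
    if len(abstract) <= limit:
        return abstract
    # Cut at the limit, then back up to the last space so a word is not split.
    cut = abstract[:limit].rsplit(" ", 1)[0]
    # Trim trailing punctuation so the ellipsis does not follow a comma or period.
    return cut.rstrip(" ,;:.") + "..."
```

The full abstract still comes back from the lookup, so each app can pick its own limit or ignore the field.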

chrisrlc commented 8 months ago

The html table format for a CONSTRUCT query displays a row for each subject-predicate-object triple, so it'll look like the uri is listed multiple times, but the rdf-xml groups by uri correctly: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name. } WHERE { { ?uri rdfs:label ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } UNION { ?uri foaf:name ?name FILTER bif:contains(?name, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?name),"en")) } }

The UNION should just be working on the ?uri subject, so the number of distinct returned uris should be correct, but I've updated the query above to make the results differentiate properly between matching rdfs:label vs foaf:name - good call on that.
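To make the UNION behavior concrete, here is a toy Python model of the two cases. It is not how the endpoint evaluates queries: the `ex:` URIs and values are made up, and a plain substring test stands in for bif:contains. It only shows that UNION branches are evaluated separately, whereas the same two patterns without UNION would force the shared variable to hold the same value.

```python
RDFS_LABEL = "rdfs:label"
FOAF_NAME = "foaf:name"

# Toy data: (subject, predicate, object). All URIs are invented for illustration.
TRIPLES = [
    ("ex:a", RDFS_LABEL, "Cristiano Ronaldo"),
    ("ex:a", FOAF_NAME, "Cristiano Ronaldo dos Santos Aveiro"),
    ("ex:b", FOAF_NAME, "Cristiano Ronaldo"),  # has no rdfs:label at all
    ("ex:c", RDFS_LABEL, "Cristiano Ronaldo Jr."),
    ("ex:d", RDFS_LABEL, "Cristiano Ronaldo"),
    ("ex:d", FOAF_NAME, "Cristiano Ronaldo"),
]

def matches(predicate, term):
    """Solutions for one pattern: ?uri <predicate> ?label, filtered by substring."""
    return [{"uri": s, "label": o} for s, p, o in TRIPLES
            if p == predicate and term in o]

def union_query(term):
    # UNION: each branch is evaluated on its own and the solutions are concatenated.
    return matches(RDFS_LABEL, term) + matches(FOAF_NAME, term)

def join_query(term):
    # Both patterns in one group (no UNION): ?uri and ?label must agree in both.
    names = matches(FOAF_NAME, term)
    return [solution for solution in matches(RDFS_LABEL, term) if solution in names]

union_uris = sorted({s["uri"] for s in union_query("Cristiano Ronaldo")})
join_uris = [s["uri"] for s in join_query("Cristiano Ronaldo")]
```

In this toy data, `union_uris` covers all four subjects, including `ex:b` with only a foaf:name, while `join_uris` contains only `ex:d`, the one subject whose label and name are identical.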

chrisrlc commented 8 months ago

This has been deployed to lookup-int and is ready for the first round of testing. Example endpoint: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo

With context: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo&context=true
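For anyone scripting the test calls, the two URLs above can be generated with a small helper like this. The function name `lookup_url` is made up for the example; the `q`, `context` and `lang` parameter names are the ones mentioned in this thread.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct"

def lookup_url(term, context=False, lang=None):
    """Build a dbpedia_direct search URL for the lookup-int server."""
    params = {"q": term}
    if lang is not None:
        params["lang"] = lang
    if context:
        params["context"] = "true"
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use +).
    return BASE_URL + "?" + urlencode(params, quote_via=quote)
```

`lookup_url("Cristiano Ronaldo")` and `lookup_url("Cristiano Ronaldo", context=True)` reproduce the two example endpoints.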

It uses the following sparql query: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name; dbo:abstract ?abstract. } WHERE { { ?uri rdfs:label ?labelMatch FILTER(bif:contains(?labelMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?labelMatch),"en")) } UNION { ?uri foaf:name ?nameMatch FILTER(bif:contains(?nameMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?nameMatch),"en")) } OPTIONAL { ?uri rdfs:label ?label } OPTIONAL { ?uri foaf:name ?name } OPTIONAL {?uri dbo:abstract ?abstract } FILTER((!bound(?label) || langMatches(lang(?label),"en")) && (!bound(?name) || langMatches(lang(?name),"en")) && (!bound(?abstract) || langMatches(lang(?abstract),"en"))) }

The query phrase is searched across rdfs:label and foaf:name, but the label returned by the QA endpoint is rdfs:label only. Currently, this behavior excludes results that don't have an rdfs:label, e.g. https://dbpedia.org/page/Elche_CF__Cristiano_Ronaldo__1 and https://dbpedia.org/page/2002%E2%80%9303_Sporting_CP_season__Cristiano_Ronaldo__1. If this is not desired behavior, I can adjust this to either include both rdfs:label AND foaf:name in the Label value (e.g. "label": "[2008 FIFA Club World Cup squads, Aaron Scott, Adriano, Agustín Delgado, Ahmad El-Sayed, Ahmad S..."), or I can try to add some logic to set Label equal to either rdfs:label OR the first foaf:name if rdfs:label isn't available. Please let me know if you have a preference.
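The second option (rdfs:label OR the first foaf:name) could look roughly like the following. This is a sketch of the fallback rule only, not the QA code; the function name `choose_label` is hypothetical, and taking the first rdfs:label when there are several is an assumption of the example.

```python
from typing import List, Optional

def choose_label(rdfs_labels: List[str], foaf_names: List[str]) -> Optional[str]:
    """Pick the display label for one URI."""
    if rdfs_labels:
        # Prefer rdfs:label (taking the first one is an assumption of this sketch).
        return rdfs_labels[0]
    if foaf_names:
        # Fall back to the first foaf:name when no rdfs:label exists.
        return foaf_names[0]
    # Neither property is present, so there is nothing to display.
    return None
```

With this rule, entities that only have a foaf:name would no longer be dropped from the results.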

Adding context=true to the endpoint displays rdfs:label, foaf:name, and dbo:abstract.

The dbpedia_direct accuracy test cases (https://lookup-int.ld4l.org/check_status) are the same ones used for dbpedia_ld4l_cache, but a dbpedia_direct search for "volleyball" returns many more results because dbpedia_direct searches across different fields, so the expected position is thousands of records off. These queries also take a long time because they fetch many more records. We can adjust these test cases if these benchmarks aren't very useful.

Uses default QA sorting (alphabetic) because the sparql endpoint does not have a search relevancy field that we can sort by.

Please let me know if any of this behavior should be different.

chrisrlc commented 7 months ago

Updated lookup-int:

- Removed query truncation handling in the sparql query, but this doesn't seem to noticeably improve speed: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name; dbo:abstract ?abstract. } WHERE { { ?uri rdfs:label ?labelMatch FILTER(bif:contains(?labelMatch, '"Cristiano Ronaldo"') && langMatches(lang(?labelMatch),"en")) } UNION { ?uri foaf:name ?nameMatch FILTER(bif:contains(?nameMatch, '"Cristiano Ronaldo"') && langMatches(lang(?nameMatch),"en")) } OPTIONAL { ?uri rdfs:label ?label } OPTIONAL { ?uri foaf:name ?name } OPTIONAL {?uri dbo:abstract ?abstract } FILTER((!bound(?label) || langMatches(lang(?label),"en")) && (!bound(?name) || langMatches(lang(?name),"en")) && (!bound(?abstract) || langMatches(lang(?abstract),"en"))) }
  - Searching for "Volleyball" (which is trying to return 10,000+ uris) still results in a 504 for me due to timeout. I experimented with doubling the timeout in the sparql query (30000 ms --> 60000 ms), but this didn't resolve the "Volleyball" timeout. I can experiment with this more, but you also mentioned it could be fine to just include a note about possible timeouts for large result sets when setting up the Sinopia templates?
- Updated the accuracy test cases to remove "volleyball" and "volleyb", and added "Cristiano Ronaldo" and "amphora".
- Now defaulting to displaying foaf:name as the label in the json response if rdfs:label is not available.
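For reference, the timeout experiment amounts to changing one request parameter on the SPARQL endpoint URL. A minimal sketch, assuming the `format` and `timeout` parameter names used in the query links originally attached to this thread (`query` is the standard SPARQL protocol parameter); the helper name `sparql_request_url` is made up for the example:

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

def sparql_request_url(query, timeout_ms=30000, result_format="application/rdf+xml"):
    """Build a GET URL for the DBpedia SPARQL endpoint with a timeout in milliseconds."""
    params = {
        "query": query,
        "format": result_format,
        "timeout": str(timeout_ms),
    }
    # urlencode percent-encodes the query text and the format's "/" and "+".
    return SPARQL_ENDPOINT + "?" + urlencode(params)
```

Passing `timeout_ms=60000` reproduces the doubled timeout described above, which did not resolve the 504 for very large result sets.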

cul-it / qa_server

Add DBPedia Direct Lookup #389