cul-it / qa_server

A rails app with questioning authority gem installed to serve as a QA server.
Apache License 2.0

Add DBPedia Direct Lookup #389

Open sfolsom opened 8 months ago

sfolsom commented 8 months ago

DBPedia's SPARQL Endpoint: https://dbpedia.org/sparql.

This might be too complicated: https://github.com/dbpedia/ontology-driven-api.

Likely want to be searching on a combination of rdfs:label and foaf:name. The rest of the modeling is too uneven across the different entity types to do anything general. I was considering dbo:abstract as a possibility for a display value, but the abstracts are too long to present to a cataloger in a lookup.

chrisrlc commented 8 months ago

Possible query: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. } WHERE { { ?uri rdfs:label ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } UNION { ?uri foaf:name ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } }

Lang can be set by the user in an optional lang= parameter in the QA endpoint, but will be set to "en" by default.
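A rough sketch of how the search phrase and the lang value could be substituted into the query above. This is only an illustration in Python, not the actual QA configuration: the names build_query and QUERY_TEMPLATE are hypothetical, and it assumes the rdfs:, foaf:, and bif: prefixes are predefined by the endpoint, as the queries in this thread already rely on.

```python
import re
from string import Template

# Query shape copied from the comment above. $pattern and $lang are the
# only substituted parts (hypothetical placeholder names).
QUERY_TEMPLATE = Template(
    "CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?label. } "
    "WHERE { "
    "{ ?uri rdfs:label ?label "
    "FILTER bif:contains(?label, '$pattern') "
    'FILTER (langMatches(lang(?label),"$lang")) } '
    "UNION "
    "{ ?uri foaf:name ?label "
    "FILTER bif:contains(?label, '$pattern') "
    'FILTER (langMatches(lang(?label),"$lang")) } '
    "}"
)


def build_query(term, lang="en"):
    """Return the SPARQL text for a search phrase; lang defaults to "en"."""
    # Accept only simple language tags so the value cannot break out of
    # the quoted string in the query.
    if not re.fullmatch(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*", lang):
        raise ValueError(f"unexpected language tag: {lang!r}")
    # Replace quotes and backslashes with spaces, then collapse whitespace.
    cleaned = term.translate(str.maketrans({'"': " ", "'": " ", "\\": " "}))
    phrase = " ".join(cleaned.split())
    if not phrase:
        raise ValueError("empty search phrase")
    # The phrase is wrapped in double quotes with a trailing wildcard,
    # matching the '"Cristiano Ronaldo*"' pattern used above.
    return QUERY_TEMPLATE.substitute(pattern=f'"{phrase}*"', lang=lang)
```

Calling build_query("Cristiano Ronaldo") reproduces the query text above, and build_query("Cristiano Ronaldo", lang="de") swaps only the language tag.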

@sfolsom Does this sparql query return the entities you expect? Are there additional fields you'd like to display with the context=true parameter set in the QA endpoint?

sfolsom commented 8 months ago

This is starting to get beyond my SPARQL experience, but I'm wondering about how to have a single response for each URI. That might be as simple as constructing only the rdfs:label.

Does the query above require both the foaf:name and rdfs:label to be present and the same? (I don't have a lot of experience with UNION, and usually if you use the same variable in different statements, like ?label, the value has to be the same.)

re: context, maybe we should add dbo:abstract. It's the only property I've found that's used consistently over all entity types. If Sinopia or other apps think the abstracts are too long they don't have to use them, or could set the app up to display up to a character limit with an ellipsis.
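For what it's worth, the character-limit idea could look something like the sketch below on the consuming side. This is just an illustration in Python: truncate_abstract and the default limit of 200 are made-up names and values, not anything Sinopia or QA provides.

```python
def truncate_abstract(abstract, limit=200):
    """Shorten an abstract to at most `limit` characters plus an ellipsis."""
    if len(abstract) <= limit:
        return abstract
    cut = abstract[:limit]
    # Prefer to break at the last whitespace so a word is not split.
    parts = cut.rsplit(None, 1)
    head = parts[0] if len(parts) == 2 else cut
    # Drop trailing punctuation before appending the ellipsis.
    return head.rstrip(" ,;:.") + "..."
```

Abstracts at or under the limit are returned unchanged, so short entries are not affected.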

chrisrlc commented 8 months ago

The HTML table format for a CONSTRUCT query displays one row per subject-predicate-object triple, so it looks like the URI is listed multiple times, but the RDF/XML output groups by URI correctly: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name. } WHERE { { ?uri rdfs:label ?label FILTER bif:contains(?label, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?label),"en")) } UNION { ?uri foaf:name ?name FILTER bif:contains(?name, '"Cristiano Ronaldo*"') FILTER (langMatches(lang(?name),"en")) } }

The UNION should just be working on the ?uri subject, so the number of distinct returned uris should be correct, but I've updated the query above to make the results differentiate properly between matching rdfs:label vs foaf:name - good call on that.
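To illustrate why the triple-per-row view repeats the URI while the grouped view does not, here is a small Python sketch that collapses triples into one record per subject. The function name group_by_subject is hypothetical, and the sample triples are made-up placeholders rather than actual endpoint output.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Collapse (subject, predicate, object) triples into one dict per subject."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {s: dict(props) for s, props in grouped.items()}


# Placeholder triples: one URI matched through both rdfs:label and foaf:name.
sample = [
    ("http://example.org/resource/A", "rdfs:label", "Example A"),
    ("http://example.org/resource/A", "foaf:name", "Example A"),
    ("http://example.org/resource/A", "foaf:name", "A. Example"),
]
records = group_by_subject(sample)
```

Here records has a single key for the one URI, with its rdfs:label and foaf:name values kept in separate lists.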

chrisrlc commented 8 months ago

This has been deployed to lookup-int and is ready for the first round of testing. Example endpoint: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo

With context: https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct?q=Cristiano%20Ronaldo&context=true
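For anyone scripting their testing, a minimal Python sketch for building these endpoint URLs. The helper name search_url is made up for illustration; the q, context, and lang parameters are the ones described in this thread.

```python
from urllib.parse import quote, urlencode

BASE = "https://lookup-int.ld4l.org/authorities/search/linked_data/dbpedia_direct"


def search_url(q, context=False, lang=None):
    """Build a dbpedia_direct search URL for the lookup-int server."""
    params = {"q": q}
    if lang:
        params["lang"] = lang
    if context:
        params["context"] = "true"
    # quote_via=quote encodes spaces as %20, matching the example URLs.
    return BASE + "?" + urlencode(params, quote_via=quote)
```

search_url("Cristiano Ronaldo") and search_url("Cristiano Ronaldo", context=True) give the two example URLs above.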

It uses the following SPARQL query: CONSTRUCT { ?uri rdfs:label ?label; foaf:name ?name; dbo:abstract ?abstract. } WHERE { { ?uri rdfs:label ?labelMatch FILTER(bif:contains(?labelMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?labelMatch),"en")) } UNION { ?uri foaf:name ?nameMatch FILTER(bif:contains(?nameMatch, '"Cristiano Ronaldo*"') && langMatches(lang(?nameMatch),"en")) } OPTIONAL { ?uri rdfs:label ?label } OPTIONAL { ?uri foaf:name ?name } OPTIONAL {?uri dbo:abstract ?abstract } FILTER((!bound(?label) || langMatches(lang(?label),"en")) && (!bound(?name) || langMatches(lang(?name),"en")) && (!bound(?abstract) || langMatches(lang(?abstract),"en"))) }

The query phrase is searched across rdfs:label and foaf:name, but the label returned by the QA endpoint is rdfs:label only. Currently, this behavior excludes results that don't have an rdfs:label, e.g. https://dbpedia.org/page/Elche_CF__Cristiano_Ronaldo__1 and https://dbpedia.org/page/2002%E2%80%9303_Sporting_CP_season__Cristiano_Ronaldo__1. If this is not desired behavior, I can adjust this to either include both rdfs:label AND foaf:name in the Label value (e.g. "label": "[2008 FIFA Club World Cup squads, Aaron Scott, Adriano, Agustín Delgado, Ahmad El-Sayed, Ahmad S..."), or I can try to add some logic to set Label equal to either rdfs:label OR the first foaf:name if rdfs:label isn't available. Please let me know if you have a preference.
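The second option (rdfs:label, falling back to the first foaf:name) could be sketched roughly as follows. This is a Python illustration of the logic only, not the QA implementation; choose_label and its arguments are hypothetical names.

```python
def choose_label(labels, names):
    """Pick the display label: rdfs:label if present, else the first foaf:name.

    `labels` and `names` are lists of strings for one URI; either may be empty.
    Returns None when neither property has a value.
    """
    if labels:
        return labels[0]
    if names:
        return names[0]
    return None
```

With this fallback, entities that only have a foaf:name would still appear in the results instead of being excluded.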

Adding context=true to the endpoint displays rdfs:label, foaf:name, and dbo:abstract.

The dbpedia_direct accuracy test cases (https://lookup-int.ld4l.org/check_status) are the same ones used for dbpedia_ld4l_cache, but a dbpedia_direct search for "volleyball" returns many more results because dbpedia_direct searches across different fields, so the expected position is off by thousands of records. These queries also take a long time because they fetch many more records. We can adjust these test cases if the benchmarks aren't very useful.

Results use the default QA sorting (alphabetic) because the SPARQL endpoint does not have a search relevancy field that we can sort by.

Please let me know if any of this behavior should be different.

chrisrlc commented 7 months ago

Updated lookup-int: