ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0
416 stars 52 forks

New GND endpoint -> now named DNB (Deutsche Nationalbibliothek) #1180

Open brunnerpaul opened 11 months ago

brunnerpaul commented 11 months ago

Many thanks, @hannahbast and @joka921, for the swift implementation of the GND dataset during the Wikidata Data Modelling days!

So far, all queries at https://qlever.cs.uni-freiburg.de/gnd result in the error Unexpected token '<', "<!DOCTYPE "... is not valid JSON.

According to #1171 that might just be a generic error message. Could you please have a look into what's the cause here?
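For context, a message of this form is what a JavaScript JSON parser reports when it is handed an HTML page instead of JSON. The following is a minimal Python sketch of how a client could detect that case and report something more descriptive. The names `parse_sparql_json` and `BackendUnavailableError` are hypothetical and not part of the QLever UI.

```python
import json


class BackendUnavailableError(RuntimeError):
    """Raised when the server answered with something other than JSON."""


def parse_sparql_json(body: str, content_type: str = "") -> dict:
    """Parse a SPARQL JSON result, or fail with a descriptive message."""
    looks_like_html = body.lstrip().lower().startswith(("<!doctype", "<html"))
    if looks_like_html or "html" in content_type.lower():
        # A web server or proxy often serves an HTML error page when the
        # backend behind it cannot be reached.
        raise BackendUnavailableError(
            "Received an HTML page instead of JSON; the backend may be down."
        )
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        raise BackendUnavailableError(f"Response is not valid JSON: {exc}") from exc
```

Checking the body (and, if available, the `Content-Type` header) before parsing keeps the generic parser error from reaching the user.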

hannahbast commented 11 months ago

@brunnerpaul Thanks for the reminder! That just means that the backend is down (the suboptimal error message is a temporary bug in the QLever UI; we are currently working on a refactoring PR concerning this). Here is what happened:

During the meeting last week, I set up an instance for a selection of files from https://data.dnb.de/opendata, which worked fine. After the meeting, I tried to set up an instance for all the data I could find on https://data.dnb.de/opendata, namely:

curl -L -C - --remote-name-all https://data.dnb.de/opendata/authorities-gnd_lds.nt.gz https://data.dnb.de/opendata/bib_lds.nt.gz https://data.dnb.de/opendata/dnb-all_lds.nt.gz https://data.dnb.de/opendata/dnb-all_ldsprov.nt.gz https://data.dnb.de/opendata/zdb_lds.nt.gz

Unfortunately, it turned out that some of these files are not formatted correctly, and like most SPARQL engines, QLever refuses to index data that is not formatted correctly. Two questions:

  1. Which files should we index? If only a subset of the above, why only a subset?

  2. Wouldn't the name dnb be more appropriate for the instance than gnd? The many abbreviations used on the site are really confusing (dnb, gnd, lds, zdb, ...)

brunnerpaul commented 11 months ago

  1. Which files should we index? If only a subset of the above, why only a subset?

I've added what I know about the specific datasets here:

https://data.dnb.de/opendata/authorities-gnd_lds.nt.gz 1.9G
Stable link to the current full dump of the GND in RDF (N-Triples) format

AFAIK, this is the most commonly used dataset: authority files for persons, institutions, places, and thesauri. It should have the highest priority.

https://data.dnb.de/opendata/bib_lds.nt.gz 4.7M
Stable link to the current full dump of the address file (ISIL and library sigla directory) in RDF (N-Triples) format

IMO, this can be left out; it's a kind of address book of partner libraries.

https://data.dnb.de/opendata/dnb-all_lds.nt.gz 4.5G
Stable link to the current full dump of the DNB title data in RDF (N-Triples) format

This is bibliographic data, covering all the books of the German National Library, I think. It would be nice if QLever could offer it, since its size makes the file difficult to process on smaller machines, but this has lower priority.

https://data.dnb.de/opendata/dnb-all_ldsprov.nt.gz 1.2G
Stable link to the current full dump of the metadata provenance for the DNB title data in RDF (N-Triples) format

If the bibliographic data is offered, this should also be offered. Also lower priority.

https://data.dnb.de/opendata/zdb_lds.nt.gz 549M
Stable link to the current full dump of the ZDB title data in RDF (N-Triples) format

More bibliographic data, magazines only. Also lower priority.

  2. Wouldn't the name dnb be more appropriate for the instance than gnd? The many abbreviations used on the site are really confusing (dnb, gnd, lds, zdb, ...)

Yes, good point. "DNB" (Deutsche Nationalbibliothek) as the data provider makes more sense as a name, especially if you want to add other datasets in the future. People working in GLAM mostly use the GND dataset (and call it "GND") but it would be good to keep the instance more general-purpose and have that reflected in the name "DNB".

hannahbast commented 11 months ago

Thanks, Paul, that was very helpful indeed.

I have now indexed all the files you listed, except bib_lds.nt.gz because that contains malformed IRIs. Good that you say that it's not important and can be left out. The file dnb-all_ldsprov.nt.gz contains several invalid floating point literals, but QLever has an option to ignore those, which I used.
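For anyone who wants to locate such problems before indexing, here is a rough Python sketch of a line-by-line pre-check. It is an assumption-laden simplification, not how QLever validates input: it only tests IRIs against the character restrictions of the N-Triples grammar and `xsd:float`/`xsd:double` literals against their lexical form. The function names `find_problems` and `scan` are hypothetical.

```python
import gzip
import re

# IRIREF per the N-Triples grammar: no control characters or spaces and none
# of < > " { } | ^ ` \ between the angle brackets (\uXXXX escapes are allowed).
IRI_RE = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
)
# Lexical form of xsd:double and xsd:float (simplified).
DOUBLE_RE = re.compile(
    r'[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?|[+-]?INF|NaN'
)
XSD = "http://www.w3.org/2001/XMLSchema#"
TYPED_LITERAL_RE = re.compile(
    r'"((?:[^"\\]|\\.)*)"\^\^<' + re.escape(XSD) + r'(float|double)>'
)
LITERAL_RE = re.compile(r'"(?:[^"\\]|\\.)*"')
CANDIDATE_IRI_RE = re.compile(r'<[^>]*>')


def find_problems(line: str) -> list[str]:
    """Return descriptions of suspicious IRIs and float literals in one line."""
    problems = []
    for match in TYPED_LITERAL_RE.finditer(line):
        if not DOUBLE_RE.fullmatch(match.group(1)):
            problems.append(
                f"invalid xsd:{match.group(2)} literal: {match.group(1)!r}"
            )
    # Blank out quoted strings so that '<' inside a literal is not mistaken
    # for the start of an IRI.
    without_literals = LITERAL_RE.sub('""', line)
    for candidate in CANDIDATE_IRI_RE.findall(without_literals):
        if not IRI_RE.fullmatch(candidate):
            problems.append(f"malformed IRI: {candidate}")
    return problems


def scan(path: str):
    """Yield (line number, problem) pairs for a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as lines:
        for number, line in enumerate(lines, start=1):
            for problem in find_problems(line):
                yield number, problem
```

Calling `scan("bib_lds.nt.gz")` on a downloaded file would stream through it without unpacking it to disk. What exactly is malformed in the DNB files is not stated in this thread, so the checks may or may not catch it.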

The instance is now live under https://qlever.cs.uni-freiburg.de/dnb . A few interesting example queries would be welcome (you can just post them in reply to this issue if you have any).
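As a starting point for example queries, here is a minimal Python sketch that sends a SPARQL query to the instance over HTTP. The API URL `https://qlever.cs.uni-freiburg.de/api/dnb` is an assumption based on the URL pattern of other public QLever instances and is not confirmed in this thread. The query deliberately avoids any DNB-specific vocabulary and just counts triples per predicate.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the API endpoint mirrors the UI path; adjust if it differs.
ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/dnb"

# Generic exploratory query: the 20 most frequent predicates in the dataset.
QUERY = """
SELECT ?p (COUNT(?s) AS ?count) WHERE {
  ?s ?p ?o
}
GROUP BY ?p
ORDER BY DESC(?count)
LIMIT 20
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as the `query` parameter of a SPARQL GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def run_query(query: str, endpoint: str = ENDPOINT, timeout: float = 60) -> dict:
    """Send the query and return the parsed SPARQL JSON result."""
    request = Request(
        build_request_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    result = run_query(QUERY)
    for binding in result["results"]["bindings"]:
        print(binding["p"]["value"], binding["count"]["value"])
```

Listing the predicates first is a convenient way to find out which properties are worth building more specific queries on.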

I have also added a Qleverfile for whoever wants to host an instance themselves: https://github.com/ad-freiburg/qlever-control/blob/python-qlever/Qleverfiles/Qleverfile.dnb

hannahbast commented 11 months ago

@brunnerpaul Does it work for you now?

brunnerpaul commented 11 months ago

Works great, thanks a lot!

I'll put together some sample queries and post them here. I have a few queries that I could now combine into a single, more useful query because QLever can just process it in one go.