Create statistics on property usage

jneubert commented 6 years ago

As suggested by @kcoyle in https://github.com/dcmi/usage/issues/19#issuecomment-397354515 , we should create statistics of property usage in different large-scale dataset collections.

If we create a script and store its output in a file in this repository, the file could be used to gather historical usage information (with all the uncertainties regarding changes within the collections, and the probable unavailability of certain collections at some point in the future).

A simple format could be CSV, with one row per property in some defined order, and a column for each collection.
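A minimal sketch of that layout in Python, assuming the counts are already available as a dictionary keyed by (property, collection). The function name `write_usage_csv` and the dictionary shape are hypothetical; the two dct:creator figures are the ones reported further down in this thread, and a missing count is written as an empty cell.

```python
import csv
import io


def write_usage_csv(counts, properties, collections, out):
    """Write one row per property and one column per collection.

    counts maps (property, collection) -> int; a missing entry
    (e.g. a query that timed out) becomes an empty cell.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["property"] + list(collections))
    for prop in properties:  # rows follow the given, fixed property order
        row = [prop]
        for coll in collections:
            value = counts.get((prop, coll))
            row.append("" if value is None else value)
        writer.writerow(row)


counts = {
    ("dct:creator", "LOD4ALL"): 10950,
    ("dct:creator", "Openlink"): 2100823,
}
buffer = io.StringIO()
write_usage_csv(counts, ["dct:creator", "dct:subject"], ["LOD4ALL", "Openlink"], buffer)
print(buffer.getvalue())
```

Keeping the property order fixed between runs should make diffs between versions of the file easy to read.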

jneubert commented 6 years ago

For the LOD4ALL collection, this query could be used, parameterized with values for ?property.

Several restrictions apply for LOD4ALL:

Timeout of 60 sec (occurs for dct:subject)
Maximum of 100,000 triples (seems to be hit for dct:contributor)
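The linked query itself is not reproduced in this thread. As a rough sketch only, a count query parameterized by property might be built like this in Python; the function name `count_query` and the exact query shape are assumptions, not the query behind the link.

```python
DCT = "http://purl.org/dc/terms/"


def count_query(property_iri):
    """Return a SPARQL query counting all triples that use property_iri.

    Assumed shape; the query actually used for LOD4ALL may differ.
    """
    return (
        "SELECT (COUNT(*) AS ?count)\n"
        "WHERE {\n"
        f"  ?s <{property_iri}> ?o .\n"
        "}"
    )


print(count_query(DCT + "subject"))
```

Writing the full IRI into the query keeps it independent of whatever prefix declarations an endpoint provides by default.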

jneubert commented 6 years ago

@tombaker, @paulwalk Perhaps you have a simple list of all properties somewhere as configuration file for hugo?

tombaker commented 6 years ago

@tombaker, @paulwalk Perhaps you have a simple list of all properties somewhere as configuration file for hugo?

See https://github.com/dcmi/purls/blob/master/purls-handled-automatically-by-partial-redirects.txt

jneubert commented 6 years ago

@tombaker Thanks! I plan to create a simple Perl script in the next few days that iterates over the list.
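The planned script is Perl; purely for illustration, the same loop can be sketched in Python. Everything named here (`run_count_query`, `parse_count`, `collect_counts`, the endpoint URL in the usage note) is hypothetical, and the property IRIs are assumed to have been extracted from the list beforehand. The request follows the standard SPARQL protocol (HTTP GET with a `query` parameter, JSON results).

```python
import json
import urllib.parse
import urllib.request


def parse_count(results_json):
    """Extract the single ?count value from a SPARQL JSON result document."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0]["count"]["value"])


def run_count_query(endpoint_url, query, timeout=60):
    """Send the query via HTTP GET and return the count as an int."""
    url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))


def collect_counts(property_iris, endpoints, run=run_count_query):
    """Iterate over all properties and endpoints; failures are stored as None."""
    counts = {}
    for iri in property_iris:
        query = f"SELECT (COUNT(*) AS ?count) WHERE {{ ?s <{iri}> ?o }}"
        for name, url in endpoints.items():
            try:
                counts[(iri, name)] = run(url, query)
            except Exception:  # e.g. timeout or endpoint unavailable
                counts[(iri, name)] = None
    return counts
```

A call such as `collect_counts(iris, {"example": "https://example.org/sparql"})` would return a dictionary keyed by (property IRI, store name). Recording a failure as `None` rather than 0 keeps a timeout distinguishable from a property that is genuinely unused.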

jneubert commented 6 years ago

For the Openlink dataset, this query can be used.

The dataset was created as a demo for OpenLink's Virtuoso software. I have not found information about its composition, updates, etc.

jneubert commented 6 years ago

Strange: for dct:creator, the queries linked above count only 10,950 triples in LOD4ALL, but a more plausible 2,100,823 in Openlink.

kcoyle commented 6 years ago

I was poking around in lod4all trying to get a sense of the size of the data that it accesses, when I came across a dataset using "dcterms:" instead of "dct:". I don't know if there are other variations, but we'll need to get as many as we can.

Meanwhile, I'm still trying to understand that difference in counts. I'll report back.

kcoyle commented 6 years ago

OK, for reasons I cannot explain, changing the query for LOD4ALL to "dcterms" got me 2,100,823. I mean, I can sort of explain it, but it still makes me wonder about the content/extent of the data we are querying against.

jneubert commented 6 years ago

Hi Karen, I could not reproduce your last finding (re. the dcterms: prefix). I suppose ZBW's "SPARQL Lab" environment tricked you by directing the query not against LOD4ALL but against Openlink (it would be extremely unlikely to find exactly the same number of triples in both datasets).

Let me explain a bit how it works (more in blog, code): The environment allows you to create links that take as parameters the query file (a versioned file somewhere on GitHub) and the endpoint against which the query is directed. The environment makes it easy to edit the actual query, but changing the endpoint requires a change to the URL parameter. Both parameters are displayed on the second line of the web page, above the query form, but are not very prominent.

tombaker commented 6 years ago

@kcoyle @jneubert FWIW, note that the prefix 'dc:' is mapped to http://purl.org/dc/terms/ in the RDFa Core Initial Context. prefix.cc also confirms that dc: is used for /terms/.

jneubert commented 6 years ago

@tombaker This 'dc:' in the RDFa context is a really bad thing, because 'dc:' is widely used for /elements/1.1/. It would have been better to leave it out of the RDFa context definitions. With our use of dct: we avoid that ambiguity. The web interface of LOD4ALL also includes dct: for /terms/ as a default.

Anyway, prefix mappings should not affect the query results, as long as the correct namespace URI for the prefix is provided with the query.
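To illustrate the point, here is a small Python sketch of prefix expansion. The helper `expand` is hypothetical; the namespace URIs are the two Dublin Core namespaces discussed above.

```python
def expand(curie, prefixes):
    """Expand a prefixed name such as 'dct:creator' into a full IRI."""
    prefix, local = curie.split(":", 1)
    return prefixes[prefix] + local


TERMS = "http://purl.org/dc/terms/"
ELEMENTS = "http://purl.org/dc/elements/1.1/"

# Three different prefixes bound to the same namespace give the same IRI,
# so a query using any of them matches the same triples.
print(expand("dct:creator", {"dct": TERMS}))
print(expand("dcterms:creator", {"dcterms": TERMS}))
print(expand("dc:creator", {"dc": TERMS}))

# The same prefix bound to the other namespace gives a different property.
print(expand("dc:creator", {"dc": ELEMENTS}))
```

Only the last case changes what is counted, which is why an unexpected binding for 'dc:' matters more than the choice between dct: and dcterms:.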

kcoyle commented 6 years ago

If you look at the lod4all statistics page it's pretty clear that this dataset is skewed to dbpedia, wikidata, and some oecd files. I don't think it represents DC use in general. That said, some of the stats are interesting ... even though I couldn't begin to interpret what they mean.

osma commented 6 years ago

I'm looking at creating statistics from LOD-a-lot. Unfortunately their LDF endpoint seems to be down so the only way to query the data is to download it locally and access it via HDT tools. It's not that huge (524GB total for the HDT and index files) but needs a machine with enough RAM (more than 16GB) and disk space (~1TB) and I'm struggling to find one that has both at the same time...

osma commented 6 years ago

Update: I found a machine with enough resources. Currently downloading the LOD-a-lot data, which will take a few hours. @jneubert can you provide me with the SPARQL queries you used for the statistics?

osma commented 6 years ago

@jneubert Never mind, I see they are generated by the Perl script you just submitted. So I will try to set up a local Fuseki/HDT endpoint and try to execute the script against that.

jneubert commented 6 years ago

@osma That would be fantastic. The script should be able to cover it just by adding a new store + endpoint.

jneubert commented 6 years ago

Current statistics (now including LOV - Linked Open Vocabularies) are here (class, property)

osma commented 6 years ago

Here are the class stats from LOD-a-lot (merged into the previous stats from other sources). For many classes the counts seem to be higher, often much higher, than those from other sources. E.g. there are 31M BibliographicResource instances in LOD-a-lot, when LOD4ALL had 6k and others had zero. But this does not hold for all classes: e.g. LCC and LCSH show up in LOD4ALL but not in LOD-a-lot.

Still waiting for the property stats, some of the queries are rather slow...

jneubert commented 6 years ago

Just added value type stats (IRI/Literal/Blank) from the Openlink store. @osma Please replace Openlink with LOD-a-lot when you're ready; that seems to be by far the most complete one.
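The value type query is linked rather than shown here. One way such a breakdown could be expressed, sketched in Python with the SPARQL as a string, is below; the function name `value_type_query` and the query shape are assumptions, using only the standard SPARQL 1.1 functions `isIRI` and `isBlank`.

```python
def value_type_query(property_iri):
    """Return a SPARQL query grouping a property's values by node type.

    Assumed shape, not the linked query: every value that is neither an
    IRI nor a blank node is counted as a literal.
    """
    return (
        "SELECT ?valueType (COUNT(*) AS ?count)\n"
        "WHERE {\n"
        f"  ?s <{property_iri}> ?o .\n"
        '  BIND(IF(isIRI(?o), "IRI", IF(isBlank(?o), "Blank", "Literal")) AS ?valueType)\n'
        "}\n"
        "GROUP BY ?valueType"
    )


print(value_type_query("http://purl.org/dc/terms/creator"))
```

The result has up to three rows per property, one for each value type that actually occurs.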

osma commented 6 years ago

Thanks for the value type query and stats! Started running it against LOD-a-lot.

osma commented 6 years ago

Here are the property stats including LOD-a-lot. Apparently it's not always the most complete source: Openlink has higher counts for many properties.

jneubert commented 6 years ago

Perhaps we should add a max column ...
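A sketch of how such a column could be derived from the existing table, assuming rows of the form [property, count per collection, ...] with empty cells for missing counts. The helper `add_max_column` is hypothetical; the sample row uses the dct:creator counts quoted earlier in the thread.

```python
def add_max_column(header, rows):
    """Append a 'max' column holding the highest count in each row.

    Empty cells (missing counts) are ignored; a row without any count
    gets an empty 'max' cell.
    """
    new_header = header + ["max"]
    new_rows = []
    for row in rows:
        values = [int(cell) for cell in row[1:] if cell != ""]
        new_rows.append(row + [str(max(values)) if values else ""])
    return new_header, new_rows


header = ["property", "LOD4ALL", "Openlink"]
rows = [["dct:creator", "10950", "2100823"], ["dct:subject", "", ""]]
print(add_max_column(header, rows))
```

The maximum is only a lower bound on total usage, since the collections overlap to an unknown degree.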

osma commented 6 years ago

Here are the value type stats with both Openlink and LOD-a-lot sources. The stats look quite different to me; apparently the mix of data sets is different.

osma commented 6 years ago

I started wondering if it would be possible to make a federated SPARQL query that takes the union of LOD-a-lot and Openlink and counts each triple only once even if it appears in both data sets. I could run it on my local Fuseki with the LOD-a-lot data. I think this would be especially interesting for the property value type stats where it's clear the different aggregations include different source data sets.

osma commented 6 years ago

Sorry, hit the wrong button :-/

jneubert commented 6 years ago

Merged result sets from both sources would be great. In my experience, querying the external service first or in a separate subquery without existing bindings is essential for performance. Perhaps you could check it out with one infrequent and one frequent property, and see how it works.
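A rough, untested sketch of what such a federated query might look like when run on the local endpoint, again as a Python string builder. The function name `union_count_query` and the remote endpoint URL in the example are placeholders, and the remote pattern is wrapped in its own subquery, following the advice above.

```python
def union_count_query(property_iri, remote_endpoint):
    """Return a SPARQL 1.1 federated query counting distinct triples
    for one property across the local store and a remote endpoint.

    Assumptions: the local store is the default graph of the endpoint
    running the query, and the remote endpoint returns all matches.
    """
    pattern = f"?s <{property_iri}> ?o"
    return (
        "SELECT (COUNT(*) AS ?count)\n"
        "WHERE {\n"
        "  {\n"
        "    SELECT DISTINCT ?s ?o\n"
        "    WHERE {\n"
        f"      {{ SERVICE <{remote_endpoint}> {{ SELECT ?s ?o WHERE {{ {pattern} }} }} }}\n"
        "      UNION\n"
        f"      {{ {pattern} }}\n"
        "    }\n"
        "  }\n"
        "}"
    )


print(union_count_query("http://purl.org/dc/terms/creator", "https://example.org/sparql"))
```

Two caveats: blank nodes cannot be matched across stores, so triples involving them would still be counted twice, and any result limit on the remote side (like the 100,000 triple maximum noted for LOD4ALL) would make the count too low for frequent properties.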

tombaker commented 6 years ago

@osma @jneubert If this work is complete for now, could we save the results, along with a small README.txt, somewhere on this repo? If the results are only linked to and discussed in this thread, it will be easy to lose track of them, whereas we would ideally look for a way to monitor usage on an ongoing basis.
