ScotGovAnalysis / opendatascot

An R package to pull data from statistics.gov.scot into R
https://scotgovanalysis.github.io/opendatascot/
MIT License

Consider tightening the definition of dataset #89

Closed: RickMoynihan closed this issue 3 years ago

RickMoynihan commented 5 years ago

This query lists all datasets on the statistics.gov.scot site. However, it also includes datasets that the rest of the functions in the library, which seem predominantly focused on datacubes, can't handle. For example, this dataset is a zip file.

It might be worth thinking about how you want to handle this, to ensure users can list the datasets the API can work with. Perhaps you want two procedures, one to list all cubes and another to list all datasets? Or one to list all cubes, and another to list all non-cubes?

e.g. you could write:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

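# Count the distinct datasets that are typed as data cubes
SELECT (COUNT(DISTINCT ?URI) AS ?count)
WHERE {
  ?URI a qb:DataSet ;
       rdfs:label ?Name .
  # Publisher is optional, so datasets without one are still matched
  OPTIONAL {
    ?URI dcterms:publisher ?Pub .
    ?Pub rdfs:label ?Publisher .
  }
}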
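
The two-procedure idea could be sketched roughly as follows. This is only an illustration in Python, not the package's code: the function names list_cubes_query and list_non_cubes_query are hypothetical, and the class IRI for "any dataset" is left as a parameter because the thread does not say which class the site uses.

```python
# Shared prefixes, taken from the query in this thread.
PREFIXES = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""


def list_cubes_query() -> str:
    """Build a query listing only datasets typed as qb:DataSet (data cubes)."""
    return PREFIXES + """
SELECT DISTINCT ?URI ?Name ?Publisher
WHERE {
  ?URI a qb:DataSet ;
       rdfs:label ?Name .
  OPTIONAL {
    ?URI dcterms:publisher ?Pub .
    ?Pub rdfs:label ?Publisher .
  }
}
ORDER BY ?URI
"""


def list_non_cubes_query(dataset_class: str) -> str:
    """Build a query listing datasets of `dataset_class` that are not cubes.

    `dataset_class` is the full IRI of the site's general dataset class.
    Which IRI that is remains an assumption the caller must supply.
    """
    return PREFIXES + f"""
SELECT DISTINCT ?URI ?Name
WHERE {{
  ?URI a <{dataset_class}> ;
       rdfs:label ?Name .
  FILTER NOT EXISTS {{ ?URI a qb:DataSet }}
}}
ORDER BY ?URI
"""


if __name__ == "__main__":
    # Placeholder IRI for illustration only; substitute the real class.
    print(list_non_cubes_query("http://example.org/def#Dataset"))
```

FILTER NOT EXISTS keeps the non-cube listing as the exact complement of the cube listing, so a dataset should appear in one list or the other but not both.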

Incidentally, I also noticed that one dataset seems to be missing the publisher metadata. You might want to investigate why that is or, if it's legitimate, consider putting the publisher metadata in an OPTIONAL as above.
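
To make the effect of OPTIONAL concrete, here is a toy model in Python. The dataset and publisher names are made up for illustration, and the two functions only mimic how a required pattern and an OPTIONAL pattern behave. They are not how a SPARQL engine is implemented.

```python
# Made-up data: "cube-b" deliberately has no publisher recorded.
datasets = ["cube-a", "cube-b", "cube-c"]
publishers = {"cube-a": "Publisher A", "cube-c": "Publisher C"}


def required_join(datasets, publishers):
    """Mimic a required publisher pattern: rows with no publisher are dropped."""
    return [(d, publishers[d]) for d in datasets if d in publishers]


def optional_join(datasets, publishers):
    """Mimic OPTIONAL: every dataset is kept, a missing publisher becomes None."""
    return [(d, publishers.get(d)) for d in datasets]


if __name__ == "__main__":
    print(required_join(datasets, publishers))
    print(optional_join(datasets, publishers))
```

With the required pattern the dataset lacking a publisher silently disappears from the listing, while the OPTIONAL version keeps it with an unbound publisher.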

GordonBryden commented 3 years ago

Agreed on both counts. I have limited the function to linked datasets by selecting on their rdf:type, as this seemed the simplest way to get the desired behaviour, and I've added the publisher as optional metadata, exactly as described.