ONSdigital / csvcubed

A CLI to build linked data cubes.
https://gss-cogs.github.io/csvcubed-docs/external/
Apache License 2.0

Define relative-uris for base-uris in place of file:// #61

Closed: canwaf closed this 2 years ago

canwaf commented 3 years ago

First, check whether we can use the base-uri property in CSV-W to generate absolute URIs from relative URIs. If we can, proceed by creating a Jenkins pipeline issue to allow configuration of the base-uri property. If we can't, chuck this over the fence to csv2rdf as a feature request.

The basic test case is whether csv2rdf breaks when the base-uri contains a URL: if it tries to find a remote CSV then it breaks; if it uses the local CSV then it works.

Investigate whether csv2rdf and csvlint follow the CSV-W spec.

robons commented 3 years ago

So I did some testing on this and I'll describe my findings below:

Attempting to use the @base to set the document's Base URL

Given the following CSV:

Period
year/2020
month/2020-01

And the following metadata JSON document:

{
    "@context": ["http://www.w3.org/ns/csvw", {"@base": "https://example.org/"}],
    "@id": "#dataset",
    "tables": [
        {
            "url": "problem.csv",
            "tableSchema": {
                "columns": [
                    {
                        "titles": "Period",
                        "name": "period",
                        "propertyUrl": "#dimension/period",
                        "valueUrl": "http://reference.data.gov.uk/id/{+period}"
                    }
                ],
                "aboutUrl": "http://some/observation/{+period}"
            }
        }
    ]
}

I get the following output from csv2rdf:

csv2rdf -u problem.csv-metadata.json -m annotated
13:23:14.328 [main] ERROR csv2rdf.main - #error {
 :cause clj-http: status 404
 :data {:request-time 993, :repeatable? false, :protocol-version {:name HTTP, :major 1, :minor 1}, :streaming? true, :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase Not Found, :headers {Cache-Control max-age=604800, Content-Type text/html; charset=UTF-8, Date Tue, 13 Jul 2021 13:23:14 GMT, Expires Tue, 20 Jul 2021 13:23:14 GMT, Server EOS (vny/044F), Vary Accept-Encoding, Connection close}, :orig-content-encoding nil, :status 404, :length -1, :body <!doctype html>
<html>
<head>
    <title>Example Domain</title>

    ... html continues - removed by robons for brevity ...

, :trace-redirects []}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message clj-http: status 404
   :data {:request-time 993, :repeatable? false, :protocol-version {:name HTTP, :major 1, :minor 1}, :streaming? true, :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase Not Found, :headers {Cache-Control max-age=604800, Content-Type text/html; charset=UTF-8, Date Tue, 13 Jul 2021 13:23:14 GMT, Expires Tue, 20 Jul 2021 13:23:14 GMT, Server EOS (vny/044F), Vary Accept-Encoding, Connection close}, :orig-content-encoding nil, :status 404, :length -1, :body <!doctype html>
<html>
<head>
    <title>Example Domain</title>

    ... html continues - removed by robons for brevity ...

, :trace-redirects []}
   :at [slingshot.support$stack_trace invoke support.clj 201]}]
 :trace
 [[slingshot.support$stack_trace invoke support.clj 201]
  [clj_http.client$exceptions_response invokeStatic client.clj 239]

  ... again, edited for brevity ...

  [csv2rdf.main main nil -1]]}

csv2rdf is evidently trying to resolve the CSV file at https://example.org/problem.csv, which it naturally fails to do.

What the CSV-W Spec Says

The CSV-W spec confirms that this is the expected behaviour:

A table description might contain:

Example 5

"url": "example-2014-01-03.csv"

in which case the url property on the table would have a single value, a link to example-2014-01-03.csv, resolved against the base URL of the metadata document in which this was located. For example if the metadata document contained:

Example 6

"@context": [ "http://www.w3.org/ns/csvw", { "@base": "http://example.org/" }]

this is equivalent to specifying:

Example 7

"url": "http://example.org/example-2014-01-03.csv"

The CSV-W spec also confirms that the base URL is not used as a base for relative URIs defined in the JSON-LD metadata:

Note that the @base property of the @context object provides the base URL used for URLs within the metadata document, not the URLs that appear as data within the group of tables or table it describes. URI template properties are not resolved against this base URL: they are resolved against the URL of the table.

I.e. if we define the following CSV-W metadata JSON file:

{
    "@context": "http://www.w3.org/ns/csvw",
    "@id": "#dataset",
    "tables": [
        {
            "url": "this-should-be-where-uris-are-relative-to.csv",
            "tableSchema": {
                "columns": [
                    {
                        "titles": "Period",
                        "name": "period",
                        "propertyUrl": "#dimension/period",
                        "valueUrl": "http://reference.data.gov.uk/id/{+period}"
                    }
                ],
                "aboutUrl": "http://some/observation/{+period}"
            }
        }
    ]
}

Then we get the following RDF output from csv2rdf:

<http://some/observation/year/2020> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
    <http://reference.data.gov.uk/id/year/2020> .

<http://some/observation/month/2020-01> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
    <http://reference.data.gov.uk/id/month/2020-01> .

Conclusions

@rossbowen @ajtucker what are your thoughts on how we want to proceed here?

rossbowen commented 3 years ago

Thanks @robons for the writeup! The @base + url + relative_uri behaviour is in line with what I'd thought, so that's good.

So in theory, if we hosted a CSV file at http://gss-data.org.uk/data/ + some-dataset-name, then some #relative JSON-LD would resolve to http://gss-data.org.uk/data/some-dataset-name#relative.

Something comes to mind about how, when running tests, you can spoof URLs to make it seem like resources exist when really they're held locally. That might be a bit of a stretch here though, and it might be easier to ask Swirrl whether csv2rdf might be able to do that for us.

robons commented 3 years ago

So in theory, if we hosted a CSV file at http://gss-data.org.uk/data/ + some-dataset-name, then some #relative JSON-LD would resolve to http://gss-data.org.uk/data/some-dataset-name#relative.

If we hosted a CSV file at http://gss-data.org.uk/data/some-dataset-name.csv then the RDF statements generated by columns would be like: http://gss-data.org.uk/data/some-dataset-name.csv#relative and the JSON-LD metadata URLs would point at http://gss-data.org.uk/data/some-dataset-name.csv-metadata.json#relative.

We can add some more consistency, though: I've done some work in #97 which ensures that the JSON-LD and column URIs both point to the CSV instead of the metadata JSON file.

Something comes to mind about how, when running tests, you can spoof URLs to make it seem like resources exist when really they're held locally. That might be a bit of a stretch here though, and it might be easier to ask Swirrl whether csv2rdf might be able to do that for us.

Yes, we could make a proxy which resolved these URIs to the local files; however, it'd be simpler to implement that inside csv2rdf than to do it on our end.

Given that you can have multiple tables defined inside a given CSV-W, I do wonder whether it'd be simpler for us to write a tool that did the mappings we wanted instead of asking Swirrl to alter csv2rdf to take the mappings we want to perform. So shall I create a task to create a Python tool to do this mapping? I already have the aptly named PMD project for tools like this.
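
To make that concrete, here's a minimal sketch of the sort of mapping tool being proposed (a hypothetical map_uris.py; both URIs below are illustrative assumptions, and a real tool would need to handle multiple tables):

# map_uris.py (hypothetical sketch): rewrite the local file: base in
# csv2rdf's N-Triples/Turtle output so URIs point at the published location.
import sys

LOCAL_BASE = "file:/workspace/some-dataset-name.csv"              # assumed local URI
PUBLISHED_BASE = "http://gss-data.org.uk/data/some-dataset-name"  # assumed published URI

for line in sys.stdin:
    sys.stdout.write(line.replace(LOCAL_BASE, PUBLISHED_BASE))

e.g. csv2rdf -u some-dataset-name.csv-metadata.json -m annotated | python3 map_uris.py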

rossbowen commented 3 years ago

If we hosted a CSV file at http://gss-data.org.uk/data/some-dataset-name.csv then the RDF statements generated by columns would be like: http://gss-data.org.uk/data/some-dataset-name.csv#relative and the JSON-LD metadata URLs would point at http://gss-data.org.uk/data/some-dataset-name.csv-metadata.json#relative.

Ah sure - so what I'm unsure of at this point is whether, if we placed the CSV and the metadata in the right locations, we could make the URIs look how we want by running the normal csv2rdf algorithm. We could place the CSV file at http://gss-data.org.uk/data/some-dataset-name and then the relative URIs defined in the CSVW columns would look how we liked, but other JSON-LD wouldn't because it would still have .csv-metadata.json in the URI.

I don't support any sort of find-and-replace-style post-processing for changing URIs in the Turtle output - I think it would be better to write absolute URIs in the CSVW than do that.

robons commented 3 years ago

We could place the CSV file at http://gss-data.org.uk/data/some-dataset-name and then the relative URIs defined in the CSVW columns would look how we liked, but other JSON-LD wouldn't because it would still have .csv-metadata.json in the URI.

I can make both the column and JSON-LD URIs point to the CSV for consistency.

I don't support any sort of find-and-replace-style post-processing for changing URIs in the Turtle output - I think it would be better to write absolute URIs in the CSVW than do that.

So taking this as our direction, we'll need to provide some ability for the user to specify the base URI they want to use for all new URIs generated. I just want to check we're happy with this direction, @ajtucker?

ajtucker commented 3 years ago

The URI resolution algo should help us here, e.g. http://example.com/data/set/obs.csv-metadata.json resolved with codelists/local-area#abc ends up looking like http://example.com/data/set/codelists/local-area#abc.

ajtucker commented 3 years ago

See https://datatracker.ietf.org/doc/html/rfc3986#section-5.2.3
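
For what it's worth, Python's urllib.parse.urljoin implements the RFC 3986 reference-resolution algorithm, so the example above can be checked directly:

from urllib.parse import urljoin

base = "http://example.com/data/set/obs.csv-metadata.json"
print(urljoin(base, "codelists/local-area#abc"))
# http://example.com/data/set/codelists/local-area#abc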

robons commented 3 years ago

The URI resolution algo should help us here, e.g. http://example.com/data/set/obs.csv-metadata.json resolved with codelists/local-area#abc ends up looking like http://example.com/data/set/codelists/local-area#abc.

@ajtucker, given that we're going to be loading CSV-Ws that haven't necessarily been published, we're trying to make the decision as to whether we do a find-and-replace style operation on the local file:/some/csv-w.csv-metadata.json URI to convert it to the absolute http://example.com/... URI or whether we just drop relative URIs and put the http://example.com/... URI directly into the CSV-W.

Do you have any thoughts on this?

ajtucker commented 3 years ago

I'd recommend we test out the assumptions above about how csv2rdf works with remote resources: use a simple python3 -m http.server 8080 to serve the files locally from the filesystem, use relative URIs in the -metadata.json, and run csv2rdf -t http://localhost:8080/codelists/blah.csv -u http://localhost:8080/codelists/blah.csv-metadata.json -m annotated

Then, yes, just pipe the output through something like sed 's|file:///blah/blah/|http://gss-data.org.uk/blah/blah/|' to achieve the result we want. I don't think csv2rdf does anything too clever w.r.t. namespace prefixes in the Turtle output, but if it does, we can always pipe through a Turtle-to-NTriples converter and run sed on that output.

rossbowen commented 3 years ago

Thinking out loud, there are some downsides to me.

The other thing I wouldn't mind a second pair of eyes on is whether, by piping through sed, we get an output equivalent to what we would have gotten if we had placed the CSV and the metadata files at the right locations. Consider Rob's example from above:

{
    "@context": "http://www.w3.org/ns/csvw",
    "@id": "#dataset",
    "tables": [
        {
            "url": "this-should-be-where-uris-are-relative-to.csv",
            "tableSchema": {
                "columns": [
                    {
                        "titles": "Period",
                        "name": "period",
                        "propertyUrl": "#dimension/period",
                        "valueUrl": "http://reference.data.gov.uk/id/{+period}"
                    }
                ],
                "aboutUrl": "http://some/observation/{+period}"
            }
        }
    ]
}

...which csv2rdf converts to the following Turtle:

<http://some/observation/year/2020> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
    <http://reference.data.gov.uk/id/year/2020> .

<http://some/observation/month/2020-01> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
    <http://reference.data.gov.uk/id/month/2020-01> .

If we use sed 's|file:/workspace/this-should-be-where-uris-are-relative-to.csv|http://gss-data.org.uk/data/some-dataset-name|' we get the output:

<http://some/observation/year/2020> <http://gss-data.org.uk/data/some-dataset-name#dimension/period>
    <http://reference.data.gov.uk/id/year/2020> .

<http://some/observation/month/2020-01> <http://gss-data.org.uk/data/some-dataset-name#dimension/period>
    <http://reference.data.gov.uk/id/month/2020-01> .

Which is the output we're after. But what I'm not sure of yet is whether we could get that from the same CSVW, without use of sed, if we placed the exact same CSVW metadata and CSV file at the right locations on the web.

@robons I'm looking at the issue you've linked here but it looks to me like _doc_rel_uri() will put the file extension .csv into URIs because self.csv_file_name has .csv in it.

ajtucker commented 3 years ago

Re: QA -- agreed. It's a shame that the RDF output has to use absolute URIs.

I still reckon that a good first step would be to serve the files locally at http://localhost:8080/blah to check that the URI resolution process does what you expect with both file:///some/dir/structure/ and http://localhost:8080/some/path/.

I was going to propose firing up something like WireMock and setting that as a web proxy for csv2rdf, but then remembered that getting JVM apps to use a web proxy can be tricky. You can set the system properties http.proxyHost and http.proxyPort, but really you'd want it to take into account the de facto standard of setting the environment variable http_proxy.

I reckon that the proof wouldn't be any stronger with "real" URIs as opposed to "http://localhost" ones, but that it might make you feel better :)

canwaf commented 3 years ago

@ajtucker, you say we have to use absolute URLs in RDF. Can we not just use a prefix? For QA it'd be assigned to the local resources, but when we move to publish, the prefix gets updated to point to where the data ultimately ends up. This might make @rossbowen's QA easier as well.

ajtucker commented 3 years ago

I've not got too far in figuring out how to get csv2rdf to use an HTTP proxy. The following should work, but doesn't seem to:

docker run -v $PWD:/workspace -w /workspace -it gsscogs/csv2rdf java -Dlog4j2.configurationFile=/usr/local/share/log4j2.xml -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8080 -jar /usr/local/share/java/csv2rdf.jar

At least, that's according to the Java docs and what I'm guessing the Clojure HTTP library should do.

However, when I try it with a @base pointing to http://example.com, it still goes off and fetches the resource from example.com directly, ignoring the proxy.

robons commented 3 years ago

Right, so I just started a local webserver on my MacBook and ran the following command:

csv2rdf -u 'http://192.168.1.30/some-qube.csv-metadata.json' -m annotated -o output.ttl

With a document using URIs like some-qube.csv#dimension/d-code-list, I get output like the following:

<http://192.168.1.30/some-qube.csv#component/d-code-list> a <http://purl.org/linked-data/cube#ComponentSet>,
    <http://purl.org/linked-data/cube#ComponentSpecification>;
  <http://purl.org/linked-data/cube#componentProperty> <http://192.168.1.30/some-qube.csv#dimension/d-code-list>;
  <http://purl.org/linked-data/cube#dimension> <http://192.168.1.30/some-qube.csv#dimension/d-code-list> .

See attached for all files concerned.

output.ttl is the result of the stated command above. https://app.zenhub.com/files/374709947/70f81a88-3027-4b29-830b-5c6c8309cfa2/download

N.B. with URIs like .#dimension/d-code-list we end up with DSD URIs like http://192.168.1.30/#dimension/a-code-list, but we end up with CSV-derived URIs looking like http://192.168.1.30/some-qube.csv#dimension/d-code-list. So as long as we consistently point URIs at the CSV file, we shouldn't have any real problems (unless the user renames the file).

edit: #dimension/d-code-list and /#dimension/d-code-list URIs don't work any better either.
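
These resolutions can be reproduced with Python's urllib.parse.urljoin (the JSON-LD URIs resolve the same way, only against the metadata document's URL rather than the CSV's):

from urllib.parse import urljoin

csv_url = "http://192.168.1.30/some-qube.csv"
print(urljoin(csv_url, "#dimension/d-code-list"))   # http://192.168.1.30/some-qube.csv#dimension/d-code-list
print(urljoin(csv_url, ".#dimension/d-code-list"))  # http://192.168.1.30/#dimension/d-code-list
print(urljoin(csv_url, "/#dimension/d-code-list"))  # http://192.168.1.30/#dimension/d-code-list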

rossbowen commented 3 years ago

Summary: finally got a bit of a reprex working (which works from within a Docker env too, so I can use VS Code's devcontainer stuff). Here I can host files and set some config through /etc/hosts to mimic the files being available at http://data.gov.uk/.

LOCAL_WORKSPACE_FOLDER=$PWD

docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--name http-server \
--add-host=data.gov.uk:0.0.0.0 \
-d python:latest \
python3 -m http.server 80

docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--network container:http-server \
--rm -it gsscogs/csv2rdf \
csv2rdf -u http://data.gov.uk/statistics/df.csv-metadata.json -m annotated -o output.ttl

docker stop http-server
docker rm http-server

So, say we had our CSV and CSVW metadata inside a folder named statistics.

./
├─ statistics/
│  ├─ df.csv
│  ├─ df.csv-metadata.json

Then with a trivial CSV:

one,two,three,four
1,2,3,4

...and CSVW metadata:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "df.csv",
  "@id": "example-data",
  "http://ex.org": [
      {"@id": "a"}, 
      {"@id": "/b"}, 
      {"@id": "../c"}, 
      {"@id": "#d"}
  ],
  "tableSchema": {
    "columns": [
      {
        "name": "one",
        "datatype": "integer",
        "propertyUrl": "w"
      },
      {
        "name": "two",
        "datatype": "integer",
        "propertyUrl": "/x"
      },
      {
        "name": "three",
        "datatype": "integer",
        "propertyUrl": "../y"
      },
      {
        "name": "four",
        "datatype": "integer",
        "propertyUrl": "#z"
      }
    ]
  }
}

Then we get the following output:

# jsonld stuff
<http://data.gov.uk/statistics/example-data> <http://ex.org> <http://data.gov.uk/b>,
    <http://data.gov.uk/c>, <http://data.gov.uk/statistics/a>, <http://data.gov.uk/statistics/df.csv-metadata.json#d> .

# csvw stuff
_:bnode__1 <http://data.gov.uk/statistics/df.csv#z> 4;
  <http://data.gov.uk/statistics/w> 1;
  <http://data.gov.uk/x> 2;
  <http://data.gov.uk/y> 3 .
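
This is plain RFC 3986 resolution: the JSON-LD @ids resolve against the metadata document's URL, while the column propertyUrls resolve against the table (CSV) URL. A quick cross-check with Python's urljoin reproduces the output above:

from urllib.parse import urljoin

metadata_url = "http://data.gov.uk/statistics/df.csv-metadata.json"
csv_url = "http://data.gov.uk/statistics/df.csv"

# jsonld @ids resolve against the metadata document's URL
for ref in ["a", "/b", "../c", "#d"]:
    print(urljoin(metadata_url, ref))
# http://data.gov.uk/statistics/a
# http://data.gov.uk/b
# http://data.gov.uk/c
# http://data.gov.uk/statistics/df.csv-metadata.json#d

# column propertyUrls resolve against the table (CSV) URL
for ref in ["w", "/x", "../y", "#z"]:
    print(urljoin(csv_url, ref))
# http://data.gov.uk/statistics/w
# http://data.gov.uk/x
# http://data.gov.uk/y
# http://data.gov.uk/statistics/df.csv#z
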
robons commented 3 years ago

@rossbowen to do a final test to ensure that ../path works inside a 2-deep directory structure (making it more portable than the domain.com/path approach).

rossbowen commented 3 years ago

Summary is, it's behaving how we'd expect any file system to behave.

Folder structure:

./
├─ one/
│  ├─ two/
│  │  ├─ three/
│  │  │  ├─ df.csv
│  │  │  ├─ df.csv-metadata.json

JSON-LD in CSVW:

  "http://ex.org": [
      {"@id": "../a"}, 
      {"@id": "../../b"}
      {"@id": "../../../c"}
  ],

Columns spec in CSVW:

    "columns": [
      {
        "name": "one",
        "datatype": "integer",
        "propertyUrl": "../x"
      },
      {
        "name": "two",
        "datatype": "integer",
        "propertyUrl": "../../y"
      },
      {
        "name": "three",
        "datatype": "integer",
        "propertyUrl": "../../../z"
      }
    ]

With the following docker command:

docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--network container:http-server \
--rm -it gsscogs/csv2rdf \
csv2rdf -u http://data.gov.uk/statistics/one/two/three/df.csv-metadata.json -m annotated -o output.ttl

Returns Turtle looking like:

<http://data.gov.uk/statistics/one/two/three/example-data> <http://ex.org> <http://data.gov.uk/statistics/c>,
    <http://data.gov.uk/statistics/one/b>, <http://data.gov.uk/statistics/one/two/a> .

_:bnode__1 <http://data.gov.uk/statistics/one/two/x> 1;
  <http://data.gov.uk/statistics/one/y> 2;
  <http://data.gov.uk/statistics/z> 3 .
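
The same urljoin cross-check reproduces the multi-level ../ resolutions (@ids against the metadata document's URL, propertyUrls against the CSV's URL):

from urllib.parse import urljoin

metadata_url = "http://data.gov.uk/statistics/one/two/three/df.csv-metadata.json"
for ref in ["../a", "../../b", "../../../c"]:
    print(urljoin(metadata_url, ref))
# http://data.gov.uk/statistics/one/two/a
# http://data.gov.uk/statistics/one/b
# http://data.gov.uk/statistics/c
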
rossbowen commented 3 years ago

I'm still prepping an example using content negotiation to see whether it's possible to craft fragment (#) URIs without the .csv in them.

robons commented 2 years ago

@rossbowen We now have RDF uploaded to PMD. I've left it as-is with the .csv bits still in place. It would be good to have a firm idea of where we're going with these URIs now that we can start uploading real data to PMD.

robons commented 2 years ago

We've decided that the URIs we're currently using are good enough for now since they allow users to independently publish CSV-Ws with URIs which successfully dereference to a sensible document (the CSV). The RDF definitions of said terms can be generated by locating the conventionally named JSON-LD metadata file and converting the whole CSV-W to RDF.

We'll look to add support for users to request URIs which don't make use of hash fragments, so that where they have a suitable server in place, it can return information at the most granular level when the user dereferences a URI.