So I did some testing on this and I'll describe my findings below:
Using @base to set the document's base URL

Given the following CSV:
Period
year/2020
month/2020-01
And the following metadata JSON document:
{
  "@context": ["http://www.w3.org/ns/csvw", {"@base": "https://example.org/"}],
  "@id": "#dataset",
  "tables": [
    {
      "url": "problem.csv",
      "tableSchema": {
        "columns": [
          {
            "titles": "Period",
            "name": "period",
            "propertyUrl": "#dimension/period",
            "valueUrl": "http://reference.data.gov.uk/id/{+period}"
          }
        ],
        "aboutUrl": "http://some/observation/{+period}"
      }
    }
  ]
}
I get the following output from csv2rdf:
csv2rdf -u problem.csv-metadata.json -m annotated
13:23:14.328 [main] ERROR csv2rdf.main - #error {
:cause clj-http: status 404
:data {:request-time 993, :repeatable? false, :protocol-version {:name HTTP, :major 1, :minor 1}, :streaming? true, :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase Not Found, :headers {Cache-Control max-age=604800, Content-Type text/html; charset=UTF-8, Date Tue, 13 Jul 2021 13:23:14 GMT, Expires Tue, 20 Jul 2021 13:23:14 GMT, Server EOS (vny/044F), Vary Accept-Encoding, Connection close}, :orig-content-encoding nil, :status 404, :length -1, :body <!doctype html>
<html>
<head>
<title>Example Domain</title>
... html continues - removed by robons for brevity ...
, :trace-redirects []}
:via
[{:type clojure.lang.ExceptionInfo
:message clj-http: status 404
:data {:request-time 993, :repeatable? false, :protocol-version {:name HTTP, :major 1, :minor 1}, :streaming? true, :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase Not Found, :headers {Cache-Control max-age=604800, Content-Type text/html; charset=UTF-8, Date Tue, 13 Jul 2021 13:23:14 GMT, Expires Tue, 20 Jul 2021 13:23:14 GMT, Server EOS (vny/044F), Vary Accept-Encoding, Connection close}, :orig-content-encoding nil, :status 404, :length -1, :body <!doctype html>
<html>
<head>
<title>Example Domain</title>
... html continues - removed by robons for brevity ...
, :trace-redirects []}
:at [slingshot.support$stack_trace invoke support.clj 201]}]
:trace
[[slingshot.support$stack_trace invoke support.clj 201]
[clj_http.client$exceptions_response invokeStatic client.clj 239]
... again, edited for brevity ...
[csv2rdf.main main nil -1]]}
csv2rdf is evidently trying to resolve the CSV file at https://example.org/problem.csv, which it naturally fails to do.
The CSV-W spec confirms that this is the expected behaviour:
A table description might contain:
Example 5
"url": "example-2014-01-03.csv"
in which case the url property on the table would have a single value, a link to example-2014-01-03.csv, resolved against the base URL of the metadata document in which this was located. For example if the metadata document contained:
Example 6
"@context": [ "http://www.w3.org/ns/csvw", { "@base": "http://example.org/" }]
this is equivalent to specifying:
Example 7
"url": "http://example.org/example-2014-01-03.csv"
The CSV-W spec also confirms that the base URL is not used as a base for relative URIs defined in json-ld metadata:
Note that the @base property of the @context object provides the base URL used for URLs within the metadata document, not the URLs that appear as data within the group of tables or table it describes. URI template properties are not resolved against this base URL: they are resolved against the URL of the table.
i.e. if we define the following CSV-W metadata json file:
{
  "@context": "http://www.w3.org/ns/csvw",
  "@id": "#dataset",
  "tables": [
    {
      "url": "this-should-be-where-uris-are-relative-to.csv",
      "tableSchema": {
        "columns": [
          {
            "titles": "Period",
            "name": "period",
            "propertyUrl": "#dimension/period",
            "valueUrl": "http://reference.data.gov.uk/id/{+period}"
          }
        ],
        "aboutUrl": "http://some/observation/{+period}"
      }
    }
  ]
}
Then we get the following RDF output from csv2rdf:
<http://some/observation/year/2020> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
<http://reference.data.gov.uk/id/year/2020> .
<http://some/observation/month/2020-01> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
<http://reference.data.gov.uk/id/month/2020-01> .
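Those file:/workspace URIs are just the propertyUrl template resolved against the table's URL, per RFC 3986. A quick way to reproduce the resolution outside csv2rdf (a sketch using Python's urllib.parse.urljoin, not csv2rdf's own code):

from urllib.parse import urljoin

# The table URL csv2rdf derived for the local CSV file.
table_url = "file:/workspace/this-should-be-where-uris-are-relative-to.csv"

# propertyUrl templates resolve against the table URL, not against @base.
print(urljoin(table_url, "#dimension/period"))
# -> file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period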
So the @base property which can be defined in the CSV-W is not a property which allows us to set the base URL we want the JSON-LD metadata to be defined relative to. It looks like we could either:

- alter the csv2rdf application to allow us to specify what the JSON-LD metadata's base URL should be, or
- add some post-csv2rdf process which finds-and-replaces all occurrences of file:///workspace/... with an alternative base URL.

@rossbowen @ajtucker what are your thoughts on how we want to proceed here?
Thanks @robons for the writeup! The @base + url + relative_uri is in line with what I'd thought so that's good.
So in theory, if we hosted a CSV file at http://gss-data.org.uk/data/ + some-dataset-name, then some #relative JSON-LD would resolve to http://gss-data.org.uk/data/some-dataset-name#relative.
Something comes to mind about how when running tests you can spoof URLs to make it seem like resources exist when really they're held locally - though that might be a bit of a stretch here and it might be easier to ask Swirrl if csv2rdf might be able to do that for us.
> So in theory, if we hosted a CSV file at http://gss-data.org.uk/data/ + some-dataset-name, then some #relative JSON-LD would resolve to http://gss-data.org.uk/data/some-dataset-name#relative.
If we hosted a CSV file at http://gss-data.org.uk/data/some-dataset-name.csv then the RDF statements generated by columns would be like http://gss-data.org.uk/data/some-dataset-name.csv#relative, and the JSON-LD metadata URLs would point at http://gss-data.org.uk/data/some-dataset-name.csv-metadata.json#relative.
We can add some more consistency though; I've done some work in #97 which ensures that the JSON-LD & columns URIs both point to the CSV instead of the metadata JSON file.
> Something comes to mind about how when running tests you can spoof URLs to make it seem like resources exist when really they're held locally - though that might be a bit of a stretch here and it might be easier to ask Swirrl if csv2rdf might be able to do that for us.
Yes, we could make a proxy which resolves these URIs to the local files; however, it'd be simpler to implement that inside csv2rdf than to do it on our end.
Given that you can have multiple tables defined inside a given CSV-W, I do wonder whether it'd be simpler for us to write a tool that does the mappings we want instead of asking Swirrl to alter csv2rdf to take the mappings we want to perform. So shall I create a task to create a Python tool to do this mapping? I already have the aptly named PMD project for tools like this.
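For illustration, a minimal sketch of what such a mapping tool might look like (the mapping values here are hypothetical, not an agreed design):

from pathlib import Path

# Hypothetical mapping from local base URIs to their published equivalents.
MAPPINGS = {
    "file:/workspace/this-should-be-where-uris-are-relative-to.csv":
        "http://gss-data.org.uk/data/some-dataset-name",
}

def remap_uris(in_path: Path, out_path: Path) -> None:
    """Find-and-replace base URIs in csv2rdf's Turtle/N-Triples output."""
    text = in_path.read_text(encoding="utf-8")
    for local_uri, published_uri in MAPPINGS.items():
        text = text.replace(local_uri, published_uri)
    out_path.write_text(text, encoding="utf-8")

remap_uris(Path("output.ttl"), Path("output-remapped.ttl"))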
> If we hosted a CSV file at http://gss-data.org.uk/data/some-dataset-name.csv then the RDF statements generated by columns would be like http://gss-data.org.uk/data/some-dataset-name.csv#relative, and the JSON-LD metadata URLs would point at http://gss-data.org.uk/data/some-dataset-name.csv-metadata.json#relative.
Ah sure - so what I'm unsure of at this point is whether, if we could place the CSV and the metadata in the right locations, we can make the URIs look how we want by running the normal csv2rdf algorithm. We could place the CSV file at http://gss-data.org.uk/data/some-dataset-name and then the relative URIs defined in the CSVW columns would look how we liked, but other JSON-LD wouldn't because it would still have .csv-metadata.json in the URI.
I don't support any sort of find-and-replace style post processing for changing URIs in the turtle output - I think it would be better to write absolute URIs in the CSVW than do that.
> We could place the CSV file at http://gss-data.org.uk/data/some-dataset-name and then the relative URIs defined in the CSVW columns would look how we liked, but other JSON-LD wouldn't because it would still have .csv-metadata.json in the URI.
I can make both the column and JSON-LD URIs point to the CSV for consistency.
> I don't support any sort of find-and-replace style post processing for changing URIs in the turtle output - I think it would be better to write absolute URIs in the CSVW than do that.
So taking this as our direction, we'll need to provide some ability for the user to specify the base URI they want to use for all new URIs generated. I just want to check we're happy with this direction @ajtucker?
The URI resolution algo should help us here, e.g. http://example.com/data/set/obs.csv-metadata.json resolved with codelists/local-area#abc ends up looking like http://example.com/data/set/codelists/local-area#abc.
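You can sanity-check that resolution with Python's urllib.parse.urljoin, which follows the same RFC 3986 rules:

from urllib.parse import urljoin

print(urljoin("http://example.com/data/set/obs.csv-metadata.json",
              "codelists/local-area#abc"))
# -> http://example.com/data/set/codelists/local-area#abc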
> The URI resolution algo should help us here, e.g. http://example.com/data/set/obs.csv-metadata.json resolved with codelists/local-area#abc ends up looking like http://example.com/data/set/codelists/local-area#abc.
@ajtucker, given that we're going to be loading CSV-Ws that haven't necessarily been published, we're trying to make the decision as to whether we do a find-and-replace style operation on the local file:/some/csv-w.csv-metadata.json URI to convert it to the absolute http://example.com/... URI, or whether we just drop relative URIs and put the http://example.com/... URI directly into the CSV-W.
Do you have any thoughts on this?
I'd recommend we test out the assumptions above with how csv2rdf works with remote resources: use a simple python3 -m http.server 8080 to serve the files locally from the filesystem, use relative URIs in the -metadata.json, and run csv2rdf -t http://localhost:8080/codelists/blah.csv -u http://localhost:8080/codelists/blah.csv-metadata.json -m annotated.

Then, yes, just pipe the output through something like sed 's|file:///blah/blah/|http://gss-data.org.uk/blah/blah/|' to achieve the result we want. I don't think csv2rdf does anything too clever wrt. namespace prefixes in the Turtle output, but if it does, we can always pipe through a Turtle to NTriples converter and run sed on that output.
Thinking out loud, downsides to me are:

- Seeing file://whatever in the output is a way of QAing whether the transformation process has gone correctly or not - I've definitely used the fact I've spotted those in output ttl to know something's gone wrong. I wouldn't get those if we pipe via sed. Something to consider.
- We couldn't just run csv2rdf locally and get to what the output will look like without piping through sed.

The other thing I wouldn't mind a second pair of eyes on would be checking that by piping through sed we get an equivalent output to what we would have gotten if we placed the CSV and the metadata files at the right location. Consider from Rob's example above:
{
  "@context": "http://www.w3.org/ns/csvw",
  "@id": "#dataset",
  "tables": [
    {
      "url": "this-should-be-where-uris-are-relative-to.csv",
      "tableSchema": {
        "columns": [
          {
            "titles": "Period",
            "name": "period",
            "propertyUrl": "#dimension/period",
            "valueUrl": "http://reference.data.gov.uk/id/{+period}"
          }
        ],
        "aboutUrl": "http://some/observation/{+period}"
      }
    }
  ]
}
<http://some/observation/year/2020> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
<http://reference.data.gov.uk/id/year/2020> .
<http://some/observation/month/2020-01> <file:/workspace/this-should-be-where-uris-are-relative-to.csv#dimension/period>
<http://reference.data.gov.uk/id/month/2020-01> .
If we use sed 's|file:/workspace/this-should-be-where-uris-are-relative-to.csv|http://gss-data.org.uk/data/some-dataset-name|' we get the output:
<http://some/observation/year/2020> <http://gss-data.org.uk/data/some-dataset-name#dimension/period>
<http://reference.data.gov.uk/id/year/2020> .
<http://some/observation/month/2020-01> <http://gss-data.org.uk/data/some-dataset-name#dimension/period>
<http://reference.data.gov.uk/id/month/2020-01> .
Which is the output we're after. But what I'm not sure of yet is whether we could get that from the same CSVW, without use of sed, if you placed the exact same CSVW metadata and CSV file at the right locations on the web.
@robons I'm looking at the issue you've linked here, but it looks to me like _doc_rel_uri() will put the file extension .csv into URIs because self.csv_file_name has .csv in it.
Re: QA -- agreed. It's a shame that the RDF output has to use absolute URIs.
I still reckon that a good first step would be to serve the files locally at http://localhost:8080/blah to check that the URI resolution process does what you expect with both file:///some/dir/structure/ and http://localhost:8080/some/path/.
I was going to propose firing up something like WireMock and setting that as a web proxy for csv2rdf, but then remembered that getting JVM apps to use a web proxy can be tricky. You can set the system properties http.proxyHost and http.proxyPort, but really you'd want it to take into account the de facto standard of setting the environment variable http_proxy.
I reckon that the proof wouldn't be any stronger with "real" URIs as opposed to "http://localhost" ones, but that it might make you feel better :)
@ajtucker, you say you have to use absolute URLs in RDF. Can we not just use a prefix? For the QA it'll be assigned to the local resources, but when we move to publish, that prefix gets updated to where it ultimately ends up. This might make @rossbowen's QA easier as well.
I've not got too far in figuring out how to get csv2rdf to use an HTTP proxy. The following should work, but doesn't seem to:
docker run -v $PWD:/workspace -w /workspace -it gsscogs/csv2rdf java -Dlog4j2.configurationFile=/usr/local/share/log4j2.xml -Dhttp.proxyHost=localhost -Dhttp.proxyPort=8080 -jar /usr/local/share/java/csv2rdf.jar
At least according to the Java docs and what I'm guessing the Clojure HTTP library should do.
However, when I try it with a @base pointing to http://example.com, it still goes off and fetches the resource from Example Domain.
Right, so I just started a local webserver on my MacBook and ran the following command:
csv2rdf -u 'http://192.168.1.30/some-qube.csv-metadata.json' -m annotated -o output.ttl
With a document using URIs like some-qube.csv#dimension/d-code-list, I get output like the following:
<http://192.168.1.30/some-qube.csv#component/d-code-list> a <http://purl.org/linked-data/cube#ComponentSet>,
<http://purl.org/linked-data/cube#ComponentSpecification>;
<http://purl.org/linked-data/cube#componentProperty> <http://192.168.1.30/some-qube.csv#dimension/d-code-list>;
<http://purl.org/linked-data/cube#dimension> <http://192.168.1.30/some-qube.csv#dimension/d-code-list> .
See attached for all files concerned; output.ttl is the result of the stated command above.
https://app.zenhub.com/files/374709947/70f81a88-3027-4b29-830b-5c6c8309cfa2/download
N.B. with URIs like .#dimension/d-code-list we end up with DSD URIs like http://192.168.1.30/#dimension/a-code-list, but we end up with CSV-derived URIs looking like http://192.168.1.30/some-qube.csv#dimension/d-code-list. So as long as we consistently point URIs at the CSV file, we shouldn't have any real problems (unless the user renames the file).

edit: #dimension/d-code-list and /#dimension/d-code-list URIs don't work any better either.
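Both behaviours match plain RFC 3986 resolution against the CSV's URL - e.g. checking with Python's urljoin rather than csv2rdf itself:

from urllib.parse import urljoin

csv_url = "http://192.168.1.30/some-qube.csv"
print(urljoin(csv_url, ".#dimension/d-code-list"))
# -> http://192.168.1.30/#dimension/d-code-list
print(urljoin(csv_url, "#dimension/d-code-list"))
# -> http://192.168.1.30/some-qube.csv#dimension/d-code-list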
Summary:

- We haven't yet found a way to generate URIs without .csv or .csv-metadata.json in them. Need to try a bit harder in getting an example with content negotiation working.
- We'd rather get the URIs we want out of the csv2rdf transformation as opposed to a find and replace.

Finally got a bit of a reprex working (which works from within a docker env too so I can use vscode's devcontainer stuff). Here I can host files and set some config through /etc/hosts to mimic the files being available at http://data.gov.uk/.
LOCAL_WORKSPACE_FOLDER=$PWD
docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--name http-server \
--add-host=data.gov.uk:0.0.0.0 \
-d python:latest \
python3 -m http.server 80
docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--network container:http-server \
--rm -it gsscogs/csv2rdf \
csv2rdf -u http://data.gov.uk/statistics/df.csv-metadata.json -m annotated -o output.ttl
docker stop http-server
docker rm http-server
So, say we had our CSV and CSVW metadata inside a folder named statistics.
./
├─ statistics/
│ ├─ df.csv
│ ├─ df.csv-metadata.json
Then with a trivial CSV:
one,two,three,four
1,2,3,4
...and CSVW metadata:
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "df.csv",
  "@id": "example-data",
  "http://ex.org": [
    {"@id": "a"},
    {"@id": "/b"},
    {"@id": "../c"},
    {"@id": "#d"}
  ],
  "tableSchema": {
    "columns": [
      {
        "name": "one",
        "datatype": "integer",
        "propertyUrl": "w"
      },
      {
        "name": "two",
        "datatype": "integer",
        "propertyUrl": "/x"
      },
      {
        "name": "three",
        "datatype": "integer",
        "propertyUrl": "../y"
      },
      {
        "name": "four",
        "datatype": "integer",
        "propertyUrl": "#z"
      }
    ]
  }
}
Then we get the following output:
# jsonld stuff
<http://data.gov.uk/statistics/example-data> <http://ex.org> <http://data.gov.uk/b>,
<http://data.gov.uk/c>, <http://data.gov.uk/statistics/a>, <http://data.gov.uk/statistics/df.csv-metadata.json#d> .
# csvw stuff
_:bnode__1 <http://data.gov.uk/statistics/df.csv#z> 4;
<http://data.gov.uk/statistics/w> 1;
<http://data.gov.uk/x> 2;
<http://data.gov.uk/y> 3 .
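For what it's worth, all of those URIs line up with by-the-book RFC 3986 resolution: the JSON-LD @ids resolve against the metadata document's URL, and the column propertyUrls against the CSV's URL. A quick check with Python's urljoin (a sketch, not csv2rdf's own code):

from urllib.parse import urljoin

metadata_url = "http://data.gov.uk/statistics/df.csv-metadata.json"
csv_url = "http://data.gov.uk/statistics/df.csv"

# JSON-LD @ids resolve against the metadata document's URL...
for ref in ["a", "/b", "../c", "#d"]:
    print(urljoin(metadata_url, ref))
# -> http://data.gov.uk/statistics/a
# -> http://data.gov.uk/b
# -> http://data.gov.uk/c
# -> http://data.gov.uk/statistics/df.csv-metadata.json#d

# ...while column propertyUrls resolve against the CSV's URL.
for ref in ["w", "/x", "../y", "#z"]:
    print(urljoin(csv_url, ref))
# -> http://data.gov.uk/statistics/w
# -> http://data.gov.uk/x
# -> http://data.gov.uk/y
# -> http://data.gov.uk/statistics/df.csv#z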
@rossbowen to do a final test to ensure that ../path works inside a 2-deep directory structure (making it more portable than the domain.com/path approach).
Summary is, it's behaving how we may expect any file system to behave.
Folder structure:
./
├─ one/
│ ├─ two/
│ │ ├─ three/
│ │ │ ├─ df.csv
│ │ │ ├─ df.csv-metadata.json
JSON-LD in CSVW:
"http://ex.org": [
{"@id": "../a"},
{"@id": "../../b"}
{"@id": "../../../c"}
],
Columns spec in CSVW:
"columns": [
{
"name": "one",
"datatype": "integer",
"propertyUrl": "../x"
},
{
"name": "two",
"datatype": "integer",
"propertyUrl": "../../y"
},
{
"name": "three",
"datatype": "integer",
"propertyUrl": "../../../z"
}
]
With the following docker command:
docker run \
-v $LOCAL_WORKSPACE_FOLDER:/workspace \
-w /workspace \
--network container:http-server \
--rm -it gsscogs/csv2rdf \
csv2rdf -u http://data.gov.uk/statistics/one/two/three/df.csv-metadata.json -m annotated -o output.ttl
Returns turtle looking like:
<http://data.gov.uk/statistics/one/two/three/example-data> <http://ex.org> <http://data.gov.uk/statistics/c>,
<http://data.gov.uk/statistics/one/b>, <http://data.gov.uk/statistics/one/two/a> .
_:bnode__1 <http://data.gov.uk/statistics/one/two/x> 1;
<http://data.gov.uk/statistics/one/y> 2;
<http://data.gov.uk/statistics/z> 3 .
I'm still prepping an example using content negotiation to see whether it's possible to craft fragment (#) URIs without the .csv in them.
@rossbowen We now have RDF uploaded to PMD. I've left it as-is with the .csv bits still in place. It would be good to have a firm idea where we're going with these URIs now that we can start uploading real data to PMD.
We've decided that the URIs we're currently using are good enough for now since they allow users to independently publish CSV-Ws with URIs which successfully dereference to a sensible document (the CSV). The RDF definitions of said terms can be generated by locating the conventionally named JSON-LD metadata file and converting the whole CSV-W to RDF.
We'll look to add support for users being able to request that we define URIs which don't make use of hash fragments so that where they have a suitable server in place, it can return information at the most granular level when the user dereferences a URI.
First, check if we can use the base-uri property in CSV-W to generate URIs for relative URIs. If we can, proceed by creating a Jenkins pipeline issue to allow for the configuration of the base-uri property. If we can't, chuck this over the fence to csv2rdf as a feature request.
The basic test case is whether csv2rdf breaks when the base-uri contains a URL: if it tries to find a remote CSV then it breaks; if it uses the local CSV then it works.
Investigate whether csv2rdf and csvlint follow the CSVW spec.