Swirrl / table2qb

A generic pipeline for converting tabular data into RDF data cubes
Eclipse Public License 1.0

cube-pipeline ignores + character when generating code list URIs #114

Open jennet opened 4 years ago

jennet commented 4 years ago

Not sure if the problem is in the value template tags or something else, but if the code list has a concept with a URI like http://example.org/def/concept/age/100+ and the input CSV has the value "100+" (which is correctly set up in columns.csv to convert to http://example.org/def/concept/age/{age}), it generates http://example.org/def/concept/age/100, losing the + and resulting in a broken link.

RickMoynihan commented 4 years ago

I think this is potentially related to (though not the same as) your whitespace issues too:

https://github.com/Swirrl/table2qb/issues/111 https://github.com/Swirrl/table2qb/issues/113

`+` is unfortunately another encoding for a space, so it might be that it is either being stripped for this reason, or it is being URL-decoded into `http://example.org/def/concept/age/100 ` (with a trailing space) and then trimmed somewhere else.

https://stackoverflow.com/questions/2678551/when-to-encode-space-to-plus-or-20
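The ambiguity is easy to demonstrate with Python's standard library (an illustrative aside, not table2qb code): a literal `+` has to be percent-encoded precisely because form-style decoding treats a bare `+` as a space.

```python
from urllib.parse import quote, unquote_plus

# A literal "+" in a path must be percent-encoded as %2B...
print(quote("100+"))         # -> 100%2B

# ...because form-style decoding turns a bare "+" into a space.
print(unquote_plus("100+"))  # -> "100 " (note the trailing space)
```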

Not sure what to do about it; the right fix will depend on context and is likely to be subtle. I think the best bet is possibly to change our slugify implementation to convert `+` into something like `plus` -- but that has problems too.
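For illustration, a variant along those lines could spell out the `+` before the generic replacement rule runs. This is a Python sketch of the idea only, not the actual Clojure implementation, and the simplified slugize rules here are an assumption:

```python
import re

def slugize_spelling_plus(s: str) -> str:
    # Hypothetical variant: turn "+" into "-plus" *before* the generic
    # "replace non-word characters with -" rule can destroy it.
    s = s.strip().lower().replace("+", "-plus")
    s = re.sub(r"[^\w/]", "-", s)   # everything else still slugizes as before
    s = re.sub(r"-+", "-", s)       # collapse runs of "-"
    return s.strip("-")
```

So `"100+"` would become `"100-plus"`. One of the problems alluded to above: that slug now collides with an input that literally said "100 plus".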

RickMoynihan commented 4 years ago

In csv2rdf.uri-template the following happens:

(expand-template (parse-template "http://example.org/def/concept/age/{age}")
                 {:age "100+"})
;; => #object[java.net.URI 0x54f684ed "http://example.org/def/concept/age/100%2B"]

RickMoynihan commented 4 years ago

@jennet ok, as I think we suspected, this appears to be caused by table2qb's slugize.

I don't know exactly what code path is being used in your case (as I don't have your pipeline/config etc. available), but I suspect you're somehow calling slugize, which I think will end up using this implementation:

https://github.com/Swirrl/grafter-extra/blob/f4422fde5a8413be5313fb2bfdc3c9751a422be0/src/grafter/extra/cell/uri.clj#L5-L13

Essentially +'s get replaced with -'s, and then any trailing -'s are stripped off. The line responsible is: (-> string ,,, (clojure.string/replace #"[^\w/]" "-")), where the character class [^\w/] matches any character that is not a word character or /, which includes +.

so:

(slugize "100+") ;;=> "100"

Strictly speaking I don't know that this is the cause of your problem, but if this code were being used it would produce exactly your issue, so I think it's highly likely to be the root cause.
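The same truncation can be reproduced with a rough Python mirror of that implementation (the exact rule set here is an assumption based on reading the linked source):

```python
import re

def slugize(s: str) -> str:
    # Rough mirror of grafter-extra's slugize: any character that is not a
    # word character or "/" becomes "-", runs of "-" collapse, and any
    # trailing "-" is stripped.
    s = re.sub(r"[^\w/]", "-", s.strip().lower())
    s = re.sub(r"-+", "-", s)
    return s.rstrip("-")

print(slugize("100+"))  # the "+" becomes "-", then the trailing "-" is stripped
```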

Potential fixes:

  1. Do as Bill suggests and change the input to "100 and over", which will slugize better.
  2. Introduce a different slugizer, configurable per column, that doesn't replace + with -.
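Option 2 could be sketched as a per-column lookup, where a column opts into a slugizer that leaves + alone. All names here are hypothetical, and table2qb's actual configuration mechanism is columns.csv, not a dict like this:

```python
import re

def default_slugize(s: str) -> str:
    # Default rule (simplified): non-word characters become "-",
    # so "100+" -> "100" after the trailing "-" is stripped.
    s = re.sub(r"[^\w/]", "-", s.strip().lower())
    return re.sub(r"-+", "-", s).strip("-")

def plus_preserving_slugize(s: str) -> str:
    # Same rule, but "+" is added to the allowed character set.
    s = re.sub(r"[^\w/+]", "-", s.strip().lower())
    return re.sub(r"-+", "-", s).strip("-")

# Hypothetical per-column override; unlisted columns use the default.
SLUGIZERS = {"age": plus_preserving_slugize}

def cell_to_slug(column: str, value: str) -> str:
    return SLUGIZERS.get(column, default_slugize)(value)
```

With this, `cell_to_slug("age", "100+")` keeps the `+` while other columns behave as before. Note that a raw + in the path still carries the space/%2B ambiguity discussed above, so a URI-template library may percent-encode it anyway.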
jennet commented 4 years ago

@RickMoynihan thanks for looking into this. I think option 1 is the safest for now, until we have a more thorough development roadmap in place.

RickMoynihan commented 4 years ago

Agreed. 2 is probably pretty easy to add, but we might as well work around it with 1 at this stage.

Robsteranium commented 4 years ago

I feel like we did this deliberately for sns-graft, but I forget why.

Given that a + is a valid character in a URI path, we don't need to strip it out. Tbh, the slugize function is probably a bit aggressive in rejecting non-word characters anyway. Perhaps there's a version available somewhere written by someone who's been through all the RFCs!