Open jennet opened 4 years ago
I think this is potentially related (though not the same as) your whitespace issues too:
https://github.com/Swirrl/table2qb/issues/111 https://github.com/Swirrl/table2qb/issues/113
+
is unfortunately another encoding for ` (space) so it might be that it is either being stripped for this reason, or it is being url decoded into
http://example.org/def/concept/age/100 ` and then trimmed somewhere else.
https://stackoverflow.com/questions/2678551/when-to-encode-space-to-plus-or-20
Not sure what to do about it; it will depend and likely be subtle. I think the best bet is to possibly change our slugify implementation to convert +
into something like plus
-- but that has problems too.
In csv2rdf.uri-template the following happens:
(expand-template (parse-template "http://example.org/def/concept/age/{age}") {:age "100+"}) ;; => #object[java.net.URI 0x54f684ed "http://example.org/def/concept/age/100%2B"]
@jennet ok as I think we suspected this appears to be caused by table2qb slugize.
I don't know exactly what code path is being used in your case, (as I don't have your pipeline/config etc available) but I suspect you're somehow calling slugize which I think will end up using this implementation:
Essentially +
's get replaced with -
s then any slugs ending in -
are stripped off. The regex responsible for this is the line: (-> string ,,, (clojure.string/replace #"[^\w/]" "-"))
where the ^\w
means any non-word character; which includes +
.
so:
(slugize "100+") ;;=> "100"
Strictly speaking I don't know that this is the cause of your problem; but if this were being used it would cause your issue; so I think it's highly likely this is the root cause.
Potential fixes:
100 and over
which will slugize better.+
with -
.@RickMoynihan thanks for looking into this. I think option 1 is the safest for now until we have got a more thorough development roadmap in place
Agreed. 2 is probably pretty easy to add, but might as well work around with 1 at this stage.
I feel like we did this deliberately for sns-graft, but I forget why.
Given that a +
is a valid part of a URI path then we don't need to strip it out. Tbh, the slugize action is probably a bit aggressive in rejecting non-words anyway. Perhaps there's a version available somewhere written by someone who's been through all the RFCs!
Not sure if the problem will be the value template tags or something else, but if the code list has a concept with URI e.g.
http://example.org/def/concept/age/100+
and the input CSV has a value "100+" (which is correctly set up in the columns.csv to convert tohttp://example.org/def/concept/age/{age}
it generateshttp://example.org/def/concept/age/100
, losing the+
which then results in a broken link.