Swirrl / ook

Structural search engine
https://search-prototype.gss-data.org.uk/
Eclipse Public License 1.0

Make URI interpolation more robust #122

Closed Robsteranium closed 1 year ago

Robsteranium commented 2 years ago

ook.etl is choking on the latest data from beta because it includes URIs like:

http://gss-data.org.uk/data/climate-change/beis-2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source/2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source.csv#obs/CH4,CH4,3B14,2001,agriculture,wastes,horses-wastes,managed-manure,other-emissions,other-emissions,agricultural-horses@emissions-ar4-gwps

I'm not exactly sure what's causing it but I suspect the commas , and/or at-signs @ are to blame.

The naive string munging in ook.etl/insert-values-clause yields broken queries like:

VALUES ?observation { <"http://gss-data.org.uk/data/climate-change/beis-2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source/2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source.csv#obs/C2F6,PFCs,2B9b3,1990,industrial-processes,not-applicable,halocarbon-production,halocarbons-production-fugitive,other-emissions,other-emissions,non-fuel-combustion@emissions-ar4-gwps"> <"http://gss-data.org.uk/data/climate-change/beis-2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source/2020-uk-greenhouse-gas-emissions-final-figures-dataset-of-emissions-by-source.csv#obs/C2F6,PFCs,2B9b3,1991,industrial-processes,not-applicable,halocarbon-production,halocarbons-production-fugitive,other-emissions,other-emissions,non-fuel-combustion@emissions-ar4-gwps"> ... }

NB: each URI ends up wrapped in double-quotes inside the angle brackets, and SPARQL doesn't allow a double-quote character inside <...>.

We need to revise the approach. It's hopefully trivial to fix but if not we might want to reach for a proper library - I'll bet @andrewmcveigh's sparqler is a bit more robust!
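For illustration, here is a minimal sketch of a stricter way to build that clause. It is written in Python purely as a language-neutral example and is not ook's code; `iri_ref` and `values_clause` are hypothetical names, and the example URI is made up. It checks each URI against the characters the SPARQL 1.1 grammar forbids inside `<...>`, so a stray double-quote fails loudly instead of producing a broken query, while commas and at-signs pass through untouched.

```python
import re

# Characters the SPARQL 1.1 grammar forbids inside <...> (the IRIREF rule):
# < > " { } | ^ ` \ and anything in the range U+0000 to U+0020.
_FORBIDDEN_IN_IRIREF = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def iri_ref(uri: str) -> str:
    """Wrap a URI in angle brackets, rejecting characters IRIREF disallows."""
    match = _FORBIDDEN_IN_IRIREF.search(uri)
    if match:
        raise ValueError(
            f"character {match.group()!r} is not allowed in an IRI: {uri!r}"
        )
    return f"<{uri}>"


def values_clause(variable: str, uris: list[str]) -> str:
    """Build a single-variable VALUES clause (hypothetical helper)."""
    refs = " ".join(iri_ref(u) for u in uris)
    return f"VALUES ?{variable} {{ {refs} }}"


# Commas and at-signs are legal inside an IRI reference, so this is accepted.
uri = "http://example.org/data.csv#obs/CH4,2001@emissions"
clause = values_clause("observation", [uri])
```

With this in place, a value that still carries its surrounding double-quotes raises a `ValueError` at build time, which points at the upstream source of the quotes rather than at the query.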

Robsteranium commented 1 year ago

It wasn't actually string interpolation but parsing SPARQL results. @callum-oakley has resolved this in #124.
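The thread doesn't say which result format was involved or what #124 changed. Purely as an illustration of how result parsing can introduce stray quotes, the sketch below assumes CSV-formatted SPARQL results, where a value containing commas has to be quoted. The data, the `naive_column` and `parsed_column` helpers, and the CSV assumption itself are all hypothetical, and Python is used only as a neutral example language.

```python
import csv
import io

# A made-up SPARQL CSV result: one header row, then one URI per row.
# The URI contains commas, so the CSV format requires it to be quoted.
csv_results = (
    "observation\r\n"
    '"http://example.org/data.csv#obs/CH4,2001@emissions"\r\n'
)


def naive_column(text):
    """Treat each line as a value: the CSV quoting is left in place."""
    return text.splitlines()[1:]


def parsed_column(text):
    """Use a real CSV parser, which removes quoting and keeps commas intact."""
    rows = csv.reader(io.StringIO(text, newline=""))
    next(rows)  # skip the header row
    return [row[0] for row in rows]


naive = naive_column(csv_results)
parsed = parsed_column(csv_results)
```

Here `naive` holds the URI with its double-quotes still attached, whereas `parsed` holds the bare URI, commas and at-sign included.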