Swirrl / csv2rdf

Clojure library and command line application for converting CSV to RDF. An implementation of the W3C CSVW specifications
Eclipse Public License 1.0
25 stars 6 forks source link

csv2rdf is tolerant of invalid JSON #62

Closed RickMoynihan closed 9 months ago

RickMoynihan commented 3 years ago

I was surprised to discover that csv2rdf is tolerant of broken json in its metadata files, in particular missing commas etc.

Without wanting to open up the tolerant reader debate I think it would be better if we could also require that JSON always be valid, and raise an error when it's not.

Personally I find JSON a bit awkward to write (it's very easy to accidentally leave superfluous comma or colon delimiters in the json or accidentally omit them), and I worry that others crafting JSON metadata without extra tooling support will create JSON that is invalid and only we can parse.

I think it would be better to require strict parsing to ensure that people are conservative in what they create.

It looks like this is an unexpected behaviour we have inherited from clojure.data.json/read-str:

;; with missing comma (invalid json) parses fine:
user> (clojure.data.json/read-str "{\"foo\":\"bar\" \"baz\":\"blah\"}") ;;=> {"foo" "bar", "baz" "blah"}
;; valid json with comma
user> (clojure.data.json/read-str "{\"foo\":\"bar\", \"baz\":\"blah\"}") ;;=> {"foo" "bar", "baz" "blah"}

I don't think the data.json parser supports a strict parsing flag though so we might not be able to support this without changing JSON parser in which case we may wish to drop support for tolerant reading of JSON; or pick up a dependency on another parser that supports strict mode. I'd very much favour not having the extra dependency, but it may depend on whether we expect to handle invalid JSON.

Incidentally I checked the JSON spec, and it looks like strictly speaking it's valid for JSON processors to interpret broken syntax; they're just required to also handle valid JSON.

RickMoynihan commented 3 years ago

ONS lint their JSON with csvlint which also validates CSVW and JSON, so they're unlikely to have invalid JSON metadata, so changing the default here shouldn't pose too many problems.

There is always the argument about being tolerant of 3rd party broken JSON though, however I'd rather require people opt in to any such behaviour behind a command line flag.

Choices seem to be:

  1. Leave as is (helping people create invalid JSON)
  2. Switch to a strict parse only (being intollerant of invalid 3rd party JSON)
  3. Support a flag e.g. -tolerate-broken-json for enabling a tolerant parse.

3 is arguably the most desireable from a user perspective, however ideally not at the expense of having to use two different json parsers.

If we can find a suitable parser that supports both a strict and a lax mode we could switch to that to support 3, otherwise I think it's arguably better to pick 2 and switch to a strict parse. How much broken JSON do we expect to have to handle in the wild? I'm guessing not that much, particularly if we're advocating a tool set that emphasises correctness.

stevorobs3 commented 9 months ago

The version of clojure.data.json that we use is now a stricter parser, so it looks like we'll need to go with option 2.

(json/read-str "{\"foo\":\"bar\", \"baz\":\"blah\"}") => {"foo" "bar", "baz" "blah"} (json/read-str "{\"foo\":\"bar\" \"baz\":\"blah\"}") Execution error at clojure.data.json/read-object (json.clj:312). JSON error (missing entry in object)

stevorobs3 commented 9 months ago

Closed by this commit: https://github.com/Swirrl/csv2rdf/commit/de1ec6fc761cd97c9031ae91be2b1dd495acf25f