Swirrl / csv2rdf

Clojure library and command line application for converting CSV to RDF. An implementation of the W3C CSVW specifications
Eclipse Public License 1.0

Explicitly fail if no metadata file is located #188

Closed RickMoynihan closed 11 months ago

RickMoynihan commented 2 years ago

Related to issue #186: when running csv2rdf with just a -t table, csv2rdf does not locate the metadata document and instead performs the default conversion.

The default conversion generates a literal RDF representation of the CSV, which is of little use to us in most cases. It would usually be better to fail with an explicit error rather than proceeding to generate data of little practical value.

I'd suggest we:

  1. Fail hard in the above case, writing an error message to stderr.
  2. Provide a new command line flag --proceed-without-metadata to engage the current behaviour (generating the default RDFization of the literal CSV where there is no metadata document).
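
To make the proposal concrete, here is a minimal sketch of the control flow in Python. It is not csv2rdf code: `find_metadata` and `main` are hypothetical names, `--proceed-without-metadata` is only the flag suggested above, and the lookup assumes a local file and checks just the two CSVW default locations (`<file>-metadata.json` and `csv-metadata.json` in the same directory).

```python
import argparse
import sys
from pathlib import Path


def find_metadata(csv_path: Path):
    """Return the first existing default metadata file for csv_path, or None.

    Assumption: only the two CSVW default locations next to a local file
    are checked; Link headers and site-wide configuration are ignored.
    """
    candidates = [
        csv_path.with_name(csv_path.name + "-metadata.json"),
        csv_path.with_name("csv-metadata.json"),
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def main(argv):
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("-t", dest="tabular", required=True, type=Path)
    # Hypothetical flag from the proposal; not an existing csv2rdf option.
    parser.add_argument("--proceed-without-metadata", action="store_true")
    args = parser.parse_args(argv)

    metadata = find_metadata(args.tabular)
    if metadata is None and not args.proceed_without_metadata:
        # Fail hard: report on stderr and return a non-zero exit status.
        print(f"error: no metadata document found for {args.tabular}",
              file=sys.stderr)
        return 1

    # A real tool would run the conversion here (with or without metadata).
    return 0
```

Calling `main(["-t", "data.csv"])` returns 1 when neither default metadata file exists, and 0 once `--proceed-without-metadata` is added or a metadata file is present.
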

Robsteranium commented 2 years ago

This would contradict the spec by default - I'd be inclined to invert the behaviour of that option.

I wonder if it'd be useful to surface the result of the steps taken to locate the metadata (though idk how easy it'd be to work with this via the CLI).
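
One possible shape for surfacing those steps, sketched in Python rather than taken from csv2rdf: record each location that was tried and whether anything was found there, then print the trail. The function names are hypothetical, and only the two CSVW default locations for a local file are covered; a fuller version would also report any Link header and site-wide configuration lookups for remote files.

```python
from pathlib import Path


def metadata_search_report(csv_path: Path):
    """Return (candidate_path, exists) pairs in the order they are tried.

    Assumption: a local CSV file, so only '<file>-metadata.json' and
    'csv-metadata.json' in the same directory are considered.
    """
    candidates = [
        csv_path.with_name(csv_path.name + "-metadata.json"),
        csv_path.with_name("csv-metadata.json"),
    ]
    return [(str(c), c.is_file()) for c in candidates]


def format_report(report):
    """Render the search trail as one human-readable line per location."""
    lines = []
    for location, found in report:
        status = "found" if found else "not found"
        lines.append(f"tried {location}: {status}")
    return "\n".join(lines)
```

Printing `format_report(metadata_search_report(Path("data.csv")))` to stderr would give one line per location tried, which keeps stdout free for the RDF output.
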

RickMoynihan commented 11 months ago

Ok, it looks like @Robsteranium is correct, and the spec results in us using "embedded metadata", which is all optional and undefined. However, that section says the following (in the case where no explicit embedded metadata is used):

Parsing based on the default dialect for CSV, as described in 8. Parsing Tabular Data, will extract column titles from the first row of a CSV file.

So this then becomes our fallback "metadata document" which results in the useless RDF.
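
To illustrate how little that fallback contains, here is a simplified Python sketch (not csv2rdf's implementation) that builds a metadata-like structure from nothing but the first row. The function name `embedded_metadata` is hypothetical, and the output is deliberately reduced to column titles; it has no datatypes, URI templates, or property URLs, which is why the resulting RDF is just a literal restatement of the CSV.

```python
import csv
import io


def embedded_metadata(csv_text: str, url: str):
    """Build a minimal table description from the header row only.

    Assumption: default dialect, i.e. comma-separated with a single
    header row holding the column titles.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields no columns
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": url,
        "tableSchema": {
            "columns": [{"titles": [title]} for title in header]
        },
    }
```

For the input `"name,age\nAlice,30\n"` the only thing the schema knows about is two columns titled `name` and `age`; every cell value would be treated as a plain string.
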

If we're to be spec conformant we would need to

  1. Log a warning when only -t is supplied and we have fallen back to using embedded metadata.
  2. Support a flag that fails if we've fallen back to implicit/embedded metadata.

However, after some more reflection I think it may be better to deviate from the spec in this regard, and fail fast on the RDFization in this case.

I just don't think the output data is useful at all, or ever what anyone would want or expect. This feels very much like an accidental outcome of the spec.

I think we should just change the behaviour. We could add an option in the future to be spec compliant in this regard, but I honestly think nobody would ever want to enable it :-)

lkitching commented 11 months ago

While the 'embedded' output is rarely useful, it's not clear what benefit there would be to deviating from the spec here? If it's to guard against accidentally failing to supply a metadata document, this would be obvious in the output.

RickMoynihan commented 11 months ago

We've agreed to close this, because you should normally only be RDFizing and expecting meaningful output if you have a metadata document. And if you have a metadata document, in an automated context it's always better to start explicitly from there rather than from the CSV.