Swirrl / table2qb

A generic pipeline for converting tabular data into rdf data cubes
Eclipse Public License 1.0
13 stars 4 forks source link

Fix codelist generation when there is whitespace in input CSV column headings #115

Closed jennet closed 4 years ago

jennet commented 4 years ago

Closes #113

Robsteranium commented 4 years ago

I've been lurching to-and-fro on this one!

I like this from a tolerant reader perspective, and it's certainly user-friendly.

I do worry about silently fixing stuff though as it can hide problem and doesn't really help cultivate an understanding among users.

I wondered about using the validation error messages instead (what follows is in the context of #102).

This is fine for e.g. trailing white space on the "Label" column header:

Missing required columns: Label. Found columns: "Label ","Notation","Parent Notation"

A comparison of the two lines should make the mistake apparent.

But if there's trailing white space on "Notation", because that column can't be found we add it in with a default derived from the label. This results in the message:

Unexpected columns: "Notation "

Which isn't quite as obvious.

Moreover, I noticed that #102 introduces white space trimming for the component attachment in the columns csv. If we're doing that there, then we may as well do that here.

We probably ought to merge #102 first then update this PR to match before merging it too.

lkitching commented 4 years ago

@Robsteranium - I think the current plan is to avoid silently modifying the inputs and instead to improve the error reporting to make the issues easier to identify and fix. So for consistency I think we should avoid trimming the inputs in #102 and improve the error messages there.

Robsteranium commented 4 years ago

Ok that makes sense.

I think then that #102 will solve #113 in the sense that we won't output an unusable codelist csvw.

I think the above error message is ok, although I can imagine people overlooking it - especially if they're not viewing it in a monospace font!

I suppose it might be an improvement to have something like:

Unexpected columns: "Notation " Found columns: "Label". Derived column: "Notation"

But this might lead to even more confusion!

What do you think @jennet?

RickMoynihan commented 4 years ago

I like this from a tolerant reader perspective, and it's certainly user-friendly.

@Robsteranium I think we've spoken about this before, but in discussions with Alex about handling various edge cases of broken RDF; and I don't think the Tolerant Reader pattern means you should coerce things that you can into data you understand. I think it just means you should ignore structure that you don't care about or understand i.e. don't depend on incidental details, write code that's open for extension, and doesn't break when somebody else adds a new key to a json map for example.

I think regarding coercion specifically though; table2qb is (correct me if I'm wrong) is usually targetting data that's been created by another process. i.e. we're already in the land of automation; so I think having a stricter process makes sense, along with tighter constraints on what is acceptable and more explicit failures.

That's not to say that we don't need a validation step, just that it might be better happening in its own phase, that can be run independently.

Robsteranium commented 4 years ago

Yeah, I guess I'm referring to the Postel's Law bit underneath - that systems are more robust if you can be more forgiving in what you accept as input (while being no less exact in your own outputs). In this case we can probably forgive accidental white space without too much risk (it's unlikely that we'll ever want to treat trailing or leading white space as having a semantic value, at least in the headers).

Manually created input is a valid use case for table2qb and indeed that's the context in which #113 arose. One of the original ideas for the architecture was to have a relaxed input which was processed into a stricter one so this approach is at least consistent with that.

That said, I feel like the validation from #102 is probably enough, but I don't have a strong opinion!

RickMoynihan commented 4 years ago

In that case I guess it makes sense to trim whitespace in the headers 👍 and you're right that they rarely carry semantic value, at least in column headers.

My issue with postels law is that once you concede to it you end up forever giving ground to it; recreating bug for bug compatibility with 3rd party software systems. Rather than motivating patching upstream systems, it tends to perpetuate sloppiness throughout the system; making it unclear who's job it to fix any error; and establishing that slop as a defacto standard. Postels law caused decades of problems with HTML, that eventually needed to restandardised into HTML5, at huge expense. If a browser with market dominance had been stricter about its inputs on day one then the web would have been a lot better; instead we got a race to the bottom that drove the cost of browser creation through the roof.

Regardless I'm not denying the need for supporting relaxed inputs; just consider me voting for processing them into stricter ones! :-)

Robsteranium commented 4 years ago

It might be worth thinking more concretely about the risk. I'd say the main risk here is that we miss out on an opportunity to "teach" the user about white space. As such we should expect to continue to find leading/ trailing whitespace everywhere!

If we trim headers we'll need to do so in all of the pipelines. We'll also need to do so in cells too given that a)t headers also match against cells in e.g. the columns configuration and that b) we'll start to see e.g. multiple labels per code URI that will be impossible to distinguish in the UI where whitespace isn't always apparent.

Once we get to cells it starts to be more risky. A trailing space could provide a hint that a value had been truncated or that a piece was missing, for example. I'd say the likelihood is lower, but the impact is higher as we wouldn't be able to guard against this sort of mistake with validation and trimming strings by default would potentially hide the problem. Instead, stricter validation might encourage the user to be more careful about their input (and the preparation process behind it).

Robsteranium commented 4 years ago

Ok, following a phone call we've decided not to trim whitespace for now.

I've created another issue to capture the idea of introducing this with a feature flag to toggle "relaxed vs strict" processing: #122.

I've also removed the trimming in the component-type look up as part of #102 discussion thread.

Trimming as part of sluggising (which converts whitespace to hypens and trims any trailing hyphens) which happens as labels are processed into URIs will remain.

Thanks all.