ONSdigital / csvw-check

A CLI to validate CSV-Ws (W3C's CSV on the Web standard).
Apache License 2.0

Implement Number Format Parser for validation #78

Closed: josepajay closed this issue 2 years ago

josepajay commented 3 years ago

After looking into the problems behind issues #76 and #74, we've concluded that we're not implementing number format validation in the way the W3C CSV-W spec suggests we should. Admittedly, the spec isn't very clear on how the validation should work, but the test cases suggest that we're not quite doing it right.

For instance, a format using an optional digit character (#) in the exponent part, e.g. 000.00E#0, suggests that the number 123.45E67 should be valid, but that 12345E678 should not be, because it has too many digits. Unfortunately, the IBM-ICU library doesn't recognise this format, even though it appears in one of the CSV-W test cases.
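To make the expected behaviour concrete, here is a minimal Python sketch of the reading described above. It is an illustration only, not the code in csvw-check: it assumes each `0` stands for exactly one required digit, each `#` for at most one optional digit, and every other character for itself. The names `pattern_to_regex` and `is_valid` are hypothetical, and the real UTS-35 / CSV-W rules (grouping, signs, percent, unbounded integer digits) are deliberately ignored.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified number format pattern into a regex.

    Assumption (not taken from the spec): '0' is one required digit,
    '#' is one optional digit, anything else is a literal character.
    """
    parts = []
    for ch in pattern:
        if ch == "0":
            parts.append("[0-9]")    # required digit
        elif ch == "#":
            parts.append("[0-9]?")   # optional digit
        else:
            parts.append(re.escape(ch))  # literal such as '.' or 'E'
    return re.compile("".join(parts))


def is_valid(pattern: str, value: str) -> bool:
    """Return True when the whole value matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["123.45E67", "123.45E7", "12345E678"]:
        print(candidate, is_valid("000.00E#0", candidate))
```

Under these assumptions, 123.45E67 and 123.45E7 are accepted, while 12345E678 is rejected: it has no decimal point where the pattern requires one, and three exponent digits where the pattern allows at most two.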

Essentially, the IBM-ICU library is focused on using the UTS-35 spec to format numbers into strings, whereas the W3C CSV-W spec requires us to use the same pattern characters to parse numbers. Optional digits therefore add complexity that we can't work around without writing our own parser.

So we need to prototype and implement a parser that ensures we pass the W3C CSV-W test cases involving numbers:

See the W3C validation test cases here

robons commented 3 years ago

I've done the work to implement an initial approach to validating number formats (and the numbers represented with them) using Scala's parser combinator functionality. I've added a number of unit tests to describe and ensure the functionality.
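For readers unfamiliar with the technique, the following is a rough Python analogue of the parser combinator idea. It is not the Scala implementation referred to above and does not use Scala's combinator library. Every name here (`digits`, `literal`, `sequence`, `compile_format`, `matches`) is hypothetical, and the pattern semantics are the same simplified assumption as before: `0` is a required digit, `#` an optional digit, everything else a literal.

```python
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]

ASCII_DIGITS = "0123456789"


def literal(ch: str) -> Parser:
    """Match exactly one literal character."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + 1 if text.startswith(ch, pos) else None
    return parse


def digits(min_count: int, max_count: int) -> Parser:
    """Greedily match between min_count and max_count ASCII digits."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = pos
        while end < len(text) and end - pos < max_count and text[end] in ASCII_DIGITS:
            end += 1
        return end if end - pos >= min_count else None
    return parse


def sequence(*parsers: Parser) -> Parser:
    """Run parsers one after another; fail as soon as one fails."""
    def parse(text: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for parser in parsers:
            current = parser(text, current)
            if current is None:
                return None
        return current
    return parse


def compile_format(fmt: str) -> Parser:
    """Build a parser from a simplified format: runs of '#'/'0' become digit ranges."""
    parsers = []
    i = 0
    while i < len(fmt):
        if fmt[i] in "#0":
            j = i
            while j < len(fmt) and fmt[j] in "#0":
                j += 1
            run = fmt[i:j]
            # Required digits set the minimum; the run length sets the maximum.
            parsers.append(digits(run.count("0"), len(run)))
            i = j
        else:
            parsers.append(literal(fmt[i]))
            i += 1
    return sequence(*parsers)


def matches(fmt: str, value: str) -> bool:
    """True only if the parser consumes the entire value."""
    return compile_format(fmt)(value, 0) == len(value)
```

Collapsing each run of `#` and `0` into a single "between min and max digits" parser is a deliberate choice in this sketch: a naive optional-digit combinator followed by a required-digit combinator would consume the only digit in an input like `7` and then fail, because simple sequencing does not backtrack.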

Work yet to do: