chop-dbhi / data-models-validator

Set of tools for validating data that should conform to a data model.

Handle other quoting conventions #24

Closed murphyke closed 8 years ago

murphyke commented 8 years ago

This issue is partly to document that the validator requires quoted values (except for empty ones). For general use, it might be better to have more flexible parsing. Some tools, like Excel, quote only as necessary, and most CSV parsers have a parsing mode that accommodates this.

murphyke commented 8 years ago

@bruth The v2.2 data is being submitted using new requirements. In accordance with Aaron's PEDSnet DCC automation plan, can you make the necessary changes to the validator?

The changes can be summarized as "use normal CSV parsing", i.e.: quotation marks are not required around values (except when a text field contains a comma), but quotation marks may be used in any case. In particular, it's OK to have empty values that are not surrounded by quotes.

Oracle SQL Developer and SQLcl will output CSV like the following:

$ cat export.sqldeveloper.csv 
"ASTR","BSTR","ANINT","ADATE","ANUM"
"1/2"" tube","""why not"" he asked",12,,1234567890.123456789
"","huh",,2016-01-11 06:00:56,1234567890.123456789
"plain","plane",,2016-01-26 06:00:47,1234567890.123456789

That is, string fields are unconditionally quoted, and nothing else is.

PostgreSQL has more control over CSV output, but the default format for the above data is as follows:

$ cat export.pg.csv
astr,bstr,anint,adate,anum
"1/2"" tube","""why not"" he asked",12,,1234567890.123456789
"",huh,,2016-01-11 06:00:56,1234567890.123456789
plain,plane,,2016-01-26 06:00:47,1234567890.123456789

That is, nothing is quoted unless it absolutely has to be. The astr value in the second row is an empty string, which PostgreSQL quotes in order to distinguish it from a NULL (which would be left unquoted). We are going to change our import code so that empty strings and NULL strings are treated the same.

murphyke commented 8 years ago

@bruth When do you think you might be able to look at this one? I suspect we could move to using the vanilla Go csv package, which would actually be much better for a pseudo-generic data-models-validator, anyway. Thanks!

bruth commented 8 years ago

I will in a bit. The reason we moved away from the built-in csv package was that it was quite slow at these data sizes.

murphyke commented 8 years ago

Interesting about the slowness.

bruth commented 8 years ago

The other reason was control over parse errors. Once an error occurs in the reader, you cannot recover from it. The initial strategy was to buffer lines and initialize new readers every time an error occurred, but that added more overhead.

bruth commented 8 years ago

it's OK to have empty values that are not surrounded by quotes

This is already supported. The test file comments on several situations.

So I just need to handle non-empty unquoted values, correct?

murphyke commented 8 years ago

@bruth Yes, so a header line like:

astr,bstr,anint,adate,anum

would not be flagged as an error.

bruth commented 8 years ago

@murphyke Opened PR to handle unquoted fields. This for loop contains the relevant logic. Now the scan only fails if a double quote is found inside an unquoted field, since double quotes are expected to appear only within quoted values (and escaped). Does this match the expectation?
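The rule can be illustrated with a toy scanner (hypothetical, not the PR's actual for loop): split on commas and fail only when a bare double quote shows up in an unquoted field. A real scanner must of course also handle quoted fields and escaped quotes; this only shows the failure condition:

```go
package main

import (
	"fmt"
	"strings"
)

// scanUnquoted splits one line on commas and rejects any unquoted
// field containing a double quote, mirroring the "fail only on a
// stray double quote" rule. Illustrative only.
func scanUnquoted(line string) ([]string, error) {
	fields := strings.Split(line, ",")
	for i, f := range fields {
		if strings.Contains(f, `"`) {
			return nil, fmt.Errorf("field %d: unexpected double quote in unquoted value", i+1)
		}
	}
	return fields, nil
}

func main() {
	fmt.Println(scanUnquoted("astr,bstr,anint,adate,anum")) // header scans cleanly
	fmt.Println(scanUnquoted(`plain,pl"ane`))               // stray quote is an error
}
```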

murphyke commented 8 years ago

@bruth Thanks. Yes, that is right. I'll look at the PR.