chop-dbhi / data-models-validator

Set of tools for validating data that should conform to a data model.

Define a standard set of errors #2

Closed bruth closed 8 years ago

bruth commented 9 years ago

The validator's job is to detect and aggregate errors for an input file. Each reported error will include the number of the line that contains it.

Proposed errors:

/cc @chop-dbhi/data-models Please add to this based on your experience and suggest alternate messages that can be reported.
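
A minimal sketch of the "detect and aggregate, report the line number" behavior described above. The names `ValidationError`, `Check`, `Validate`, and `NotBlank` are hypothetical and chosen only for illustration; they are not taken from the validator's code.

```go
package main

import (
	"fmt"
	"strings"
)

// ValidationError is a hypothetical record of one problem and the line it was found on.
type ValidationError struct {
	Line    int    // 1-based line number in the input file
	Message string // human-readable description
}

// Check inspects one line and returns a message, or "" if the line passes.
type Check func(line string) string

// Validate runs every check against every line and collects all failures
// instead of stopping at the first one.
func Validate(lines []string, checks ...Check) []ValidationError {
	var errs []ValidationError
	for i, line := range lines {
		for _, check := range checks {
			if msg := check(line); msg != "" {
				errs = append(errs, ValidationError{Line: i + 1, Message: msg})
			}
		}
	}
	return errs
}

// NotBlank is an illustrative check that flags empty or whitespace-only lines.
func NotBlank(line string) string {
	if strings.TrimSpace(line) == "" {
		return "line is blank"
	}
	return ""
}

func main() {
	lines := []string{`"1","3"`, "", `"2","4"`, "   "}
	for _, e := range Validate(lines, NotBlank) {
		fmt.Printf("line %d: %s\n", e.Line, e.Message)
	}
}
```

Collecting into a slice rather than returning on the first failure is what lets a site see every problem in a file in one pass.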

bruth commented 9 years ago

For values that have a fixed set of choices, another error can be introduced to ensure the value matches one of the choices. Unfortunately, this information is not available in the data model definition at this time. Would it be worth the time investment to add this information?
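
If the data model did carry choices, the check itself could be small. The sketch below assumes a hypothetical `Field` type with a `Choices` slice; neither the type nor the `dose_unit` field name comes from the data model definition.

```go
package main

import "fmt"

// Field is a hypothetical, trimmed-down field definition. The Choices slice
// stands in for the information the data model does not carry today.
type Field struct {
	Name    string
	Choices []string // empty means "no fixed set of choices"
}

// CheckChoice returns an error when value is not one of the field's choices.
func (f Field) CheckChoice(value string) error {
	if len(f.Choices) == 0 {
		return nil // nothing to check against
	}
	for _, c := range f.Choices {
		if c == value {
			return nil
		}
	}
	return fmt.Errorf("field %q: value %q is not one of %v", f.Name, value, f.Choices)
}

func main() {
	unit := Field{Name: "dose_unit", Choices: []string{"mg", "mL"}}
	fmt.Println(unit.CheckChoice("mL"))
	fmt.Println(unit.CheckChoice("gallon"))
}
```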

murphyke commented 9 years ago

The choices feature ... so attractive. Maybe it should be the first thing implemented after the validator is working, since we don't do choice-checking now.

What file types will be supported? Presumably the PEDSnet flavor of CSV, initially?

Re: proposed errors, aside from the field value errors you mention, the errors that we have seen in loading the PEDSnet CSV files include:

bruth commented 9 years ago

> What file types will be supported? Presumably the PEDSnet flavor of CSV, initially?

Yes the PEDSnet flavor of CSV is the first format.

Interesting that you run up against formatting errors. I would expect/hope that whatever tool the sites are using to generate the files would handle these things. Have the sites been surveyed about which tools they use to generate the CSV files?

murphyke commented 9 years ago

I don't think we've surveyed the sites, but it's pretty clear that some of them are using hand-rolled CSV export code. Also, since there are various flavors of CSV (which is one reason it's kind of a crappy format), even if the sites use robust export tools, they could wind up sending a conditionally quoted CSV file, or an unquoted CSV file, or one with a different placeholder for NULL, etc.

bruth commented 9 years ago

> [...] but it's pretty clear that some of them are using hand-rolled CSV export code.

Well that is terrible.

> since there are various flavors of CSV (which is one reason it's kind of a crappy format)

I definitely agree. Given the difficulties with producing and consuming sane CSV files, do you think the sites would be willing to try a different format? However, if sites are hand-rolling a CSV writer, I fear they would try to hand-roll other formats as well.

bruth commented 9 years ago

I started a PR (https://github.com/chop-dbhi/data-models/pull/84) to document examples of generating CSV files in common languages and environments. We can point sites to this so they don't attempt to hand-write their own.

gracebrownecodes commented 9 years ago

I have definitely run into encoding issues. A subtle version of type mismatch is "" in a non-string field, which is essentially a confusion of NULL and empty string. Another is the time-in-a-datetime-field phenomenon. The non-escaped quote mark is probably the most common one I see, and it's deceptively difficult to parse. For example, try parsing the following line:

"1","3","2015-08-12","",,"We did "foo", instead of "bar"","mL"

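As a rough illustration of why that line is hard, the sketch below feeds it to Go's standard `encoding/csv` reader with its default strict quoting rules. The helper names `ParseLine` and `IsQuoteError` are hypothetical; this is not how the validator itself parses files.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// Sample is the line quoted in the comment above.
const Sample = `"1","3","2015-08-12","",,"We did "foo", instead of "bar"","mL"`

// ParseLine reads a single CSV record using the reader's default (strict) settings.
func ParseLine(line string) ([]string, error) {
	return csv.NewReader(strings.NewReader(line)).Read()
}

// IsQuoteError reports whether err is the reader's complaint about a stray
// quote inside a quoted field.
func IsQuoteError(err error) bool {
	return errors.Is(err, csv.ErrQuote)
}

func main() {
	_, err := ParseLine(Sample)
	var perr *csv.ParseError
	if errors.As(err, &perr) {
		// ParseError carries the line number, which a validator can report.
		fmt.Printf("line %d: %v\n", perr.Line, perr.Err)
	}
}
```

The strict reader rejects the record at the first quote that is neither doubled nor followed by a delimiter. Had the producer doubled the inner quotes (`""foo""`), the same reader would accept the line.
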
bruth commented 9 years ago

@aaron0browne Thanks. Could you survey the sites to see how they are generating the CSV files? Using a well-tested library would alleviate most of these issues.

gracebrownecodes commented 9 years ago

Survey created in PEDSnet/Data_Models#152

bruth commented 9 years ago

Can either of you email me a slice of one or more bad files?

gracebrownecodes commented 9 years ago

I dropped a 40000 line slice from the beginning of a drug_exposure file into your home dir on resrhpcori02. It probably won't have all the problems, but I know it has some of them.

bruth commented 9 years ago

Great, I appreciate it.

On Aug 7, 2015, at 2:13 PM, Aaron Browne notifications@github.com wrote:

> I dropped a 40000 line slice from the beginning of a drug_exposure file into your home dir on resrhpcori02. It probably won't have all the problems, but I know it has some of them.

bruth commented 9 years ago

I pushed the initial commit. Here is the first set of errors: https://github.com/chop-dbhi/data-models-validator/blob/master/errors.go

bruth commented 9 years ago

Each error is assigned a code in a code series corresponding to the type of error. Codes also provide a shorthand way of referencing them. However, I thought about using an alpha prefix rather than a numerical range, such as E00 for the UTF-8 encoding error and P00 for a parse error of the header.

Correspondingly, here is the current set of validators implemented: https://github.com/chop-dbhi/data-models-validator/blob/master/validators.go
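
The alpha-prefix idea mentioned above could look roughly like the sketch below. The `Code` type and the variable names are hypothetical and do not reflect what errors.go actually contains; only the two codes E00 and P00 come from the comment.

```go
package main

import "fmt"

// Code is a hypothetical error code: a letter naming the series and a
// number within that series.
type Code struct {
	Series byte
	Number int
}

// String renders the code in its shorthand form, for example E00.
func (c Code) String() string {
	return fmt.Sprintf("%c%02d", c.Series, c.Number)
}

// The two codes named in the comment above.
var (
	EncodingUTF8 = Code{Series: 'E', Number: 0} // UTF-8 encoding error
	ParseHeader  = Code{Series: 'P', Number: 0} // parse error of the header
)

func main() {
	fmt.Println(EncodingUTF8, ParseHeader)
}
```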

bruth commented 8 years ago

Considering this done until new errors are thought of.