Closed bruth closed 8 years ago
For values that have a fixed set of choices, another error can be introduced to ensure the value matches one of the choices. Unfortunately, this information is not available in the data model definition at this time. Would it be worth the time investment to add this information?
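A choice check could be sketched in Go roughly as follows. Since the data model definition does not expose choices yet, the `ErrInvalidChoice` name, the `checkChoice` helper, and the example choice set are all hypothetical and not part of the validator.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidChoice is a hypothetical error name; it is not in the validator.
var ErrInvalidChoice = errors.New("value is not one of the allowed choices")

// checkChoice reports ErrInvalidChoice when value is not in the allowed set.
// An empty value is skipped here and left to a separate required-value check.
func checkChoice(value string, choices map[string]struct{}) error {
	if value == "" {
		return nil
	}
	if _, ok := choices[value]; !ok {
		return ErrInvalidChoice
	}
	return nil
}

func main() {
	// Illustrative choices only; real ones would come from the data model.
	units := map[string]struct{}{"mL": {}, "mg": {}}
	fmt.Println(checkChoice("mL", units))    // <nil>
	fmt.Println(checkChoice("liter", units)) // value is not one of the allowed choices
}
```

A set lookup keeps the check cheap per value, so the main cost of the feature would be getting the choices into the data model definition in the first place.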
The choices feature ... so attractive. Maybe it should be the first thing implemented after the validator is working, since we don't do choice-checking now.
What file types will be supported? Presumably the PEDSnet flavor of CSV, initially?
Re: proposed errors, aside from the field value errors you mention, the errors that we have seen in loading the PEDSnet CSV files include:
What file types will be supported? Presumably the PEDSnet flavor of CSV, initially?
Yes the PEDSnet flavor of CSV is the first format.
Interesting that you run up against formatting errors. I would expect/hope that whatever tool the sites are using to generate the files would handle these things. Have the sites been surveyed about which tools they use to generate the CSV files?
I don't think we've surveyed the sites, but it's pretty clear that some of them are using hand-rolled CSV export code. Also, since there are various flavors of CSV (which is one reason it's kind of a crappy format), even if the sites use robust export tools, they could wind up sending a conditionally quoted CSV file, or an unquoted CSV file, or one with a different placeholder for NULL, etc.
[...] but it's pretty clear that some of them are using hand-rolled CSV export code.
Well that is terrible.
since there are various flavors of CSV (which is one reason it's kind of a crappy format)
I definitely agree. Do you think that, given the difficulties with producing and consuming sane CSV files, the sites would be willing to try a different format? However, if sites are hand-rolling a CSV writer, I fear they would try to hand-roll other formats as well.
I started a PR (https://github.com/chop-dbhi/data-models/pull/84) to document examples of generating CSV files in common languages and environments. We can point sites to this so they don't attempt to hand-write their own.
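As an illustration of what a library-based writer buys you (this is a sketch, not the content of that PR), Go's standard `encoding/csv` writer doubles embedded quote marks on its own. The `encodeRow` helper and the sample record are made up for the example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// encodeRow serializes one record with the standard library writer, which
// quotes fields that need it and doubles any embedded quote marks.
func encodeRow(fields []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(fields); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	row, err := encodeRow([]string{"1", `We did "foo", instead of "bar"`, "mL"})
	if err != nil {
		panic(err)
	}
	fmt.Print(row) // 1,"We did ""foo"", instead of ""bar""",mL
}
```

Note that `encoding/csv` quotes a field only when it has to, so a flavor that requires every field to be quoted would need extra handling on top of this.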
I have definitely run into encoding issues. A subtle version of type mismatch is "" in a non-string field, which is essentially a confusion of NULL and empty string. Another is the time in a datetime field phenomenon. The non-escaped quote mark is probably the most common one I see, and it's deceptively difficult to parse. For example, try parsing the string value below:
"1","3","2015-08-12","",,"We did "foo", instead of "bar"","mL"
@aaron0browne Thanks. Could you survey the sites to see how they are generating the CSV files? Using a well-tested library would alleviate most of these issues.
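To make the unescaped-quote problem concrete, here is a minimal sketch that feeds the example line above to Go's standard `encoding/csv` reader in both strict and lazy-quote modes. The `parse` helper is hypothetical and only wraps the library calls.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse reads a single CSV record; lazy toggles Reader.LazyQuotes.
func parse(line string, lazy bool) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.LazyQuotes = lazy
	return r.Read()
}

func main() {
	line := `"1","3","2015-08-12","",,"We did "foo", instead of "bar"","mL"`

	// Strict mode rejects the bare quote inside the quoted field.
	_, err := parse(line, false)
	fmt.Println("strict:", err)

	// Lazy mode accepts the line but ends the comment field at `foo",`,
	// so the record comes back with 8 fields rather than the intended 7.
	rec, err := parse(line, true)
	fmt.Println("lazy:", len(rec), err)
}
```

Neither mode recovers the intended record, which is why the fix has to happen where the file is written rather than where it is read.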
Survey created in PEDSnet/Data_Models#152
Can either of you email me a slice of one or more bad files?
I dropped a 40000 line slice from the beginning of a drug_exposure file into your home dir on resrhpcori02. It probably won't have all the problems, but I know it has some of them.
Great, I appreciate it.
I pushed the initial commit. Here is the first set of errors: https://github.com/chop-dbhi/data-models-validator/blob/master/errors.go
Each error is assigned a code in a code series corresponding to the type of error. Codes also provide a shorthand way of referencing them. However, I thought about using an alpha prefix rather than a numerical range, such as E00 for the UTF-8 encoding error and P00 for a parse error of the header.
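The alpha-prefix idea could look something like the sketch below. The `Code` type and the two variable names are assumptions made for illustration; they are not taken from errors.go.

```go
package main

import "fmt"

// Code pairs an alpha prefix for the error category with a number within it.
// This type is hypothetical and not the representation used in errors.go.
type Code struct {
	Prefix byte
	Number int
}

// String renders the code in the E00 / P00 style mentioned above.
func (c Code) String() string {
	return fmt.Sprintf("%c%02d", c.Prefix, c.Number)
}

var (
	CodeEncodingUTF8 = Code{Prefix: 'E', Number: 0} // UTF-8 encoding error
	CodeHeaderParse  = Code{Prefix: 'P', Number: 0} // parse error of the header
)

func main() {
	fmt.Println(CodeEncodingUTF8, CodeHeaderParse) // E00 P00
}
```

Compared with a numerical range, the prefix makes the category visible without having to remember which range belongs to which kind of error.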
Correspondingly, here is the current set of validators implemented: https://github.com/chop-dbhi/data-models-validator/blob/master/validators.go
Considering this done until new errors are thought of.
The validator's job is to detect and aggregate errors for an input file. Each reported error will include the number of the line that contains it.
Proposed errors:
ErrTypeMismatch - The value does not match the schema type, such as foo for a number type.
ErrRequiredValue - The value is required, but is empty.
ErrPrecisionExceeded - The precision of a numeric value is greater than what is allowed.
ErrLengthExceeded - The length of value (string or number) is greater than what is allowed.
ErrScaleExceeded - The scale of a numeric value is greater than what is allowed.
/cc @chop-dbhi/data-models Please add to this based on your experience and suggest alternate messages that can be reported.
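A rough sketch of how the proposed errors might be checked for a single value is below. The error names come from the list above; the messages, the `Field` and `LineError` types, and the `check` function are assumptions made for illustration. Treating a zero limit as "not checked" is a simplification, since a scale of 0 is a legitimate limit in a real schema.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

// Error names follow the proposal; the messages are placeholders.
var (
	ErrTypeMismatch      = errors.New("value does not match the schema type")
	ErrRequiredValue     = errors.New("value is required, but is empty")
	ErrPrecisionExceeded = errors.New("precision is greater than allowed")
	ErrLengthExceeded    = errors.New("length is greater than allowed")
	ErrScaleExceeded     = errors.New("scale is greater than allowed")
)

// Field is a hypothetical schema entry. A zero limit means "not checked".
type Field struct {
	Numeric   bool
	Required  bool
	Length    int
	Precision int
	Scale     int
}

// LineError ties an error to the line of the input file it was found on.
type LineError struct {
	Line int
	Err  error
}

// check returns every proposed error that applies to one value.
func check(f Field, line int, v string) []LineError {
	var out []LineError
	add := func(err error) { out = append(out, LineError{Line: line, Err: err}) }
	if v == "" {
		if f.Required {
			add(ErrRequiredValue)
		}
		return out
	}
	if f.Length > 0 && utf8.RuneCountInString(v) > f.Length {
		add(ErrLengthExceeded)
	}
	if !f.Numeric {
		return out
	}
	if _, err := strconv.ParseFloat(v, 64); err != nil {
		add(ErrTypeMismatch)
		return out
	}
	// Assumes plain decimal notation such as -123.45 (no exponent).
	whole, frac, _ := strings.Cut(strings.TrimLeft(v, "+-"), ".")
	if f.Precision > 0 && len(whole)+len(frac) > f.Precision {
		add(ErrPrecisionExceeded)
	}
	if f.Scale > 0 && len(frac) > f.Scale {
		add(ErrScaleExceeded)
	}
	return out
}

func main() {
	f := Field{Numeric: true, Precision: 4, Scale: 1}
	for _, e := range check(f, 12, "123.45") {
		fmt.Printf("line %d: %v\n", e.Line, e.Err)
	}
}
```

Returning a slice rather than the first error matches the goal of aggregating everything wrong with a file in one pass.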