digital-preservation / csv-schema

CSV Schema
http://digital-preservation.github.io/csv-schema
Mozilla Public License 2.0

Question about elaboration tolerance and ordering #29

Open rpgoldman opened 4 years ago

rpgoldman commented 4 years ago
  1. Maybe I missed it, but I couldn't tell if the columns in a CSV file one is checking must come in the same order as they are listed in the body of a CSV schema.
  2. Assuming that the prolog does not specify the column count, is it acceptable to have additional columns that do not match a column entry in the body, and have them just be unchecked?

I am interested in using the validator for some scientific data where there is a known set of columns that should be checked for reasonable contents, but where I'm not sure that the ordering of columns will be consistent, and where some data providers might have added additional columns of computed values to the raw values that my schema should check.

Thank you

DavidUnderdown commented 4 years ago

The ordering must match: the column definitions in the schema have to appear in the same order as the columns in the CSV file. The totalColumns directive lets the validator check, when the schema is parsed, that the schema contains the expected number of column definitions. If you do not specify it, you will still get a validation error once the CSV file is actually read if the number of column definitions does not match the number of columns in the file.
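To illustrate the behaviour described above, here is a rough Python sketch of order-dependent checking. It is not the validator's actual code: `validate_positional`, `is_float`, and `RULES` are hypothetical names made up for this example, and the two rules stand in for real column definitions.

```python
import csv
import io


def is_float(text):
    """Hypothetical stand-in rule: True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical column definitions, listed in the order the columns must appear.
RULES = [("id", str.isdigit), ("temp", is_float)]


def validate_positional(csv_text, rules):
    """Apply the rule at index i to the cell at index i of every data row."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    # Line numbers are approximate if quoted fields span several lines.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(rules):
            # Column count mismatch is reported only once the file is read.
            errors.append(
                f"line {line_no}: expected {len(rules)} columns, found {len(row)}"
            )
            continue
        for (name, predicate), cell in zip(rules, row):
            if not predicate(cell):
                errors.append(f"line {line_no}: {name} rule failed for {cell!r}")
    return errors
```

With this sketch, a file whose columns are swapped, or one that carries an extra column, is reported as invalid even if every individual value would be acceptable under the right rule.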

There are some similar issues already, #21 and #13, but I'm afraid we've not had resource available to work on further developments recently, though we would welcome pull requests from others.

rpgoldman commented 4 years ago

Thanks for the response.

I suggested making the order optional because CSVs are often read by tools like Python's pandas, in which the columns are addressable by name, so column ordering is not required for correct operation.

And I mentioned in my original comments that for scientific data there are often additional columns of derived quantities added that don't interfere with correct (assuming name-based addressing) processing of the data.
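To make the idea concrete, here is a minimal sketch of what name-based checking could look like, using Python's standard `csv.DictReader` rather than pandas. The names `validate_by_name`, `is_float`, and `NAME_RULES` are hypothetical and only for illustration; this is not a proposal for how the validator should implement it.

```python
import csv
import io


def is_float(text):
    """Hypothetical stand-in rule: True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical rules keyed by column name instead of by position.
NAME_RULES = {"id": str.isdigit, "temp": is_float}


def validate_by_name(csv_text, rules):
    """Check only the named columns; order is irrelevant, extras are ignored."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in rules if name not in header]
    if missing:
        return [f"missing column: {name}" for name in missing]
    errors = []
    # Line numbers are approximate if quoted fields span several lines.
    for line_no, row in enumerate(reader, start=2):
        for name, predicate in rules.items():
            value = row[name]  # None when the row is shorter than the header
            if value is None or not predicate(value):
                errors.append(f"line {line_no}: {name} rule failed for {value!r}")
    return errors
```

Here a file with the header `temp,derived,id` passes as long as the `id` and `temp` values are acceptable, while a file lacking one of the known columns is rejected.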

I imagine that these additional features could add substantially to the difficulty of validation, though.

Maybe this should be tagged as "question-edging-into-enhancement-request"!