Data-Liberation-Front / csvlint.rb

The gem behind http://csvlint.io
MIT License
287 stars 88 forks source link

Emit a warning when the CSV header does not match the supplied schema #127

Closed adamc00 closed 9 years ago

adamc00 commented 9 years ago

This feature fixes #126. Happy to make changes if requested. Ruby is not my first language.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.03%) to 96.45% when pulling 5048ccac47d9baaef762bb19be3179a6f0104e10 on strategicdata:validate-header-size into 182b35f961b1d368f889c6f18cba3f8e00b45ae9 on theodi:master.

Floppy commented 9 years ago

Thanks for the PR! Just wondering - aside from the fact that it causes a crash (which is bad), should this sort of error not show just as missing/extra columns, rather than a specific count error? Or is that useful anyway?

Floppy commented 9 years ago

@ldodds @pezholio any thoughts on that?

ldodds commented 9 years ago

I think it'd be useful to have a single warning of the names of the extra headers (columns), to at least indicate they're being ignored as part of the schema validation.

I don't think it should be an error as an application consuming the data should just ignore the extra columns if they were present. (We might separately want a "strict" mode though).

adamc00 commented 9 years ago

I think showing the difference between the found and expected columns is a good idea rather than the count as I have done. I'll have a go at that instead.

@ldodds note that there may also be fewer columns than expected so it's not just about too many. Should too few be an error - since that indicates that required data is missing - whereas too many can safely be ignored with a warning?

:+1: for a strict mode BTW. For our purposes I have elevated nearly all warnings to errors. The only things we treat as true warnings are those regarding the missing extra metadata headers regarding the CSV file's attributes.

ldodds commented 9 years ago

Doesn't the missing_column warning cover there being fewer items already?

It's worth checking on the trigger for extra_column too which is in the same territory.

adamc00 commented 9 years ago

@idodds don't the missing_column and extra_column warnings apply only to the data rows, not the header row? Currently if you have too few headers you get a title_row error with little other information in it. If in addition to having too few header names you also have mismatched header names (compared to the supplied schema) you get a header_name error for each instance.

It seems to me that when you supply a schema the title_row error probably isn't appropriate as this is incorrectly guessing that the supplied CSV file contains a title rather then column headings in the first row. Perhaps this should be suppressed in the case when validating against a schema and instead raise a new 'malformed_header' error containing the array of header names found in the content attribute of the error/warning and the expected schema supplied header names in the constraints attribute? This could also replace the variable number of header_name errors emitted as they would now be redundant.

adamc00 commented 9 years ago

I have reworked this PR to simply emit a malformed_header schema warning when the supplied header does not match the schema. The header that was found is reported in the content attribute and the schema header is reported in the constraints attribute.

adamc00 commented 9 years ago

Could I get some feedback on this PR if it isn't acceptable? I may have handled fixing #126 the wrong way but it still exists. This patch to your master spec/schema_spec.rb demonstrates the bug. Thanks!

Floppy commented 9 years ago

We're taking a look today - thanks for the work :)

pezholio commented 9 years ago

Yup, this is great - thanks so much :+1:

pezholio commented 9 years ago

We've got another change coming up the pipe, and we'll put add this to 1.0.2 later today

adamc00 commented 9 years ago

That's great! Thanks so much for your work.