Clever / csvlint

library and command line tool that validates a CSV file

Apache License 2.0

Problem with quotes in tsv files #40

Open boydkelly opened 2 months ago

boydkelly commented 2 months ago

When linting tsv files, I get:

$ csvlint -delimiter='\t' build/neo_ex.tsv 
Warning: not using defaults, may not validate CSV to RFC 4180
Record #1035 has error: bare " in non-quoted-field

unable to parse any further

Record 1035 is shown below. Since this is TSV (chosen for this very reason), shouldn't quotes be ignored entirely rather than reported as an error?

9010c36f-6958-48d9-ba2d-c50f65c8825d    dondon ko "ken ken kileri kɛ".  dyu exm dyuEx

kmatt commented 2 months ago

Parsing and detecting errors in this utility is handled by https://pkg.go.dev/encoding/csv#Reader

It seems to complain if a quote appears anywhere other than as the first or last character of a field.

In your sample text, is the double-quoted field delimited directly by tabs, as in dondon ko\t"ken ken kileri kɛ".\tdyu ?
Or is there whitespace before the leading quote, as in dondon ko\t "ken ken kileri kɛ".\tdyu ?

Only the second case throws the error for me.

boydkelly commented 2 months ago

It certainly could be the second case. Since this is foreign-language prose and not 'clean' text, my expectation is that once the file is declared tab-delimited, it should not matter whether or where a quote occurs. So your second example should lint 'properly' as the following fields (shown with \t replaced by a line feed):

dondon "ken ken kileri kɛ". dyu

So it looks like the bug is with csv#Reader?

I'm really just checking that the number of columns is accurate, and for now awk will do the job. But it would be great to see tsv handled correctly here.
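
The column-count check described above could also be sketched in Go by splitting on tabs only and treating quotes as ordinary characters. This is an illustrative sketch, not part of csvlint; the function name checkColumns and the sample input are hypothetical, and it assumes the first line defines the expected column count.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// checkColumns (hypothetical helper) returns the 1-based numbers of lines
// whose tab-separated field count differs from that of the first line.
// Quotes get no special treatment.
func checkColumns(input string) []int {
	var bad []int
	want := -1
	sc := bufio.NewScanner(strings.NewReader(input))
	for n := 1; sc.Scan(); n++ {
		// Field count is the number of tabs plus one.
		got := strings.Count(sc.Text(), "\t") + 1
		if want < 0 {
			want = got // the first line sets the expected count
		}
		if got != want {
			bad = append(bad, n)
		}
	}
	return bad
}

func main() {
	// Line 2 contains a stray quote but has the right column count;
	// line 3 is missing a column.
	input := "id\ttext\n1\t \"quoted\".\n2\n"
	fmt.Println(checkColumns(input))
}
```

Note that bufio.Scanner has a default maximum line length, so very long records would need a larger buffer.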

kmatt commented 2 months ago

"So it looks like the bug is with csv#Reader?"

I'm not certain whether it's a bug or not, because the Reader docs are not explicit on tab-delimited data.

-lazyquotes may be an option in this case.
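
As a rough illustration of what that option relies on, the sketch below sets LazyQuotes on encoding/csv's Reader, which the Reader docs describe as allowing a quote to appear in an unquoted field. The helper name parseLazyTSV and the input line are hypothetical, and whether csvlint wires its flag to this field in exactly this way is an assumption.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLazyTSV (hypothetical helper) reads tab-delimited input with
// LazyQuotes enabled, so a quote inside an unquoted field is kept as text.
func parseLazyTSV(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '\t'
	r.LazyQuotes = true
	return r.ReadAll()
}

func main() {
	// Assumed input: the third field starts with a space, then a quote.
	input := "id\tdondon ko\t \"ken ken kileri kɛ\".\tdyu\n"
	records, err := parseLazyTSV(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Show the field count and the raw third field.
	fmt.Println(len(records[0]))
	fmt.Printf("%q\n", records[0][2])
}
```

LazyQuotes relaxes quote handling but does not switch it off: a field that begins with a quote is still parsed as a quoted field, so this may not behave as a pure tab split on every input.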

boydkelly commented 2 months ago

I'll just use awk. The whole point of tab delimiters is to avoid the numerous problems of quote delimiters. In a tab delimited file quotes should not be considered as anything but another string character. I guess csv#Reader is true to its name, comma separated. It does not understand tabs correctly.