datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0
600 stars 181 forks source link

Make the default CSV quote character NOT_A_CHAR if autodetection doesn't find quoted fields #1202

Open LosD opened 8 years ago

LosD commented 8 years ago

The autodetection currently assumes the double quote character (") for quoting CSV files, even if no double quoted fields has been found... This is especially an issue for product data, as there is a good chance that i.e. lengths has inches represented by double quote.

It seems better to assume no quote character if none is found.

ClaudiaPHI commented 8 years ago

The Metamodel issue https://issues.apache.org/jira/browse/METAMODEL-237 has been filled connected with this issue.

kaspersorensen commented 8 years ago

Since it seems that it's a pretty non-trivial fix in MetaModel, maybe we should consider some small tricks that would alleviate the issue in DC...

I'm thinking that if a file has a quote character, although it's not required, it will usually be quite frequent. So one thing we could consider doing is checking that the number of quotes that we identify is not only greater than number of alternative quote chars, but also above some "significance threshold".

Since we work with varying size files and samples, this threshold should be relative to the amount of identified delimiters I think.

In a very normalized file, the amount of quote chars would be 2x the amount of delimiters. But many files only have quotations when needed or for specific fields. This makes it difficult, and we would be guesstimating I think ... So bring your guesses and gut feelings.

My gut feeling would be to have a significance threshold close to half of the number of delimiters:

int quotationCharSignificanceThreshold = (int) (delimiterCount * 0.5)
LosD commented 8 years ago

Hmmm... While I guess anything is better than nothing, I seem to recall that I often have seen it where only a a few columns has it, like a description and name fields, simply because other fields are ids, states or numbers.

However, I have seen quite a but of JSON lately, where that is more probable, so that may just be me being colored by irrelevant experience :)

kaspersorensen commented 8 years ago

That's also my experience (that it's pretty common to have CSV with only quotes on some fields). Which is why I absolutely call this an alleviation and a gut feeling.