alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

consistency.get_best_set has unexpected results #5

Closed Andrew-Sheridan closed 4 years ago

Andrew-Sheridan commented 4 years ago

Summary

In short, if one of the scores has a Q = nan, then the max score could be nan, which is weird.

Details

I was trying to get the dialect for a CSV file and was getting dialects = None. I dug around through the code and found two functions which may be part of the issue: get_best_set and consistency_scores.

I got some dialects using clevercsv.potential_dialects.get_dialects, and then some scores using clevercsv.consistency.consistency_scores

I had a set of scores that looked like this:

{SimpleDialect(',', '', ''): {'Q': 52.58875739644971,
  'pattern': 61.53846153846154,
  'type': 0.8545673076923077},
 SimpleDialect(',', '', '/'): {'Q': nan,
  'pattern': 30.76846153846154,
  'type': nan},
 SimpleDialect('', '', ''): {'Q': nan, 'pattern': 0.064, 'type': nan},
...

There were many other scores; I have just grabbed the first three here.

I would expect the first dialect to be the "best" one, but that is not the output :(

Passing those scores to get_best_set returns an empty set.

get_best_set currently looks like this:

def get_best_set(scores):
    H = set()
    Qmax = max((score["Q"] for score in scores.values()))
    H = set([d for d, score in scores.items() if score["Q"] == Qmax])
    return H

It just picks out the items with the best Q score. The line Qmax = max((score["Q"] for score in scores.values())) depends on the builtin max function, which produces unexpected results when the values it is comparing include nan. See: https://stackoverflow.com/questions/4237914/python-max-min-builtin-functions-depend-on-parameter-order

Because of that, Qmax could equal nan, and then the output of get_best_set will be the empty set. (It will not return the other entries that do have Q = nan either, because float("nan") == float("nan") is False...)
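To make this concrete, here is a quick illustration of the max behaviour described above (toy values, not scores from my actual file):

```python
nan = float("nan")

# max() keeps whichever candidate wins pairwise comparisons, and any
# comparison involving nan is False, so nan can "win" by never being displaced:
print(max([nan, 1.0, 2.0]))  # nan -- nan comes first and is never replaced
print(max([1.0, 2.0, nan]))  # 2.0 -- nan never beats the running maximum

# And nan is not equal to itself, so a comprehension filtering on == Qmax
# matches nothing when Qmax is nan:
print(nan == nan)  # False
```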

Why are there nan values?

consistency_scores has nan as default values.

def consistency_scores(data, dialects, skip=True, logger=print):
    scores = {}

    Qmax = -float("inf")
    for dialect in sorted(dialects):
        P = pattern_score(data, dialect)
        if P < Qmax and skip:
            scores[dialect] = {
                "pattern": P,
                "type": float("nan"),
                "Q": float("nan"),
            }
            logger("%15r:\tP = %15.6f\tskip." % (dialect, P))
            continue
        T = type_score(data, dialect)
        Q = P * T
        Qmax = max(Q, Qmax)
        scores[dialect] = {"pattern": P, "type": T, "Q": Q}
        logger(
            "%15r:\tP = %15.6f\tT = %15.6f\tQ = %15.6f" % (dialect, P, T, Q)
        )
    return scores

The defaults are set here:

            scores[dialect] = {
                "pattern": P,
                "type": float("nan"),
                "Q": float("nan"),
            }

So if scores has even one entry with a nan, then nan could be the result... surely that is not correct.

I think that the defaults should be None, and then checks for if Q is None etc could be made elsewhere.
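For example, a None-based version of get_best_set could look like this (just a sketch of the idea, not a tested patch):

```python
def get_best_set(scores):
    # Ignore skipped dialects, which are marked with Q = None
    valid = {d: s["Q"] for d, s in scores.items() if s["Q"] is not None}
    if not valid:
        return set()
    Qmax = max(valid.values())
    return {d for d, q in valid.items() if q == Qmax}
```

With this version, the scores in my example above would give back the ',' dialect instead of an empty set.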

Thoughts?

Full set of scores:

SimpleDialect('_', '', '') {'Q': nan, 'type': nan, 'pattern': 0.48983333333333334}
SimpleDialect(':', '', '') {'Q': nan, 'type': nan, 'pattern': 0.2815}
SimpleDialect('-', '', '') {'Q': nan, 'type': nan, 'pattern': 6.31279292929293}
SimpleDialect('#', '', '') {'Q': nan, 'type': nan, 'pattern': 4.823358974358975}
SimpleDialect(' ', '', '') {'Q': nan, 'type': nan, 'pattern': 4.8963066685493155}
SimpleDialect('&', '', '') {'Q': nan, 'type': nan, 'pattern': 4.3835}
SimpleDialect('!', '', '') {'Q': nan, 'type': nan, 'pattern': 3.525}
SimpleDialect('', '', '') {'Q': nan, 'type': nan, 'pattern': 0.064}
SimpleDialect('*', '', '') {'Q': nan, 'type': nan, 'pattern': 0.48604545454545456}
SimpleDialect('¨', '', '') {'Q': nan, 'type': nan, 'pattern': 3.514666666666667}
SimpleDialect('?', '', '') {'Q': nan, 'type': nan, 'pattern': 4.627111111111111}
SimpleDialect('*', '', '/') {'Q': nan, 'type': nan, 'pattern': 0.48150000000000004}
SimpleDialect(',', '', '/') {'Q': nan, 'type': nan, 'pattern': 30.76846153846154}
SimpleDialect('+', '', '') {'Q': nan, 'type': nan, 'pattern': 0.2815}
SimpleDialect(',', '', '') {'Q': 52.58875739644971, 'type': 0.8545673076923077, 'pattern': 61.53846153846154}
Andrew-Sheridan commented 4 years ago

Also, separately, get_best_set could be written like this:

def get_best_set(scores):
    Qmax = max(score["Q"] for score in scores.values())
    return {d for d, score in scores.items() if score["Q"] == Qmax}
Andrew-Sheridan commented 4 years ago

Note that if you simply change from max to np.nanmax then this is not really a problem: np.nanmax([float("nan"), 1]) == 1
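If avoiding a numpy dependency is preferable, the same effect can be had with math.isnan from the standard library (again just a sketch):

```python
import math

# Filter out nan before taking the maximum, like np.nanmax does:
values = [float("nan"), 1.0, float("nan"), 0.5]
qmax = max(v for v in values if not math.isnan(v))
print(qmax)  # 1.0
```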

Also note that I didn't really delve too deeply into this (or into the associated paper), so I'm not sure if nan is supposed to be "better" than a number.

GjjvdBurg commented 4 years ago

Thanks for your detailed bug report @Andrew-Sheridan! This kind of report makes it really easy to figure out what the problem is, and is much appreciated!

I was not aware of the unexpected behaviour of the max function, so thanks for alerting me to that. I've switched the code over to using None instead, to avoid confusion and solve this issue.

To briefly expand on why these cases occur: we detect the best dialect by computing what we call a "data consistency measure", Q, which is the product of a pattern score (P) that looks at how consistent the rows in the resulting table are, and the type score (T) that computes the ratio of cells with a known data type. Since T is between 0 and 1 we can skip dialects for which P is lower than the current maximum Q score we've seen already. These dialects get a Q score of None to mark them as 'skipped'.
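In other words, the skip is valid because T is bounded by 1, so P is an upper bound on Q. A toy illustration (made-up numbers, roughly matching the scores above):

```python
# Since 0 <= T <= 1, Q = P * T can never exceed P. If a dialect's pattern
# score P is already below the best Q seen so far, computing its type score
# T is pointless -- it cannot win.
Qmax = 52.59    # best Q so far (e.g. for the ',' dialect)
P_next = 30.77  # pattern score of the next candidate dialect

# Even with a perfect type score T = 1.0:
best_possible_Q = P_next * 1.0
print(best_possible_Q < Qmax)  # True -> safe to skip this dialect
```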

I'm preparing a fix now and will release an updated version of the package asap. Thanks again for reporting this problem!