Closed Andrew-Sheridan closed 4 years ago
Also, separately, get_best_set could be written like this:
    def get_best_set(scores):
        Qmax = max(score["Q"] for score in scores.values())
        return {d for d, score in scores.items() if score["Q"] == Qmax}
Note that if you simply change from max to np.nanmax, then this is not really a problem: np.nanmax([float("nan"), 1]) == 1.
Also note that I didn't really delve too deep into this (or into the associated paper), so I'm not sure if nan is supposed to be "better" than a number.
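If nan is meant to be ignored rather than treated as "better", the same idea can be sketched without NumPy. This is only an illustration: the function name get_best_set_ignoring_nan and the example dialect names are hypothetical, and the scores layout (a dict mapping each dialect to a dict with a "Q" key) is assumed from the snippet above.

```python
import math


def get_best_set_ignoring_nan(scores):
    """Return the dialects whose Q equals the largest non-nan Q.

    Hypothetical helper: entries whose Q is nan are dropped before
    taking the maximum, which mirrors what np.nanmax does.
    """
    valid = {d: s["Q"] for d, s in scores.items() if not math.isnan(s["Q"])}
    if not valid:
        # Every Q was nan (or scores was empty): nothing to choose from.
        return set()
    q_max = max(valid.values())
    return {d for d, q in valid.items() if q == q_max}


# Made-up scores, purely for illustration.
example = {
    "dialect_a": {"Q": float("nan")},
    "dialect_b": {"Q": 0.5},
    "dialect_c": {"Q": 0.5},
    "dialect_d": {"Q": 0.1},
}
best = get_best_set_ignoring_nan(example)
```

Filtering before calling max means the result no longer depends on where the nan entries happen to sit in the dict.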
Thanks for your detailed bug report @Andrew-Sheridan! This kind of report makes it really easy to figure out what the problem is, so is much appreciated!
I was not aware of the unexpected behaviour of the max function, so thanks for alerting me to that. I've switched the code over to using None instead, to avoid confusion and solve this issue.
To briefly expand on why these cases occur: we detect the best dialect by computing what we call a "data consistency measure", Q, which is the product of a pattern score (P) that looks at how consistent the rows in the resulting table are, and the type score (T) that computes the ratio of cells with a known data type. Since T is between 0 and 1 we can skip dialects for which P is lower than the current maximum Q score we've seen already. These dialects get a Q score of None to mark them as 'skipped'.
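The skipping rule described here can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the package's actual code: score_dialects, best_dialects, the pattern_score and type_score callables, and the numbers in the example are all hypothetical.

```python
def score_dialects(dialects, pattern_score, type_score):
    """Compute Q = P * T per dialect, skipping dialects that cannot win.

    Hypothetical sketch: pattern_score and type_score are assumed to be
    callables returning values in [0, 1].
    """
    scores = {}
    q_max = -float("inf")
    for dialect in dialects:
        p = pattern_score(dialect)
        if p < q_max:
            # T <= 1 implies Q = P * T <= P < q_max, so this dialect
            # cannot be the best one. Mark it as skipped with Q = None.
            scores[dialect] = {"P": p, "T": None, "Q": None}
            continue
        t = type_score(dialect)
        q = p * t
        scores[dialect] = {"P": p, "T": t, "Q": q}
        q_max = max(q_max, q)
    return scores


def best_dialects(scores):
    """Return the dialects with the highest Q, ignoring skipped ones."""
    computed = {d: s["Q"] for d, s in scores.items() if s["Q"] is not None}
    if not computed:
        return set()
    q_max = max(computed.values())
    return {d for d, q in computed.items() if q == q_max}


# Made-up pattern and type scores, purely for illustration.
P = {"comma": 0.9, "semicolon": 0.3, "tab": 0.8}
T = {"comma": 0.8, "semicolon": 1.0, "tab": 0.5}
scores = score_dialects(["comma", "semicolon", "tab"], P.get, T.get)
best = best_dialects(scores)
```

In this example "semicolon" is skipped because its P (0.3) is already below the Q of "comma", while "tab" still has to be scored because its P is not below the current maximum. Using None for skipped entries keeps them out of the comparison entirely, since they are filtered with an explicit `is not None` check instead of being passed to max.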
I'm preparing a fix now and will release an updated version of the package asap. Thanks again for reporting this problem!
Summary
In short, if one of the scores has Q = nan, then the max score could be nan, which is weird.
Details
I was trying to get the dialect for a CSV file and was getting dialects = None. I dug around through the code and found two functions which may be part of the issue: get_best_set and consistency_scores.
I got some dialects using clevercsv.potential_dialects.get_dialects, and then some scores using clevercsv.consistency.consistency_scores.
I had a set of scores that looked like this:
There were many other scores; I have just grabbed the first three here.
I would expect the first dialect to be the "best" one, but that is not the output :(
Passing those scores in to get_best_set returns an empty set.
get_best_set currently looks like this:
It just picks out the item which has the best Q score. The line Qmax = max((score["Q"] for score in scores.values())) depends on the builtin max function. That function produces unexpected results when some of the values it is checking include nan. See: https://stackoverflow.com/questions/4237914/python-max-min-builtin-functions-depend-on-parameter-order
Because of that, Qmax could equal nan, and then the output of get_best_set will be the empty set. (It will not return the other entries that do have Q = nan, because float("nan") == float("nan") is False...)
Why are there nan values?
consistency_scores has nan as default values. The defaults are set here:
So if scores has one score with a nan, then nan could be the result... surely that is not correct.
I think that the defaults should be None, and then checks for if Q is None etc. could be made elsewhere.
Thoughts?
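For reference, the order dependence of max and the resulting empty set can be reproduced with a few lines of plain Python. The dictionary below is made up for illustration and is not the set of scores from this report.

```python
import math

nan = float("nan")

# Every comparison with nan is False, so max keeps whatever it saw first
# unless a later value compares strictly greater.
nan_first = max([nan, 1.0, 2.0])  # nan is never replaced
nan_last = max([1.0, 2.0, nan])   # nan never replaces 2.0

# Made-up scores: the nan entry comes first, so it "wins" the max.
q_values = {"dialect_a": nan, "dialect_b": 0.5, "dialect_c": 0.1}
q_max = max(q_values.values())

# nan == nan is False, so no entry matches and the set is empty.
best = {d for d, q in q_values.items() if q == q_max}
```

Here nan_first is nan while nan_last is 2.0, and best ends up empty even though two dialects have perfectly valid Q values.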
Full set of scores: