Closed ablaom closed 5 years ago
Maybe auto_scitypes or auto_types?
As a side note, I think it would be nice if this were called by default when a user loads data from a file using a standard command (e.g. something that wraps CSV.jl):
X = readcsv("path/to/file.csv") # default = auto_types
X = readcsv("path/to/file.csv"; auto_types=false) # if the user wants to do everything themselves
X = readcsv("path/to/file.csv"; types=(...)) # pre-specification and then no auto-types needed
Yes, something like that sounds good. The base method itself, to be implemented here, would be table-agnostic, i.e. not restricted to DataFrames.
Here's a possible draft (I can't dev Scitypes due to the ongoing release):
function suggest_scitype(type, col, nrows)
    # TODO: deal with missing/union (?)
    unique_vals  = unique(col)
    nunique_vals = length(unique_vals)
    # Heuristics for when the number of unique values seems "pretty low"
    # compared to the number of rows.
    # Heuristic 1:
    #   more than 15 rows and ≤ 3 unique values ==> Multiclass
    # Heuristic 2:
    #   fewer unique values than both 10% of the number of rows and 100
    #   (so the cap of 100 applies once there are more than 1_000 rows)
    #     => if a number, return OrderedFactor
    #     => if something else, return Multiclass
    # Finally, if there are many unique values, just use the standard heuristic.
    if nunique_vals ≤ 3 && nrows > 15
        # very few distinct values: most likely a binary (or ternary) class label
        return Multiclass{nunique_vals}
    elseif nunique_vals < min(0.1*nrows, 100)
        type <: Real && return OrderedFactor{nunique_vals}
        return Multiclass{nunique_vals}
    else
        type <: AbstractFloat && return Continuous
        type <: Integer && return Count
        # XXX to complete
        return Unknown
    end
end
function auto_types(X)
    sch = schema(X)
    # the values stored are scitypes themselves (types, not instances),
    # hence Type{<:Found} rather than Found
    suggested_types = Dict{Symbol,Type{<:Found}}()
    for (name, type, col) in zip(sch.names, sch.types, eachcolumn(X))
        suggested_types[name] = suggest_scitype(type, col, sch.nrows)
    end
    return suggested_types
end
Let me know what you think.
Edit: this would live as part of a convention I guess.
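To make the heuristics in the draft concrete, here is a minimal sketch that runs with Base Julia only. The scitype definitions are stand-in placeholders (assumptions, not the definitions from the package), the decision rule is restated compactly from the draft, and the table `X` is made-up data.

```julia
# Stand-in scitypes so the sketch runs without the package
# (hypothetical placeholders, not the package's own definitions).
abstract type Found end
struct Unknown <: Found end
struct Continuous <: Found end
struct Count <: Found end
struct Multiclass{N} <: Found end
struct OrderedFactor{N} <: Found end

# Same decision rule as the draft above, restated compactly.
function suggest_scitype(type, col, nrows)
    n = length(unique(col))
    if n ≤ 3 && nrows > 15
        return Multiclass{n}
    elseif n < min(0.1 * nrows, 100)
        return type <: Real ? OrderedFactor{n} : Multiclass{n}
    else
        type <: AbstractFloat && return Continuous
        type <: Integer && return Count
        return Unknown
    end
end

# A column table as a NamedTuple of vectors (made-up data, 200 rows).
nrows = 200
X = (
    flag  = repeat(["yes", "no"], 100),               # 2 unique values
    grade = repeat(1:5, 40),                          # 5 unique values, numeric
    city  = repeat(["a", "b", "c", "d", "e"], 40),    # 5 unique values, strings
    id    = collect(1:200),                           # all values unique, integers
    x     = collect(range(0.0, 1.0; length=200)),     # all values unique, floats
)

# Column name => suggested scitype.
suggested = Dict{Symbol,Type{<:Found}}(
    name => suggest_scitype(eltype(col), col, nrows) for (name, col) in pairs(X)
)
```

With this data, `flag` falls under the first heuristic, `grade` and `city` under the second (numeric versus non-numeric), and `id` and `x` fall through to the standard rule on the element type.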
Looks good to me. Could go in src/conventions/mlj/auto_types.jl or similar.
@tlienart Have added a dev branch and given you write access.
closed by #4
The new method would output a dictionary, changes, keyed on column names, whose values are the suggested scitypes for the columns with those names. The user could then edit the dictionary and effect the type coercions with coerce(changes, data). So, if the new method is called suggest_scitypes (better name?), the workflow would be something like:
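As an illustration only, the three steps described (get suggestions, edit the dictionary, coerce) can be sketched in Base Julia. Everything below is an assumption made for the sake of a runnable example: the stand-in scitypes, the toy `suggest_scitypes` rule, the helper `coerce_column`, and the toy `coerce` are hypothetical and are not the package's implementation. Only the argument order `coerce(changes, data)` is taken from the text.

```julia
# Stand-in scitypes (hypothetical placeholders, not the package's own).
abstract type Found end
struct Continuous <: Found end
struct Count <: Found end

# Toy stand-in for the proposed method: floats => Continuous,
# integers => Count, anything else gets no suggestion.
function suggest_scitypes(data)
    changes = Dict{Symbol,Type{<:Found}}()
    for (name, col) in pairs(data)
        eltype(col) <: AbstractFloat && (changes[name] = Continuous)
        eltype(col) <: Integer && (changes[name] = Count)
    end
    return changes
end

# Toy coercion rules (hypothetical helper): one method per scitype.
coerce_column(::Type{Continuous}, col) = float.(col)
coerce_column(::Type{Count}, col) = round.(Int, col)

# Toy coerce, using the argument order from the text: coerce(changes, data).
# Columns without an entry in `changes` are returned untouched.
function coerce(changes, data)
    cols = map(keys(data)) do name
        haskey(changes, name) ? coerce_column(changes[name], data[name]) : data[name]
    end
    return NamedTuple{keys(data)}(cols)
end

# Made-up data: a column table as a NamedTuple of vectors.
data = (age = [23, 41, 35], height = [1.8, 1.6, 1.7], name = ["a", "b", "c"])

changes = suggest_scitypes(data)   # 1. get the suggestions
changes[:age] = Continuous         # 2. the user overrides one suggestion
data2 = coerce(changes, data)      # 3. apply the coercions
```

Returning a plain dictionary keeps the suggestion step separate from the coercion step, so the user can inspect and override individual entries before anything in the data changes.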