JuliaAI / ScientificTypes.jl

An API for dispatching on the "scientific" type of data instead of the machine type
MIT License

Add method to guess the scitypes of a table #2

Closed: ablaom closed this issue 5 years ago

ablaom commented 5 years ago

The method would output a dictionary, changes, keyed on column names, whose values are the suggested scitypes for the columns with those names. The user could then edit the dictionary and effect the type coercions with coerce(changes, data).

So, if the new method is called suggest_scitypes (better name?) the workflow would be something like:

julia> X = (book=["red", "white", "blue", "blue"],
            number=[0, 1, 1, 0],
            is_broken=[0, 1, 1, 0])

julia> changes = suggest_scitypes(X)
Dict{Symbol,Any} with 3 entries:
  :book => Multiclass
  :number => OrderedFactor
  :is_broken => OrderedFactor

julia> changes[:number] = Count
julia> X_fixed = coerce(changes, X)

tlienart commented 5 years ago

Maybe auto_scitypes or auto_types?

As a side note, I think it would be nice if this were called by default when a user loads data from a file using a standard command (e.g. something that wraps around CSV.jl).

X = readcsv("path/to/file.csv") # default = auto_types
X = readcsv("path/to/file.csv"; auto_types=false) # if the user wants to do everything themselves
X = readcsv("path/to/file.csv"; types=(...)) # pre-specification and then no auto-types needed

ablaom commented 5 years ago

Yes, something like that sounds good. The base method itself, to be implemented here, would be table-agnostic, i.e. not restricted to DataFrames.
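
To make the keyword precedence discussed above concrete, here is a minimal sketch that works on any NamedTuple of vectors rather than on a DataFrame. Everything in it is an assumption made for illustration: plan_types and guess are hypothetical names, the scitype labels are plain Symbols so that the snippet runs without the package, and the guessing rule is a placeholder based on element type only.

```julia
# Placeholder guess based only on the element type (hypothetical helper;
# a real rule would also inspect the values in the column).
guess(col::AbstractVector{<:AbstractFloat})  = :Continuous
guess(col::AbstractVector{<:Integer})        = :Count
guess(col::AbstractVector{<:AbstractString}) = :Multiclass
guess(col::AbstractVector)                   = :Unknown

# `table` is any NamedTuple of equal-length vectors (a column table).
# Explicit `types` take precedence; otherwise `auto_types` decides
# whether any guesses are made at all.
function plan_types(table::NamedTuple; auto_types::Bool=true, types=nothing)
    plan = Dict{Symbol,Symbol}()
    if types !== nothing
        for (name, label) in pairs(types)
            plan[name] = label
        end
    elseif auto_types
        for (name, col) in pairs(table)
            plan[name] = guess(col)
        end
    end
    return plan
end

X = (colour=["red", "white", "blue"], n=[1, 2, 3], x=[0.5, 1.5, 2.5])

plan_types(X)                          # one guess per column
plan_types(X; auto_types=false)        # empty plan, user does everything
plan_types(X; types=(n=:Continuous,))  # only the pre-specified column
```

A file-loading wrapper along the lines of the readcsv proposal could then delegate to a function of this shape after parsing, which keeps the guessing logic independent of the table type.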

tlienart commented 5 years ago

Here's a possible draft (I can't dev ScientificTypes due to the ongoing release):

function suggest_scitype(type, col, nrows)
   # TODO: deal with missing/union (?)
   unique_vals  = unique(col)
   nunique_vals = length(unique_vals)
   # Heuristics for when the number of unique values seems "pretty low"
   # compared to the number of rows.
   # Heuristic 1:
   #     more than 15 rows and ≤ 3 unique values ==> Multiclass
   # Heuristic 2:
   #     fewer unique values than min(10% of the rows, 100), i.e. under 10%
   #     of the rows up to 1_000 rows, and under 100 unique values beyond that
   #     => if a number, return OrderedFactor
   #     => if something else, return Multiclass
   # Finally, if there are many unique values, fall back on the machine type
   if nunique_vals ≤ 3 && nrows > 15
      # very likely a binary (or other small categorical) case
      return Multiclass{nunique_vals}
   elseif nunique_vals < min(0.1*nrows, 100)
      type <: Real && return OrderedFactor{nunique_vals}
      return Multiclass{nunique_vals}
   else
      type <: AbstractFloat  && return Continuous
      type <: Integer        && return Count
      # XXX to complete
      return Unknown
   end
end

function auto_types(X)
   sch = schema(X)
   # the values are scitypes themselves (types, not instances), hence `Type`
   suggested_types = Dict{Symbol,Type}()
   for (name, type, col) in zip(sch.names, sch.types, eachcolumn(X))
      suggested_types[name] = suggest_scitype(type, col, sch.nrows)
   end
   return suggested_types
end

Let me know what you think.

Edit: this would live as part of a convention I guess.
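
For a feel of how the thresholds in the draft behave, the sketch below ports the same branch structure to a self-contained function and runs it on a few small columns. The scitypes here are stand-in abstract types declared locally (an assumption, so the snippet runs without the package), suggest is a hypothetical name, and the element type is read from the column instead of a schema.

```julia
# Stand-in scitypes, declared here only to keep the sketch self-contained.
abstract type Multiclass{N} end
abstract type OrderedFactor{N} end
abstract type Continuous end
abstract type Count end
abstract type Unknown end

# Same three branches as the draft, taking the element type from the column.
function suggest(col::AbstractVector)
    T, nrows, n = eltype(col), length(col), length(unique(col))
    if n ≤ 3 && nrows > 15
        return Multiclass{n}
    elseif n < min(0.1 * nrows, 100)
        return T <: Real ? OrderedFactor{n} : Multiclass{n}
    else
        T <: AbstractFloat && return Continuous
        T <: Integer && return Count
        return Unknown
    end
end

flags  = repeat([0, 1], 10)                # 20 rows, 2 unique values
grades = repeat(1:5, 20)                   # 100 rows, 5 unique values
ids    = collect(1:50)                     # 50 rows, all unique
words  = ["red", "white", "blue", "blue"]  # 4 rows, 3 unique values

suggest(flags)   # Multiclass{2}: few values and more than 15 rows
suggest(grades)  # OrderedFactor{5}: 5 < min(10.0, 100) and numeric
suggest(ids)     # Count: 50 unique values out of 50 rows is not "few"
suggest(words)   # Unknown: only 4 rows, and strings have no fallback yet
```

Note that with these thresholds a four-row table, like the one in the opening comment, falls through to the last branch for every column, so string columns would come back as Unknown until the "XXX to complete" fallback is filled in.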

ablaom commented 5 years ago

Looks good to me. Could go in src/conventions/mlj/auto_types.jl or similar.

@tlienart Have added a dev branch and given you write access.

tlienart commented 5 years ago

Closed by #4.