JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License

Tables.jl and DataAPI.jl interoperation #67

Closed bkamins closed 2 years ago

bkamins commented 2 years ago

@ablaom I am not sure if this is the best place to start this discussion, but it is a follow-up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and https://github.com/JuliaData/Tables.jl/pull/278.

The key point is to avoid creating functions having essentially the same functionalities across DataAPI.jl, Tables.jl, and MLUtils.jl (possibly other ML packages I am not aware of).

Assume for a moment that a Tables.jl table is the source of data for some ML model and you want operations to be efficient.

My understanding is that your high-level workflow is the following:

  1. the user starts with a Tables.jl table.
  2. then the user does observation subsetting, feature selection, feature transformation operations on this table (either eagerly or lazily).
  3. finally the user transforms the result of step 2 (again, either lazily or eagerly) into an object of some other type that can be accepted as input by the ML algorithm.
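
To make the three steps concrete, here is a minimal sketch in plain Julia. It assumes the table is a NamedTuple of equal-length vectors (a layout that Tables.jl accepts as a column table) and does everything eagerly. The column names and the helper `subsetrows` are hypothetical, chosen only for illustration; they are not part of Tables.jl, DataAPI.jl, or MLUtils.jl.

```julia
# Step 1: a column table represented as a NamedTuple of equal-length vectors.
table = (age = [23, 35, 41, 58],
         income = [30.0, 52.0, 61.0, 75.0],
         label = [0, 1, 1, 0])

# Step 2a: observation subsetting.
# `subsetrows` is a hypothetical helper; it eagerly copies the selected rows.
subsetrows(tbl, idxs) = map(col -> col[idxs], tbl)
train = subsetrows(table, [1, 3, 4])

# Step 2b: feature selection, keeping only the predictor columns.
features = NamedTuple{(:age, :income)}(train)

# Step 2c: feature transformation, rescaling one column.
transformed = merge(features, (income = features.income ./ 100,))

# Step 3: conversion to the input type of the ML algorithm, here a
# features × observations matrix (observations along the last dimension).
X = permutedims(hcat(values(transformed)...))
```

Each step here allocates a new object. A lazy variant would replace the copies in steps 2a to 2c with views or wrappers, which is where shared row-access functions across the packages would matter.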

The question is:

What functionality do you need in DataAPI.jl and Tables.jl so that this workflow is efficient and you do not need to provide duplicate definitions of concepts in MLUtils.jl (or some other packages)? Another consideration (raised in the linked discussions) is that I would expect whatever we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator, abstract vector, indexing, and view interfaces).

bkamins commented 2 years ago

CC @nalimilan @quinnj

AriMKatz commented 2 years ago

@darsnack @touchesir

AriMKatz commented 2 years ago

Also cc @manikyabard

CarloLucibello commented 2 years ago

I think the only methods we need in Tables.jl are

Now, since there is no AbstractTable type, it is not clear how to achieve interoperability. One option is to change the generic fallbacks in https://github.com/JuliaML/MLUtils.jl/blob/main/src/observation.jl as follows:

function numobs(data)
  if istable(data)
    return numrows(data)
  else
    return length(data)
  end
end

function getobs(data, i)
  if istable(data)
    return getrow(data, i)
  else
    return data[i]
  end
end

Having those branches in such low-level functions is not great but I don't know how else we can support generic Tables.jl's tables here.
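
For readers who want to try the branching fallbacks without any packages, here is a self-contained sketch. The functions `istable`, `numrows`, and `getrow` below are hypothetical stand-ins written for this example, not the Tables.jl API; they treat only a NamedTuple of vectors as a table.

```julia
# Hypothetical stand-ins: a "table" is a NamedTuple whose values are all vectors.
istable(data) = data isa NamedTuple && all(col -> col isa AbstractVector, values(data))
numrows(tbl) = length(first(tbl))          # assumes at least one column
getrow(tbl, i) = map(col -> col[i], tbl)   # row i as a NamedTuple

# Generic fallbacks with the table branch, as proposed above.
numobs(data) = istable(data) ? numrows(data) : length(data)
getobs(data, i) = istable(data) ? getrow(data, i) : data[i]

tbl = (x = [1, 2, 3], y = ["a", "b", "c"])
vec = [10, 20]

n_tbl = numobs(tbl)       # number of rows, not number of columns
row2 = getobs(tbl, 2)     # second row as a NamedTuple
n_vec = numobs(vec)       # falls back to length
first_vec = getobs(vec, 1)  # falls back to indexing
```

The sketch also shows why the branch is needed at all: `length(tbl)` on this NamedTuple returns 2 (the number of columns), so the plain `length` fallback would report the wrong number of observations for a table.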

bkamins commented 2 years ago

x-ref to discussion in Tables.jl https://github.com/JuliaData/Tables.jl/pull/278

CarloLucibello commented 2 years ago

Closing this and leaving only #61 open