Closed (bkamins closed this 2 years ago)
CC @nalimilan @quinnj
@darsnack @touchesir
Also cc @manikyabard
I think the only methods we need in Tables.jl are:
- numrows(table), returning the number of rows (already available as length(rows(table)))
- getrow(tables, i::Int), returning a materialized row (similar to df[i, :] for a DataFrame)
- getrow(tables, i::AbstractVector{<:Integer}), returning a materialized subtable (again similar to df[i, :] for a DataFrame)

Now, since there is no AbstractTable type, it is not clear how to achieve interoperability. One option is to change the generic fallbacks in https://github.com/JuliaML/MLUtils.jl/blob/main/src/observation.jl as follows:
    function numobs(data)
        if istable(data)
            return numrows(data)
        else
            return length(data)
        end
    end

    function getobs(data, i)
        if istable(data)
            return getrow(data, i)
        else
            return data[i]
        end
    end
Having those branches in such low-level functions is not great, but I don't know how else we can support generic Tables.jl tables here.
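To make the proposal concrete, here is a minimal sketch of how the two helpers and the branching fallbacks might be written against accessors Tables.jl already provides (Tables.istable, Tables.rows, Tables.columns, Tables.columnnames, Tables.getcolumn). The functions numrows, getrow, numobs, and getobs are defined locally for illustration and follow the names used above; they are assumptions, not existing Tables.jl or MLUtils.jl definitions. The sketch also assumes the rows iterator supports length, which is not guaranteed for every table source.

```julia
using Tables  # assumes Tables.jl is installed

# Hypothetical helper named after the proposal above (not part of Tables.jl).
# Assumes `Tables.rows(table)` supports `length`.
numrows(table) = length(Tables.rows(table))

# Hypothetical helper: an integer `i` yields one row as a NamedTuple; a vector
# `i` yields a NamedTuple of sub-vectors, which is itself a column table.
function getrow(table, i)
    cols = Tables.columns(table)
    names = Tuple(Tables.columnnames(cols))
    return NamedTuple{names}(map(n -> Tables.getcolumn(cols, n)[i], names))
end

# Local stand-ins for the generic fallbacks with the proposed branch.
numobs(data) = Tables.istable(data) ? numrows(data) : length(data)
getobs(data, i) = Tables.istable(data) ? getrow(data, i) : data[i]

table = (x = [1, 2, 3], y = ["a", "b", "c"])  # a NamedTuple of vectors is a column table
row = getobs(table, 2)         # single materialized row
sub = getobs(table, [1, 3])    # materialized subtable
plain = getobs([10, 20, 30], 2)  # non-table data falls through to indexing
```

Note that going through Tables.columns may copy the data for row-oriented sources, so this is a sketch of the semantics rather than of an efficient implementation.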
x-ref to discussion in Tables.jl https://github.com/JuliaData/Tables.jl/pull/278
Closing this and leaving only #61 open
@ablaom I am not sure if this is the best place to start this discussion, but it is a follow-up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and https://github.com/JuliaData/Tables.jl/pull/278.
The key point is to avoid creating functions with essentially the same functionality across DataAPI.jl, Tables.jl, and MLUtils.jl (and possibly other ML packages I am not aware of).
Assume for a moment that a Tables.jl table is the source of data for some ML model and you want operations on it to be efficient.
My understanding is that your high-level workflow is the following:
The question is:
What functionality do you need in DataAPI.jl and Tables.jl so that this workflow is efficient and you do not need to provide duplicate definitions of the same concepts in MLUtils.jl (or other packages)? Another consideration (raised in the linked discussions) is that I would expect whatever we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator interface, abstract vector interface, indexing interface, and view interface).
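As an illustration of what "consistent with Base interfaces" could mean, the sketch below wraps a column table in a type that implements the abstract vector interface over rows, so that length, integer indexing, vector indexing, iteration, and view all come from Base rather than from new function names. RowVector is a hypothetical name introduced only for this example, and the sketch assumes the table is a NamedTuple of equal-length vectors; it is not a proposal for how Tables.jl should implement this.

```julia
# Hypothetical wrapper: presents a column table (a NamedTuple of equal-length
# vectors) as an AbstractVector whose elements are rows (NamedTuples).
struct RowVector{R<:NamedTuple, C<:NamedTuple} <: AbstractVector{R}
    columns::C
    function RowVector(columns::NamedTuple{names}) where {names}
        # Row type: same field names, element types taken from each column.
        R = NamedTuple{names, Tuple{map(eltype, values(columns))...}}
        return new{R, typeof(columns)}(columns)
    end
end

# The two methods Base requires for a read-only AbstractVector.
Base.size(rv::RowVector) = (length(first(rv.columns)),)
Base.getindex(rv::RowVector{R}, i::Int) where {R} =
    R(map(col -> col[i], values(rv.columns)))
Base.IndexStyle(::Type{<:RowVector}) = IndexLinear()

rv = RowVector((x = [1, 2, 3], y = ["a", "b", "c"]))

n = length(rv)           # number of observations, via Base.length
r = rv[2]                # one row, via Base.getindex
s = rv[[1, 3]]           # several rows, via the generic AbstractArray fallback
v = view(rv, 2:3)        # lazy subset, via Base.view (a SubArray)
```

With this design, a consumer such as MLUtils.jl would only need the existing length and getindex fallbacks; the cost is that vector indexing returns a Vector of rows rather than a table of the original type.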