JuliaAI / ScientificTypes.jl

An API for dispatching on the "scientific" type of data instead of the machine type

Towards a more efficient `schema` method for row-based tables #127

Open ablaom opened 3 years ago

ablaom commented 3 years ago

The issue has been raised that `schema`, applied to a table, currently has to concretely materialize each column as a way of extracting its scitype, which is inefficient for row-based tables. The reason for the current implementation is that, in general, the column element scitype cannot be inferred from the column element machine type.

Here are some details so that someone interested can explore a workaround, which I think is certainly possible.

At present (and this might change) the only case in which the scitype of an array `A` cannot be determined from the machine type is when

eltype(A) <: CategoricalArrays.CategoricalValue    

This is because the scitype depends on: (i) whether or not the pool is ordered, and (ii) the number of levels. Neither of these is encoded in the machine type; they must be extracted from an instance. However, it is safe to assume that all elements have the same scitype, because it is very unusual for an array to have inhomogeneous pools (the `CategoricalPool` contains the order/levels information). Indeed, CategoricalArrays goes to great lengths to ensure that creating such arrays is difficult. Under this assumption, one can therefore compute the scitype of `A` by looking at just the first element (which, for tables, means looking at just the first row).
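To make the idea concrete, here is a minimal sketch, not a proposed implementation. It first shows two categorical vectors that share a machine element type but have different scitypes, and then reads the scitype of each categorical column off the first row of a row table. The helper name `categorical_scitypes_from_first_row` is hypothetical and not part of ScientificTypes.jl. The sketch assumes the default scitype convention, a non-empty table, homogeneous pools within each column, and a non-missing first entry in each categorical column; a `missing` first entry is simply skipped.

```julia
using CategoricalArrays, ScientificTypes, Tables

# Same machine element type, different pools.
a = categorical(["x", "y", "z"])                   # unordered, 3 levels
b = categorical(["lo", "hi", "lo"], ordered=true)  # ordered, 2 levels

eltype(a) == eltype(b)   # both are CategoricalValue{String, UInt32}
scitype(a[1])            # Multiclass{3}
scitype(b[1])            # OrderedFactor{2}

# Hypothetical helper (not part of ScientificTypes.jl): element scitype of
# each categorical column, computed from the first row only.
function categorical_scitypes_from_first_row(X)
    row = first(Tables.rows(X))          # assumes X has at least one row
    result = Dict{Symbol,Type}()
    for name in Tables.columnnames(row)
        v = Tables.getcolumn(row, name)
        # Only categorical entries need an instance; other columns can be
        # handled from the machine type and are ignored in this sketch.
        if v isa CategoricalValue
            result[name] = scitype(v)
        end
    end
    return result
end

# A row table: a vector of named tuples, one per row.
rowtable = Tables.rowtable((a = a, b = b, c = [1.0, 2.0, 3.0]))
scitypes = categorical_scitypes_from_first_row(rowtable)
```

Here `scitypes[:a]` is `Multiclass{3}` and `scitypes[:b]` is `OrderedFactor{2}`, and no column is ever materialized. A real `schema` method would still need to combine this with the machine-type information for the remaining columns, and to handle columns whose element type allows `missing`.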

cc @OkonSamuel

ablaom commented 3 years ago

@OkonSamuel