JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Standardize the way fit! and predict methods take X matrix (features) #220

Closed pebeto closed 1 year ago

pebeto commented 1 year ago

Some Julia packages use columns as rows. This library forces the user to supply data in a fully tabular layout, as if it were a DataFrame.

Example:

using DecisionTree

rfc = RandomForestClassifier()
x = rand(3, 100)
y = rand(0:1, 100)
DecisionTree.fit!(rfc, x, y) # fails with a BoundsError

ablaom commented 1 year ago

@pebeto Thanks for your post.

Some Julia packages use columns as rows.

I guess you mean columns as observations?

Yes, the convention here is that each row of x represents a single observation. You are using the ScikitLearn.jl interface here (fit! comes from that package), and it adopts this convention because the Python scikit-learn package does.

So this works just fine:

using DecisionTree

rfc = RandomForestClassifier()
x = rand(100,3) # 100 observations
y = rand(0:1, 100) # 100 observations
DecisionTree.fit!(rfc, x, y)

The observation-as-rows convention is also the one adopted by the native API, and by the MLJ interface (which expects a table satisfying the Tables.jl API, such as a DataFrame, rather than a matrix).
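
For data that arrives with one observation per column, as in the original example, a transpose before fitting is one way to match the convention described above. The following is a minimal sketch using only Base Julia; the variable name x_obs_as_cols is illustrative, and the fit! call is left as a comment so the snippet runs without DecisionTree installed.

```julia
# Hypothetical input laid out with one observation per column:
# 3 features (rows) x 100 observations (columns).
x_obs_as_cols = rand(3, 100)
y = rand(0:1, 100)

# permutedims materialises a new 100x3 Matrix with one observation per row.
x = permutedims(x_obs_as_cols)

size(x)                    # (100, 3)
size(x, 1) == length(y)    # true: one label per row

# With DecisionTree loaded, the call from the example above would then be:
# DecisionTree.fit!(RandomForestClassifier(), x, y)
```

permutedims is used here instead of the lazy x' wrapper so that the result is a plain Matrix; whether a lazy transpose would also be accepted by fit! is not covered in this thread.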

By the way, this convention is the natural one for all tree models in Julia because tree models consume data one feature at a time, not one observation at a time as many other models do, and Julia arrays are column-major.
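
The column-major point can be checked directly in Base Julia. This is a small sketch, not DecisionTree.jl code: it only shows that, for a matrix with observations as rows, each feature column occupies one contiguous block of memory.

```julia
x = rand(100, 3)    # 100 observations (rows), 3 features (columns)

# Column-major storage: stepping down a column moves 1 element in memory,
# stepping along a row moves 100 elements.
strides(x)          # (1, 100)

# The first feature column is exactly the first 100 entries of the
# underlying storage, so reading one feature is a contiguous scan.
vec(x)[1:100] == x[:, 1]    # true

# A single observation (row), by contrast, is spread across memory.
obs7 = x[7, :]
obs7 == [vec(x)[7], vec(x)[107], vec(x)[207]]    # true
```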

So, it seems to me there is already a standard in place, and I'm not clear about your request. Am I missing something?

pebeto commented 1 year ago

When we create a matrix that does not come from a tabular data source, you're getting a columns * rows shape, where we can take rows as observations. My suggestion relates to that, because there are libraries using this kind of "convention". So if a deliberate effort was made to depart from the original language behavior, then there's no problem. I thought it was something like a "language feature".

ablaom commented 1 year ago

I'm still not clear on your proposal.

you're getting a columns * rows shape

? According to my understanding of the words "columns" and "rows", the shape of any matrix is (number of rows) x (number of cols).
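
As a quick check of that reading, size in Julia reports (rows, columns). The sketch below applies it to the matrix from the original report; the names n_obs and n_features are illustrative and assume the rows-as-observations convention discussed above.

```julia
x = rand(3, 100)
y = rand(0:1, 100)

# size returns (number of rows, number of columns).
n_obs, n_features = size(x)    # 3 rows, 100 columns

# Under the rows-as-observations convention this is 3 observations
# of 100 features, which does not match the 100 labels in y.
n_obs == length(y)             # false
```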

Reopen this issue with more detail if you really think it is DecisionTree.jl-specific. Otherwise, a more appropriate forum might be Julia Discourse (e.g., this thread).