JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

One-hot label encoding #18

Closed abieler closed 7 years ago

abieler commented 7 years ago

Suggestion to include functionality to switch labels between dense and one-hot representation. I can create a pull request if you like the idea.

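function dense2onehot(y_dense)
    # Return integer labels one-hot encoded as an nClasses x nLabels matrix.
    # Subtract an offset from the labels such that
    # minimum label == 1 (due to one-based indexing).
    min_label = minimum(y_dense)
    offset = min_label - 1
    y_new = y_dense .- offset  # array minus scalar must be broadcast with `.-`
    # Size by the largest shifted label so that gaps in the label range
    # cannot produce an out-of-bounds row index.
    nClasses = maximum(y_new)
    nLabels = length(y_new)
    y_onehot = zeros(Int, (nClasses, nLabels))
    for (i, l) in enumerate(y_new)
        y_onehot[l, i] = 1
    end
    y_onehot
end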

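function onehot2dense(y_onehot)
    # Return, for each column, the row index of its largest entry.
    # Labels come back one-based; any offset removed during encoding is not restored.
    y_dense = zeros(Int, size(y_onehot, 2))
    for i in eachindex(y_dense)
        # view avoids copying the column just to find its maximum
        y_dense[i] = argmax(view(y_onehot, :, i))
    end
    y_dense
end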

A sparse-matrix version would probably be useful as well.
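For illustration, a sparse variant could look roughly like the sketch below. The function name `dense2onehot_sparse` is hypothetical and not part of any package; the sketch assumes integer labels and relies only on the `sparse(I, J, V, m, n)` constructor from the `SparseArrays` standard library.

```julia
using SparseArrays

# Hypothetical helper, for illustration only.
# Encodes integer labels as a sparse nClasses x nLabels one-hot matrix.
function dense2onehot_sparse(y_dense::AbstractVector{<:Integer})
    # Shift labels so that the smallest label maps to row 1.
    rows = Int.(y_dense .- minimum(y_dense)) .+ 1
    cols = collect(1:length(rows))
    n_classes = maximum(rows)
    # sparse(I, J, V, m, n) builds an m x n matrix with S[I[k], J[k]] = V[k].
    sparse(rows, cols, ones(Int, length(rows)), n_classes, length(rows))
end

y = [0, 2, 1, 2]
Y = dense2onehot_sparse(y)
```

Only one entry per label is stored, so memory use grows with the number of labels rather than with the number of classes times the number of labels.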

Evizero commented 7 years ago

Hi! Thanks for the suggestion. We surely want this package to provide some encoding capabilities, but I am still thinking about what exactly they will look like.

An early implementation was in the predecessor of LearnBase here, which built on MLBase. It had the nice property of storing which class index each string label represents, which is something we want.

For consistency reasons, the solution I am aiming for now will depend on how exactly MLMetrics ends up handling this (a draft is here), which I will hopefully get to before the end of the year.

So I really would like to avoid the situation that a user has to first manually encode his/her string targets to one-based indices, and then encode those to one-hot vectors. That said, it would still be nice to expose such low-level functionality for those who care to use it.
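As a rough sketch of that idea, the snippet below keeps the mapping between arbitrary labels and class indices in one object, so string targets can be one-hot encoded and decoded in a single step. All names here (`LabelMap`, `encode_onehot`, `decode_onehot`) are hypothetical and chosen for illustration; they are not the API of LearnBase, MLMetrics, or any other package.

```julia
# Hypothetical types and helpers, for illustration only.
struct LabelMap{T}
    labels::Vector{T}     # class index -> label
    index::Dict{T,Int}    # label -> class index
end

# Build the mapping from the targets, in order of first appearance.
function LabelMap(targets::AbstractVector{T}) where {T}
    labels = unique(targets)
    LabelMap{T}(labels, Dict(l => i for (i, l) in enumerate(labels)))
end

# Encode targets directly into an nClasses x nLabels one-hot matrix.
function encode_onehot(lm::LabelMap, targets::AbstractVector)
    Y = zeros(Int, length(lm.labels), length(targets))
    for (j, t) in enumerate(targets)
        Y[lm.index[t], j] = 1
    end
    Y
end

# Recover the original labels from the column-wise argmax.
decode_onehot(lm::LabelMap, Y::AbstractMatrix) =
    [lm.labels[argmax(view(Y, :, j))] for j in axes(Y, 2)]

targets = ["cat", "dog", "cat", "bird"]
lm = LabelMap(targets)
Y = encode_onehot(lm, targets)
decoded = decode_onehot(lm, Y)
```

Because the map stores both directions, the user never has to produce one-based indices by hand, while the index-based encoding remains available as the lower-level step.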

So, long story short, we could likely find a useful place for snippets of your code down the road in one form or another. If you would like to contribute, I would merge a PR in order to record your nice contribution in the git history, but chances are that the code will change and move around before being exposed in a tagged version of the library.

abieler commented 7 years ago

No worries, it makes more sense to wait a little than to implement something prematurely.

Evizero commented 7 years ago

I now address one-hot encoding over at https://github.com/JuliaML/MLLabelUtils.jl, along with other common encoding formats. The implementation is pretty much done as of this moment, and I will hopefully document it over the next few days.