
Adding MulticlassPerceptronClassifier #211

Closed davidbp closed 4 years ago

davidbp commented 5 years ago

Hello,

I've coded a MulticlassPerceptron model that I would like to include in MLJ. I tried to follow option 1 from the guidelines.

The code can be found in the MulticlassPerceptron repository. To check that the code works, run basic_usage_train.jl from the repo. This script simply trains a model on the MNIST data and evaluates it.

Let's see if I can start contributing seriously to MLJ. First I want to port a very simple model and keep learning from here.

Thank you for your time!

ablaom commented 5 years ago

Hi @davidbp,

That sounds exciting. What version of MLJBase are you using? Do you perhaps have a Project.toml file you forgot to push?

ablaom commented 5 years ago

Okay, I've had a bit of a look over the code, which looks very nice, and I think you are very close to getting an MLJ compliant interface.

Couple of challenges for me:

  1. In trying to turn your basic_usage_train.jl into a test suite (by moving it to test/runtests.jl and adding [extras] to the Project.toml, which I had to create), I encountered some package compatibility problems with MLDatasets, which I could not understand, but which I'm guessing have to do with the fact that this package is 0.7-era with no Project.toml file.

  2. I could not quickly determine what form of data X you are allowing. This is not articulated with an input_scitype (formerly input_scitype_union) trait definition, but I'm guessing this is because you don't want to support only tabular data. (Actually, you don't have to, under MLJBase 0.2.6, but this is a moot point as we have moved on.)

Here's some suggestions as to how you could help me help you:

Address point 1 by abandoning MLDatasets. I've had a quick revisit and the package looks dead to me. How about getting the MNIST data from Flux, an actively maintained package? You can see how in the ModelZoo example here. The data is in a different format, Array{Array{Gray{FixedPointNumbers.Normed{UInt8,8}},2},1}, but I am assuming that you would want to support such a format in any case?

Regarding 2, the ideal situation is that your code can handle - as a minimum - any X such that scitype(X) <: AbstractVector{<:GrayImage}, or better, scitype(X) <: AbstractVector{<:Image}. This is applying the definition of scitype explained at the bottom of the README at ScientificTypes. The above format satisfies both requirements. Then, under MLJBase 0.4.0, you could articulate your requirement with a trait declaration like:

using ScientificTypes
MLJBase.input_scitype(::Type{<:MulticlassPerceptronClassifier}) = AbstractVector{<:Image}

and users will never give you any other type, and will be able to find your model from the registry given their data.

If you believe it is absolutely essential to support other image formats (and not expect users to make coercions) then we can talk about updating the scitype definition to include them. (However, note this means all model implementers will have to support all the types included).

Of course, usage of your model will go up appreciably if you also support tabular data, which you could do with a declaration like:

using ScientificTypes
MLJBase.input_scitype(::Type{<:MulticlassPerceptronClassifier}) =
    Union{AbstractVector{<:Image},Table(Continuous)}

You might want to add AbstractMatrix{<:Continuous} to the trait, if your code could handle any matrix of abstract floats, and so forth.
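Putting those together, the full declaration might look like this (just a sketch; the exact union is your call):

using ScientificTypes

MLJBase.input_scitype(::Type{<:MulticlassPerceptronClassifier}) =
    Union{AbstractVector{<:Image}, Table(Continuous), AbstractMatrix{<:Continuous}}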

Currently, the scitype of a sparse version of the above image format would be AbstractArray{GrayImage}, but we could overload scitype in ScientificTypes.jl to return SparseMatrixCSC{GrayImage, I} where I, or something along those lines. But this is a detail for later.

I'm happy to help with the trait declarations; you just tell me precisely what (machine) types your code can handle.

As a last remark regarding tests: the fact that your test error drops on MNIST is presumably not a very rigorous test that the code is doing what it is supposed to do. Are you planning something more? Perhaps it's there but I didn't see it.

What do you think?

davidbp commented 5 years ago

Let me address the 2 points you mention separately.

About 1

I've updated the example and it now uses the Flux MNIST loader. Notice that the data is returned as Array{Array{Gray{FixedPointNumbers.Normed{UInt8,8}},2},1}, but this does not have to be how the data is passed to a train method in Flux. For example, here the train method uses the data in the same format as my code.

Comments show the sizes before and after transformations:

train_imgs = MNIST.images(:train)   # size(train_imgs) -> (60000,)
test_imgs  = MNIST.images(:test)    # size(test_imgs) -> (10000,)
train_x    = Float32.(hcat(reshape.(train_imgs, :)...)) # size(train_x) -> (784, 60000)
test_x     = Float32.(hcat(reshape.(test_imgs, :)...)) # size(test_x)   -> (784, 10000)

The transformation is done in line 10: X = hcat(float.(reshape.(imgs, :))...) |> gpu. This is the format that makes the most sense for both an MLP and a multiclass perceptron.

In any case, I will also think about how to support other formats. For now I just want to represent an example (or observation) as a single column vector of features (in this case there is a bijection between features and pixel intensities, but this might not hold if descriptors are extracted from images).

About 2

You could not determine the form of the data because I fit the model with a general AbstractArray of dimension 2. The code is meant to also work with sparse matrix objects.

I will take a look at how I can add more formats, but it's important that we find a way to use "standard arrays". If that can't be done, then maybe we could make a dummy struct wrapping a standard array so that it can be used for fancy multiple dispatch.

Why "standard arrays"

This is just an example that illustrates why I want to allow my model to work with a "column vector". In the following code, img is "an image" (it could well be an array of RGB colors), but the image is transformed into a standard array by create_descriptor (in this case I'm actually taking slices of the image). Therefore the input of the model is not an image anymore; it is an array of size h * w * depth, where depth is a hyperparameter. Note that by h * w * depth I don't mean a 3D array, I mean a Vector of length h * w * depth. Even if create_descriptor generates an N-dimensional array, I flatten the information into a simple column vector.

@time begin
    # Slide an n_row_patch × n_col_patch window over img with the given stride
    for j in 1:stride:(cols - n_col_patch)
        for i in 1:stride:(rows - n_row_patch)
            patch = view(img, i:i+n_row_patch-1, j:j+n_col_patch-1)
            # HOG descriptor (ImageFeatures.jl): the patch becomes a plain feature vector
            descriptor     = create_descriptor(patch, HOG())
            y_hat, y_score = Perceptron.predict(percep, descriptor)

            push!(y_scores, y_score)
            push!(y_hats, y_hat)
            positions[k] = (i, j)
            k += 1
        end
    end
end

About tests

I've opened issue #214 so that we can have generated datasets to build test cases for learning algorithms. I don't know of any "rigorous" way to test that the fit method does what it's supposed to do. The most reasonable test I can think of is a small battery of datasets on which, in theory, the algorithm should do a perfect job. If the algorithm passes all of them, we have some confidence that it's doing the right thing (a random set of weights will not produce an accuracy of 1.0 on several blob-like datasets). A sketch of such a test is shown below.
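For concreteness, here is a minimal sketch of such a test, assuming this package's MLJBase.fit/predict methods (make_blobs below is a hand-rolled helper, not an existing function, and I'm assuming predict is deterministic):

using CategoricalArrays, MLJBase, Random, Statistics, Test

# Two well-separated Gaussian blobs, features × observations
function make_blobs(n_per_class, n_features; rng=MersenneTwister(0))
    X = hcat(randn(rng, n_features, n_per_class) .- 5,
             randn(rng, n_features, n_per_class) .+ 5)
    y = categorical(vcat(fill("a", n_per_class), fill("b", n_per_class)))
    return X, y
end

X, y = make_blobs(100, 10)
perceptron = MulticlassPerceptronClassifier()      # default hyperparameters
fitresult, _, _ = MLJBase.fit(perceptron, 1, X, y)
ŷ = MLJBase.predict(perceptron, fitresult, X)
@test mean(ŷ .== y) == 1.0   # separable blobs should be classified perfectly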

Todo next

For me, I want to add examples of the model running on tabular data.

ablaom commented 5 years ago

Thanks for that.

You have a lot of detail here. I'm less interested in what is the most natural format for your algorithm and more interested in settling the question of what format is most friendly to require of the user. Naturally, you can make all the transformations you like under the hood.

Would you have any objection, say, to the requirement that X is one or both of the following:

(i) AbstractVector{<:Image}, where Image = Union{ColorImage,GrayImage}, GrayImage = AbstractArray{<:Gray,2} and ColorImage = AbstractArray{<:AbstractRGB,2}? Here Gray and AbstractRGB are defined in ColorTypes. Note this allows for the use of SparseMatrixCSC{<:Gray}, and so forth.

(ii) Any Tables.jl compatible table having AbstractFloat columns?

davidbp commented 5 years ago

I'm not sure if I have an objection, because I haven't implemented it yet (there could be corner cases that I'm not aware of...).

I think we could start adding support for AbstractVector{<:Image}. If I understand it correctly, if X <: AbstractVector{<:Image} then X[1] is an image of a given size and X[2] is an image (of possibly a different size). Is this correct?

My issue is still that I would like to model pipelines containing

Image -> feature_extraction -> perceptron

where feature_extraction returns an Array (not an AbstractVector{<:Image}).

Then, to model such a thing in MLJ, I would like to create two Machine objects, make a learning network with the machines, and then fit the whole learning network. The issue is that, as far as I understand, the output of feature_extraction will be an array, which will conflict with the input of the second machine. A sketch of what I have in mind follows.
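Something like this (a sketch using the learning-network API; feature_extractor is a hypothetical Unsupervised model wrapping create_descriptor):

using MLJ

Xs = source(X)    # X: an AbstractVector{<:Image}
ys = source(y)

mach1 = machine(feature_extractor, Xs)
Xfeat = transform(mach1, Xs)      # node producing a plain Array, not images

mach2 = machine(perceptron, Xfeat, ys)
ŷ     = predict(mach2, Xfeat)

fit!(ŷ)    # fitting the terminal node fits the whole network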

Before building anything involving more than one Machine, I should start with the capability of creating a single machine. Right now I'm having problems trying to run the scripts basic_usage_train_machine.jl and basic_usage_train2_machine.jl because of scitype-related details. Maybe you could give a hint on how to adapt the code so that both examples work.

ablaom commented 5 years ago

Thanks for that.

I think we could start adding support for AbstractVector{<:Image} . If I understand it correctly, if X <: AbstractVector{<:Image} then X[1] is an image of a given size and X[2] is an image (of possibly a different size). Is this correct?

Yes, you are right, the images could have different sizes, which potentially your algorithm would not support, and I had not thought of that. We should probably add size parameters to Image. However, for now I think we can safely ignore the issue.

My issue is still that I would like to model pipelines containing

Image -> feature_extraction -> perceptron where feature_extraction returns an Array (not an AbstractVector{<:Image}).

Then you have a choice: add Table(Continuous) to your supported formats and simply insert a dummy tabularization (MLJ.table) into your pipeline (this has the advantage of pleasing the many users who like to pass tabular data directly to the model); or add AbstractMatrix{<:Continuous} to your supported formats; or support all three formats: input_scitype(::Type{<:YourModel}) = Union{etc}.
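For example, the dummy tabularization is a one-liner (a sketch):

using MLJ

Xmat = rand(100, 784)     # raw matrix, rows as observations (MLJ convention)
Xtab = MLJ.table(Xmat)    # Tables.jl-compatible wrapper, essentially no copy
# scitype(Xtab) <: Table(Continuous) holds, so the trait requirement is met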

Maybe you could give a hint on how to adapt the code so that both examples work.

Can you provide more specific detail on the issue? I can take a look but don't have time just right now.

ablaom commented 5 years ago

update to Image scitypes: https://github.com/alan-turing-institute/ScientificTypes.jl/commit/d54d6b87932bb2029d25e25dee49998cfd06c62a

davidbp commented 5 years ago

Hello again, sorry for not fixing this sooner! The good news is that I just updated the fit method and now get a bit more accuracy.

My code works when I use the API directly (example 1):

 MLJBase.fit(perceptron, 1, train_x, train_y)

but it does not work correctly when I try example 2:

MLJBase.fit!(perceptron_machine)

It gets stuck during fit!(perceptron_machine) (the code hangs with very high CPU usage for a long time).

davidbp commented 5 years ago

I think I found the issue

In my fit method I check the input type:


function MLJBase.fit(model::MulticlassPerceptronClassifier,
                     verbosity::Int,
                     X,
                     y)

    @assert y isa CategoricalArray "typeof(y)=$(typeof(y)) but typeof(y) should be a CategoricalArray"

    n_classes      = length(unique(y))
    classes_seen   = unique(y)
    n_features, _  = num_features_and_observations(X)

    # Function fit!(perceptron, X, y) expects size(X) = n_features x n_observations
    if X isa AbstractDataFrame
        X = MLJBase.matrix(X)
        X = copy(X')
    elseif X isa AbstractArray
        X = MLJBase.matrix(X)
    end
    # ... (rest of the method)

But if neither condition is true, the code freezes. For example, in the MNIST case the input has type Tables.MatrixTable{Array{Float32,2}}, which is not an AbstractArray. I fixed this issue but the code still freezes :S

ablaom commented 5 years ago

Okay:

1) No need to check the type of y. I assume you have something like

using ScientificTypes
MLJBase.target_scitype(::Type{<:MulticlassPerceptronClassifier}) = AbstractVector{<:Finite}

which ensures your y will be an AbstractVector with categorical elements (type CategoricalArrays.CategoricalValue or CategoricalArrays.CategoricalString).

2) You might want to think about whether you want unique(y) (classes seen) or classes(y[1]) (all categorical elements in the pool) here.
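To illustrate the difference (a quick sketch):

using CategoricalArrays, MLJBase

y = categorical(["a", "a", "b"])
levels!(y, ["a", "b", "c"])   # pool now includes an unseen class "c"

unique(y)        # ["a", "b"]      (classes actually seen)
classes(y[1])    # ["a", "b", "c"] (all classes in the pool)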

3) You would never test if something is a DataFrame. Use Tables.istable to check if it is a table. Any table can be converted to a matrix with MLJBase.matrix (a wrap of Tables.matrix). Note you don't have control of the precise type of X anyhow, only its scientific type. So, I'm guessing you're declaring something like

MLJBase.input_scitype(::Type{<:MulticlassPerceptronClassifier}) = Union{Table(Continuous), AbstractMatrix{Continuous}}

And you could include vectors of images as well, as explained earlier.
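With such a declaration, the conversion inside fit collapses to something like (a sketch):

if Tables.istable(X)
    Xmat = MLJBase.matrix(X)   # observations as rows, per MLJ convention
else
    Xmat = X                   # already an AbstractMatrix
end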

ablaom commented 5 years ago

Regarding your row-versus-column-major issue: as @tlienart states in the Slack thread, the MLJ convention is rows as patterns. However, instead of copying X when it is input as an AbstractArray, you may want to check if it is already an adjoint, and then just work with X' rather than a copy, no? In this case the savvy user can get the performance they want.
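In code, the check could look something like this (a sketch, assuming the algorithm wants features × observations internally):

using LinearAlgebra

if X isa Adjoint
    Xt = parent(X)    # the underlying features × observations array, no copy
else
    Xt = copy(X')     # materialize the transpose once
end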

davidbp commented 4 years ago

Why shouldn't I test if something is a DataFrame? A machine can work with DataFrames and Tables, right? What I can't do is use a machine with an AbstractArray.

In any case I fixed the bugs and the code now seems to work.

Machine overhead?

I have seen some overhead when using a machine versus fitting directly, but it could very well be that my code behaves differently when a Table is passed versus an AbstractArray.

If you clone the repo you should be able to reproduce this by simply executing time julia basic_usage_train_machine.jl and time julia basic_usage_train.jl.

Time with machine wrapping (time julia basic_usage_train_machine.jl):

MNIST Dataset Loading...

MNIST Dataset Loaded, it took 1.37 seconds

Start Learning

[ Info: Training Machine{MulticlassPerceptronClassifier} @ 1…69.
training
Epoch: 50    Accuracy: 0.899
Learning took 38.6 seconds

Results:
Train accuracy:0.9355833333333333
Test accuracy:0.9265

real    1m0.674s
user    0m56.350s
sys 0m2.412s

Time without machine wrapping (time julia basic_usage_train.jl):

Loading data

MNIST Dataset Loading...

MNIST Dataset Loaded, it took 1.04 seconds

Start Learning

training
Epoch: 50    Accuracy: 0.899
Learning took 10.807 seconds

Results:
Train accuracy:0.9357333333333333
Test accuracy:0.9264

real    0m27.474s
user    0m26.021s
sys 0m1.234s

I am testing on:

(v1.2) pkg> status MLJ
    Status `~/.julia/environments/v1.2/Project.toml`
  [324d7699] CategoricalArrays v0.7.1
  [add582a8] MLJ v0.5.1
  [a7f614a8] MLJBase v0.7.1
  [bd369af6] Tables v0.2.11

tlienart commented 4 years ago

I think for now please ignore overheads for essentially two reasons:

Now, if you have a working MLJ interface in your perceptron package, great! Please open an issue in MLJModels; I'll check that I can successfully import it in MLJ and do the usual fit/predict, and assuming that's the case, I'll add it to the registry.

If you don't have further MLJ-specific questions, would you mind closing this issue? Otherwise, please don't hesitate to ask.

Thanks

PS: I've just had a quick look at your code:

# Function fit!(perceptron, X, y) expects size(X) = n_features x n_observations
if Xnew isa AbstractDataFrame
    Xnew = MLJBase.matrix(Xnew)
    Xnew = copy(Xnew')        # transpose: assumes the user passed examples as rows
elseif Xnew isa AbstractArray
    Xnew = MLJBase.matrix(Xnew)
elseif Tables.istable(Xnew)
    Xnew = copy(transpose(Tables.matrix(Xnew)))
end

please remove this; effectively MLJ does the testing for you, and you can assume in your function that X will be tabular data; you shouldn't do conversions or act differently for a DataFrame versus some other Table; so just remove all that and use

MLJBase.matrix(X, transpose=true)

but such details are better discussed in an issue in MLJModels (where there are plenty of examples you can use as a blueprint). As for your question regarding DataFrames, I suspect there's a confusion between what Tables are and what DataFrames are; note that istable(DataFrame()) is true.

davidbp commented 4 years ago

Thank you for your feedback. I'll open an issue in MLJModels.