dfdx / NaiveBayes.jl

Naive Bayes classifier
MIT License

Examples don't work #56

Closed - compleathorseplayer closed this issue 2 years ago

compleathorseplayer commented 2 years ago

I have tried to debug the examples (correcting as I go), but I can't get the following script to work:

using NaiveBayes
using RDatasets
using StatsBase

# Example 1
iris = dataset("datasets", "iris")

# observations in columns and variables in rows
#X = convert(Array, iris[:, 1:4])'
X = Matrix(iris[:, 1:4])'
p, n = size(X)
# by default species is a PooledDataArray,
y = [species for species in iris[:, 5]]
#y=iris[:,5]

# how much data use for training
train_frac = 0.9
k = floor(Int, train_frac * n)
#idxs = randperm(n)
idxs = 1:n
train = idxs[1:k]
test = idxs[k+1:end]

model = GaussianNB(unique(y), p)
fit(model, X[:, train], y[train])

accuracy = countnz(predict(model, X[:,test]) .== y[test]) / countnz(test)
println("Accuracy: $accuracy")

The error message is:

MethodError: no method matching fit(::GaussianNB{String}, ::Matrix{Float64}, ::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}})
Closest candidates are:
  fit(::GaussianNB, ::Matrix{Float64}, ::Vector{C}) where C at ~/.julia/packages/NaiveBayes/A4QCi/src/gaussian.jl:1
  fit(::StatisticalModel, ::Any...) at ~/.julia/packages/StatsBase/IPydo/src/statmodels.jl:178
  fit(::Type{D}, ::Any...) where D<:Distribution at ~/.julia/packages/Distributions/vR2pk/src/genericfit.jl:34
  ...

Stacktrace:
 [1] top-level scope
   @ In[67]:25
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

Thanks for any help!

ablaom commented 2 years ago

The stack trace suggests NaiveBayes does not directly support categorical arrays for the target, only Julia's native Vector type. (If so, for a more generic interface, this requirement could probably be relaxed to AbstractVector @pazzo83 @dfdx.)

In the meantime, I expect y = convert(Vector, y) should resolve your issue. Or you could use NaiveBayes through the MLJ interface and also simplify the preprocessing:

using MLJ, RDatasets

GNBC = @iload GaussianNBClassifier pkg=NaiveBayes # import model type
model = GNBC() # default instance
iris = dataset("datasets", "iris")
y, X = unpack(iris, ==(:Species)); # split off target and features

# bind data to model in "machine" and fit:
mach = machine(model, X, y) |> fit!

# predict:
julia> predict(mach, X)[3]
                    UnivariateFinite{Multiclass{3}}      
              ┌                                        ┐ 
       setosa ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1.0   
   versicolor ┤ 1.5522792360282905e-21                   
    virginica ┤ 3.3804400967572815e-36                   
              └                                        ┘
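The direct-interface workaround suggested above (converting the target to a plain Vector before fitting) can be sketched as follows. This is a minimal, self-contained stand-in: a `view` plays the role of the categorical column, since both are `AbstractVector`s but not `Vector`s; it has not been run against the package itself.

```julia
# Materialise the target as a plain Vector before fitting, so that it
# matches the fit(::GaussianNB, ::Matrix{Float64}, ::Vector{C}) method.
labels = ["setosa", "versicolor", "virginica"]
y_cat  = view(labels, 1:3)       # an AbstractVector, but NOT a Vector
y      = convert(Vector, y_cat)  # now a plain Vector{String}

# fit(model, X[:, train], y[train])  # should now dispatch correctly
```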
ablaom commented 2 years ago
yhat = predict(mach, X)

julia> pdf.(yhat, "setosa")[1:5]
5-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0

julia> pdf(yhat[1:3], levels(y))
3×3 Matrix{Float64}:
 1.0  4.91852e-26  2.98154e-41
 1.0  7.65581e-19  1.31103e-34
 1.0  1.55228e-21  3.38044e-36
dfdx commented 2 years ago

@compleathorseplayer Sorry for the low quality of the code on recent versions of Julia - I have tried to improve it a couple of times in recent years, but given the number of packages I now maintain and how far their domains are from NB.jl, each attempt ended with broken code and no new release.

@ablaom Given the high quality of MLJ, can I just redirect people seeking a Naive Bayes implementation there?

dfdx commented 2 years ago

I've updated example 1 from the README, so at least it's now runnable. Still, my previous statement stands: more issues with using this package directly are to be expected.

ablaom commented 2 years ago

@dfdx Perhaps there is some confusion. The MLJ naive Bayes models are just wrappers around the models provided by this package. So all roads currently lead to @dfdx, I'm afraid 😄.

I can't make any solid commitment to maintaining this package, but I'm happy to step in from time to time, as I have done in the past.

Would you like me to add a section to the README.md, "Using NaiveBayes via the MLJ interface"?

dfdx commented 2 years ago

Pardon, I wasn't clear! I know that MLJ "just" wraps NB.jl, but the quality of the wrapper - the consistency of the API, the docs, etc. - definitely exceeds that of NB.jl itself. Moreover, people usually don't reach for Naive Bayes alone, but rather try out several algorithms for their task, which is exactly what MLJ supports.

So I thought of adding a quick note at the top of the README pointing the casual user to MLJ as a better way to use NaiveBayes.jl. If you also have time to add an example as you suggested, that would be even better!

compleathorseplayer commented 2 years ago

Hi, thanks very much - I do appreciate this and your work on the package. Sometimes being able to carry a screwdriver on its own (rather than an entire toolbox) for a single task is useful. I was just asking whether, by chance, the iris example above could be updated to use the native version of your package (did you post an updated version of 'iris'? I could not find it).

Having the MLJ option is great, but (as a teacher) learning and remembering the conventions of MLJ requires human memory overhead which, while arguably worth it if one wants to try a lot of methods, may not always be the most convenient if one only wants to quickly try one thing. [By the way, I am confused: is NB.jl the same as NaiveBayes.jl?]

Anyway, as I said, I am grateful, and if you find time to answer my original query about the 'iris' example in the 'examples' directory, that would be great - but I'll understand if you choose not to.

Thanks again

ablaom commented 2 years ago

@dfdx All good. Happy for you to refer users to MLJ. I'll see what I can do re an example in the README. I'm currently working on a project to give all MLJ models a detailed, standard docstring; I'll probably work that in at the same time.

dfdx commented 2 years ago

@compleathorseplayer Mea culpa, I didn't pay enough attention and thought that by the "first example" you meant the one from the README, not examples/iris.jl :facepalm: I have fixed the iris example and pushed it to master.

The error you got was caused by NaiveBayes.jl (which I indeed refer to as just NB.jl for brevity) using concrete types instead of proper abstract types. The fix was to replace:

fit(m::GaussianNB, X::MatrixContinuous, y::Vector{C}) where C

with

fit(m::GaussianNB, X::MatrixContinuous, y::AbstractVector{C}) where C

I left all other definitions intact though, since changing them would require rethinking type aliases and creating proper tests, which is unfortunately out of my capacity now.
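The effect of this signature change can be illustrated with a minimal, self-contained sketch (the function names here are hypothetical, chosen only to contrast the two signatures). A method written for the concrete `Vector{C}` rejects every other `AbstractVector` subtype, while `AbstractVector{C}` accepts them all; a `SubArray` stands in for the `CategoricalVector` from the original error, since both fail the concrete signature in the same way.

```julia
# Two toy "fit" methods differing only in how the target is typed.
fit_concrete(y::Vector{C}) where {C} = length(y)          # old signature
fit_abstract(y::AbstractVector{C}) where {C} = length(y)  # fixed signature

labels = ["setosa", "versicolor", "setosa"]
v = view(labels, 1:3)   # a SubArray: an AbstractVector, but not a Vector

fit_abstract(v)         # dispatches fine
# fit_concrete(v)       # MethodError: no method matching fit_concrete(::SubArray{...})
```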

compleathorseplayer commented 2 years ago
using NaiveBayes
using RDatasets
using StatsBase
using Random

# Example 1
iris = dataset("datasets", "iris")

# observations in columns and variables in rows
X = Matrix(iris[:,1:4])'
p, n = size(X)
# by default species is a PooledDataArray,
y = [species for species in iris[:, 5]]

# how much data use for training
train_frac = 0.9
k = floor(Int, train_frac * n)
idxs = randperm(n)
train_idxs = idxs[1:k]
test_idxs = idxs[k+1:end]

model = GaussianNB(unique(y), p)
fit(model, X[:, train_idxs], y[train_idxs])

accuracy = count(!iszero, predict(model, X[:,test_idxs]) .== y[test_idxs]) / count(!iszero, test_idxs)
println("Accuracy: $accuracy")

resulted in:

MethodError: no method matching fit(::GaussianNB{String}, ::Matrix{Float64}, ::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}})
Closest candidates are:
  fit(::StatisticalModel, ::Any...) at ~/.julia/packages/StatsBase/IPydo/src/statmodels.jl:178
  fit(::GaussianNB, ::Matrix{Float64}, ::Vector{C}) where C at ~/.julia/packages/NaiveBayes/A4QCi/src/gaussian.jl:1
  fit(::Type{Histogram}, ::Any...; kwargs...) at ~/.julia/packages/StatsBase/IPydo/src/hist.jl:383
  ...

Stacktrace:
 [1] top-level scope
   @ In[24]:24
 [2] eval
   @ ./boot.jl:373 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1196

Am I missing something here? I am really keen to use this.

dfdx commented 2 years ago

You also need to use the codebase from master, e.g. via `] add NaiveBayes#master` in the Pkg REPL, or by activating the cloned package's environment. For example:

cd path/to/cloned/NaiveBayes.jl
julia

# press ] to enter the Pkg REPL, then:
] activate .
# press backspace to return to the Julia REPL, then:
include("examples/iris.jl")
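Alternatively, without cloning the repository, the unreleased master branch can be installed directly via the Pkg API (a sketch; the repository URL is inferred from the project header, and this step requires network access):

```julia
using Pkg

# Install NaiveBayes.jl directly from its master branch,
# equivalent to `] add NaiveBayes#master` in the Pkg REPL.
Pkg.add(url="https://github.com/dfdx/NaiveBayes.jl", rev="master")
```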
compleathorseplayer commented 2 years ago

Success! Thanks!

ablaom commented 2 years ago

@dfdx Thanks for fixing the concrete typing ❤️

It looks like tests are passing on master (the only failure is documentation). Could you please tag a new release?

compleathorseplayer commented 2 years ago

Hi - sorry, I'm confused: I got a notification - are you asking me to do something, or are you asking @dfdx?

dfdx commented 2 years ago

I think it was a question for me. And yes, since the mentioned example worked, I will tag a new release as soon as I get to my computer.

compleathorseplayer commented 2 years ago

Great! Thanks so much. I will be using it in my lectures and assignments next week :-)

dfdx commented 2 years ago

The new version has been released.

compleathorseplayer commented 2 years ago

Wonderful! Thanks so much!