JuliaAI / MLJText.jl

A an MLJ extension for accessing models and tools related to text analysis
MIT License
11 stars 1 forks source link

Samples no longer work #33

Open alunap opened 1 month ago

alunap commented 1 month ago

If I run the example code (any of them) I get a failure.

using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, tokenize.(docs))
MLJ.fit!(mach)

tfidf_mat = transform(mach, tokenize.(docs))

will fail:

ERROR: MethodError: no method matching tokenize(::String)

Closest candidates are:
  tokenize(::Type{S}, ::T) where {S<:Languages.Language, T<:AbstractString}
   @ TextAnalysis ~/.julia/packages/TextAnalysis/bdhyq/src/deprecations.jl:4
  tokenize(::S, ::T) where {S<:Languages.Language, T<:AbstractString}
   @ TextAnalysis ~/.julia/packages/TextAnalysis/bdhyq/src/tokenizer.jl:19

Stacktrace:
 [1] _broadcast_getindex_evalf
   @ ./broadcast.jl:709 [inlined]
 [2] _broadcast_getindex
   @ ./broadcast.jl:682 [inlined]
 [3] getindex
   @ ./broadcast.jl:636 [inlined]
 [4] copy
   @ ./broadcast.jl:942 [inlined]
 [5] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(tokenize), Tuple{Vector{String}}})
   @ Base.Broadcast ./broadcast.jl:903
 [6] top-level scope
   @ REPL[11]:1
ablaom commented 1 month ago

Thanks for posting.

The tokenize method is owned by TextAnalysis, and I see that TextAnalysis 0.8 no longer supports the version dispatching on strings. One workaround is to use TextAnalysis 0.7.5.

Posted a query here about what the correct method is now.

pazzo83 commented 1 month ago

It looks like it requires a "language" to be set, but it's not actually used: https://github.com/JuliaText/TextAnalysis.jl/blame/master/src/tokenizer.jl

I think one thing we can do is just use the tokenization from WordTokenzers directly. I can look at making this change.

ablaom commented 1 month ago

Thanks @pazzo83 for looking into this. This doesn't appear to affect MLJText tests, which are still passing (the compat for TextAnaysis includes the latest 0.8).

So maybe all that's needed is to update the docstrings.

ablaom commented 1 month ago

@pazzo83 Any chance of a PR to fix this?

pazzo83 commented 2 weeks ago

Sorry for the delay! I realized that we only need to update the Readme (since this is the code that is being referred to). I will be doing that.