JuliaAI / MLJText.jl

An MLJ extension providing models and tools for text analysis
MIT License

initial commit of tfidf transformer #1

Closed pazzo83 closed 2 years ago

pazzo83 commented 3 years ago

Initial commit - a few notes

This is definitely in a more raw state than the TSVD code - so please let me know what you think should be changed/removed/added. Thanks!

storopoli commented 3 years ago

Since it is a private repo we can only have one reviewer per PR. I am glad to be an unofficial reviewer once this PR is up for review.

ablaom commented 3 years ago

@pazzo83 Thanks again for this! I wonder if we could just ignore the type issue for a moment (I am working on it) and have something here that runs. Currently I'm not able to load the module. It looks like you forgot to qualify StringDocument with TextAnalysis.StringDocument or something.

julia> using MLJText
[ Info: Precompiling MLJText [5e27fcf9-6bac-46ba-8580-b5712f3d6387]
ERROR: LoadError: UndefVarError: StringDocument not defined
Stacktrace:
 [1] top-level scope
   @ ~/MLJ/MLJText/src/MLJText.jl:79
 [2] include
   @ ./Base.jl:386 [inlined]
 [3] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, 
dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)                                                
   @ Base ./loading.jl:1213
 [4] top-level scope
   @ none:1
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] eval(x::Expr)
   @ Base.MainInclude ./client.jl:446
 [7] top-level scope
   @ none:1
in expression starting at /Users/anthony/MLJ/MLJText/src/MLJText.jl:1
ERROR: Failed to precompile MLJText [5e27fcf9-6bac-46ba-8580-b5712f3d6387] to /Users/anthony/.julia/compiled/v1.6/MLJText/jl_JBCoIn.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)                                                                                 
   @ Base ./loading.jl:1360
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1306
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1021
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:914
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:901
pazzo83 commented 3 years ago

Thanks for taking a look! I did a little refactoring and I think I fixed that issue with the TextAnalysis-related types. Do you think we need to change anything in Project.toml for some of the latest versions of the dependencies we have (e.g. MLJModelInterface, etc)?

ablaom commented 3 years ago

You can dump Distributions from [extras] (unless you need it). We'll want the latest version of ScientificTypesBase after I add the text types, but it could go to 2 for now.

pazzo83 commented 3 years ago

Alright I updated the Project file - I am going to work on some tests now.

ablaom commented 3 years ago

@pazzo83 Following up on your slack comment. I guess, in addition to documents (in the sense of this comment), or instead, we should allow bags of ngrams as permissible input to the tfidf transformer?

A "bag of ngrams" could just be a dictionary of integers keyed on abstract strings, which is how TextAnalysis.NGramDocument represents them. The downside of this is that, without the metadata, we don't know what n is. I'm not so keen on passing this information around as metadata. It's analogous to passing around the "total number of classes" when representing categorical data using integers: it complicates the overall design (and we use CategoricalArrays in MLJ for this reason).

So, better, would be to encode an ngram as an n-tuple, such as ("my", "goodness"). Under the current scitype convention, we guarantee x is an ngram by insisting that scitype(x) = NTuple{n,Textual} (ignoring tagged words for now). So a bag of ngrams is an AbstractDict{<:NTuple{<:Any,<:AbstractString},<:Integer}, corresponding to scitype Multiset{NTuple{n,Textual}}.
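For concreteness, a tuple-keyed bag of ngrams might look like this (a sketch; the data and names are illustrative):

```julia
# Bag of bigrams keyed on 2-tuples of strings: the tuple length carries n,
# so no extra metadata needs to be passed around
bag = Dict(("the", "cat") => 1,
           ("cat", "sat") => 2)

keytype(bag)                                  # Tuple{String, String}
all(k -> k isa NTuple{2, String}, keys(bag))  # every key is a bigram
```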

Presumably converting an NGramDocument to a dictionary keyed on tuples would not be problematic. We could introduce a coerce(::NGramDocument, Multiset) method to ScientificTypes, like we do for images (https://juliaai.github.io/ScientificTypes.jl/dev/#Type-coercion-for-image-data).

Thoughts?

pazzo83 commented 3 years ago

@ablaom That sounds good to me - the only issue is that both the TextAnalysis implementation of NGramDocument and scikit-learn's Tfidf implementation allow multiple values of n for ngrams. In other words, you could have a situation where your bag of words has both unigrams and bigrams. This is done in TextAnalysis with something like NGramDocument(document_text, 1, 2).

ablaom commented 3 years ago

Could we not handle it this way: if NGram{N} is an alias for NTuple{N,Textual}, then allow any input X with scitype(X) <: Multiset{Union{Textual,NGram}}? Such X would be a bag of ngrams of possibly inhomogeneous lengths, and would allow unigrams to be represented either as AbstractString or as length-one tuples of AbstractString.
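Such an inhomogeneous bag could be written as follows (a sketch; because the keys mix strings and tuples of different lengths, Julia widens the key type to Any):

```julia
# Bag mixing a unigram-as-string, a unigram-as-1-tuple, and a bigram;
# under the proposed convention this has scitype Multiset{Union{Textual, NGram}}
bag = Dict("the"          => 2,
           ("cat",)       => 1,
           ("the", "cat") => 1)

keytype(bag)  # Any: there is no common concrete key type
```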

pazzo83 commented 3 years ago

@ablaom ok that sounds good - I will get something working with that type of input.

pazzo83 commented 3 years ago

@ablaom if I do something like d = Dict(Tuple(String.(split(k))) => v for (k, v) in ngrams(ngram_doc)) (where ngram_doc was created via: ngram_doc = NGramDocument(doc, 1, 2)) to do this conversion, then the type I get back is: Dict{Tuple{String, Vararg{String, N} where N}, Int64}. Is that Tuple type more or less what we're looking for?
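Run against a stand-in dictionary (hedged: in TextAnalysis, ngrams(ngram_doc) returns a Dict{String,Int} keyed on space-joined ngrams, which the literal below imitates), the conversion looks like:

```julia
# Stand-in for ngrams(NGramDocument(doc, 1, 2)): unigrams and bigrams,
# keyed on space-joined strings
raw = Dict("the" => 2, "cat" => 1, "the cat" => 1)

# Re-key on tuples so the tuple length carries n
d = Dict(Tuple(String.(split(k))) => v for (k, v) in raw)

keytype(d)  # Tuple{String, Vararg{String}}: tuples of length >= 1
```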

ablaom commented 3 years ago

Yes, that's right.

But I see I shall have to update the scitype implementation. For while I have

julia> scitype(ngrams(ngram_doc))
Multiset{Textual}

I'm getting

julia> scitype(d)
Unknown
ablaom commented 3 years ago

Now fixed in ScientificTypes 2.2.2:

https://github.com/JuliaAI/ScientificTypes.jl/blob/39e7aab07fb08587a1ca119b1b3894a094deeecc/test/scitypes.jl#L260

pazzo83 commented 3 years ago

@ablaom I am kind of stuck on resolving this.
If I define the NGram alias like this: const NGram{N} = Tuple{Vararg{Textual, N} where N}, then that type doesn't match the type from the variable length bag of words dictionary I referenced above (where the tuple type was Tuple{String, Vararg{String, N} where N}).

However, if I then define the alias like: const NGram{N} = Tuple{Textual, Vararg{Textual, N}} where N, then you actually get a tuple of length N + 1 (because of the initial Textual type listed).

Defining the alias like: const NGram{N} = NTuple{N, Textual} likewise does not match the type from the bag of words dictionary.

Do you have any suggestions as to how to proceed?
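The mismatches above come down to invariance of Julia's type parameters: a Dict key type must match the alias exactly unless a <: is inserted. A sketch, using concrete String in place of Textual (an assumption for illustration):

```julia
# Key type actually produced by the tuple-rekeying comprehension:
K = Tuple{String, Vararg{String}}       # tuples of length >= 1
# One candidate alias:
NGramAlias = Tuple{Vararg{String}}      # tuples of any length, including 0

K <: NGramAlias                          # true: tuple types are covariant
Dict{K, Int} <: Dict{NGramAlias, Int}    # false: Dict is invariant in its parameters
Dict{K, Int} <: Dict{<:NGramAlias, Int}  # true: an explicit <: restores the match
```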

pazzo83 commented 3 years ago

@ablaom I've made some progress, but still running into type issues. I have an alias: const NGram{N} = NTuple{N, <:AbstractString}

Then, I coerce my bag of words dictionary from the NGramDocument as follows: d = Dict{NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in ngrams(ngram_doc))

This gives me a scitype of: ScientificTypesBase.Multiset{Tuple{Vararg{Textual, var"#s13"}} where var"#s13"}

However, when I try to associate this with my transformer (e.g. test = machine(tfidf_transformer, [d])), I get the following warning:

┌ Warning: The scitype of `X`, in `machine(model, X)` is incompatible with `model=TfidfTransformer @498`:
│ scitype(X) = AbstractVector{ScientificTypesBase.Multiset{Tuple{Vararg{Textual, var"#s13"}} where var"#s13"}}
│ input_scitype(model) = Union{AbstractVector{AbstractVector{Textual}}, AbstractVector{ScientificTypesBase.Multiset{Union{Textual, Tuple{Vararg{var"#s32", N}} where {N, var"#s32"<:AbstractString}}}}}

I'm not quite sure how to resolve this issue.

ablaom commented 3 years ago

@pazzo83 Thanks for hanging in there. This must be a little frustrating.

I don't see NTuple anywhere in your current input_scitype declaration. The following works for me:

bag_of_words = Dict("cat in" => 1,
                    "the hat" => 1,
                    "the" => 2,
                    "cat" => 1,
                    "hat" => 1,
                    "in the" => 1,
                    "in" => 1,
                    "the cat" => 1)
bag = Dict(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)

const ScientificNGram{N} = NTuple{<:Any,Textual} 

input_scitype = Union{AbstractVector{<:AbstractVector{Textual}},
          AbstractVector{<:Multiset{<:ScientificNGram}}}
scitype([bag, ]) <: input_scitype   # true

So if you add the above input_scitype in your metadata declaration, you should be good.

pazzo83 commented 3 years ago

@ablaom I've pushed the changes I've been working on. I can replicate the results of the code you posted, but I still get warnings when doing the following:

using MLJ

bag_of_words = Dict("cat in" => 1,
                           "the hat" => 1,
                           "the" => 2,
                           "cat" => 1,
                           "hat" => 1,
                           "in the" => 1,
                           "in" => 1,
                           "the cat" => 1
)
bag = Dict{MLJText.NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)
tfidf_transformer = MLJText.TfidfTransformer()
test_machine = machine(tfidf_transformer, [bag]) # Warning
# Warning: The scitype of `X`, in `machine(model, X)` is incompatible with `model=TfidfTransformer @802`:
# │ scitype(X) = AbstractVector{Multiset{Tuple{Vararg{Textual, var"#s2"}} where var"#s2"}}
# │ input_scitype(model) = Union{AbstractVector{var"#s32"} where var"#s32"<:AbstractVector{Textual},  AbstractVector{var"#s31"} where var"#s31"<:(Multiset{var"#s30"} where var"#s30"<:(Tuple{Vararg{var"#s31", var"#s32"}} where {var"#s32", var"#s31"<:AbstractString})), AbstractVector{var"#s3"} where var"#s3"<:Multiset{Textual}}.
fit!(test_machine)

There is some issue with how the parametric type from my NGram alias is resolved by ScientificTypes.

ablaom commented 3 years ago

😢 I'll look into this on Monday. Thanks for your work here.

storopoli commented 3 years ago

@ablaom Let me know how I can assist

ablaom commented 3 years ago

@pazzo83 I've fixed this now. We had NGram where ScientificNGram was needed. I added @test_logs tests that ensure no warnings are thrown.

pazzo83 commented 3 years ago

@ablaom this is great - thank you! I definitely appreciate that we are taking the time to get this right, as I really like this structure/design. It is much more flexible than what you have with scikit-learn's implementation. Regarding the change, is it OK if we keep my NGram type alias for the main code of the transformer (used for multiple dispatch purposes with the various methods) and then use the ScientificNGram alias you have added for purposes of having it resolve to the proper scitype?

ablaom commented 3 years ago

is it OK if we keep my NGram type alias for the main code of the transformer (used for multiple dispatch purposes with the various methods) and then use the ScientificNGram alias you have added for purposes of having it resolve to the proper scitype?

Not quite sure I understand the question. I just added the ScientificNGram alias to improve readability and swapped that one NGram out for ScientificNGram to make the code correct. Elsewhere your existing NGram definition is consistent with its use, as far as I can tell.

Are you suggesting exporting one or both aliases?

pazzo83 commented 3 years ago

@ablaom - no, I just wanted to clarify that we were keeping both aliases in the code itself, so that answers my question! Do we think this is ready to go?

pazzo83 commented 3 years ago

Was just testing this out on some documents locally, and I was able to do this:

convert_to_tuple_form(bag_of_words::Dict{<:AbstractString, Int}) = 
    Dict{MLJText.NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)

pipe = @pipeline(
    X -> ngrams.(X),
    X -> convert_to_tuple_form.(X),
    MLJText.TfidfTransformer(min_doc_freq = 0.13, max_doc_freq = 0.80),
    TSVDTransformer(nvals=500, maxiter=5000),
    SVC
)

and it worked beautifully! I suspect we should add some sort of coerce method - where would it be appropriate to add that?

ablaom commented 3 years ago

@pazzo83 I should like to make a proper review of the code, which I have not yet done in any detail, and will try to get to that early next week at the latest.

A coerce method would live at ScientificTypes here. As an example, you can look at what is offered for images in these docs.

I think we should distinguish conversions that alter the representation of an object so as to conform to a convention (which is what scitype is for) from real "transformations" of the data, for which we should have MLJ transformers, which could be Static or Unsupervised. So, for example, I'd classify your method above as a transformer, because of the tokenization (the use of split). You may want to use a different kind of tokenization, which means specifying options, and scitype isn't designed for this. Does this make sense?
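As a sketch of the transformer alternative (hypothetical: the type name, option, and tokenization below are illustrative, not part of this PR), a Static MLJ transformer carries its options in the struct but learns nothing from data:

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# A state-free ("Static") transformer: tokenization options live in the
# struct; there is no fit step and no learned state
mutable struct SimpleTokenizer <: MMI.Static
    lowercase::Bool
end
SimpleTokenizer(; lowercase=true) = SimpleTokenizer(lowercase)

function MMI.transform(model::SimpleTokenizer, _, docs)
    prep = model.lowercase ? Base.lowercase : identity
    return [String.(split(prep(doc))) for doc in docs]
end
```

With machines, one would write mach = machine(SimpleTokenizer()) (no training data needed) and then transform(mach, docs).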

With this in mind, I'd like to hear what coerce methods you see as useful and, if necessary, what scientific type aliases we might want to introduce to reduce user head-scratching. Perhaps NGram (the scientific version called ScientificNGram in the latest code) would be helpful. (That said, we introduced Binary = Finite{2} as an alias, which I regret, as it's really not needed.)

Perhaps @storopoli cares to chip in here too.

I don't know if this was smart, but unlike Base.convert, the coerce method works like this: you specify coerce(array, S), where S is the scitype you would like the elements of array to have. This is convenient for the use-cases we initially had in mind. So if you have a vector of Dict{String, Int64} that you want to coerce to unigrams, you might do

coerce(vec_of_bags_of_words, Multiset{NGram{1}})

However, for images we did break this pattern and I'm wondering if it is necessary here as well.

Thanks for the positive feedback and enthusiasm. I'm only sorry I don't have more time to support these enhancements. 😄

storopoli commented 3 years ago

Do we have a coerce! to signal the user that this kind of coercion will have side-effects? I think that we should at least cover all scikit-learn text feature extraction features:

This should be the target for a simple Text Feature Extraction API

ablaom commented 3 years ago

Do we have a coerce! to signal the user that this kind of coercion will have side-effects? I think that we should at least cover all scikit-learn text feature extraction features:

coerce! was introduced for coercing tabular data for those tables that support in-place replacement of columns. Because the generic Tables.jl API does not require tables to support this, it must be implemented case-by-case, and it is only implemented for DataFrames at present.

ablaom commented 3 years ago

@pazzo83

I'm trying to understand the choice of representation for the sparse matrix returned by this transformer. It seems we are returning an n x p matrix of type SparseMatrixCSC, where n is the number of documents and p the number of tokens. This shape is consistent with MLJ conventions, which is good.

However, is it not typical that p >> n? In that case, I would have thought the adjoint of a p x n SparseMatrixCSC was preferred, for memory reasons, unless we wanted slicing on tokens to be faster than slicing on documents (which sounds strange to me):

julia> a = sparse(rand([1, 2], 10), rand(1:1000, 10), 1:10:100, 2, 10000) |> Base.summarysize
80328

julia> a = adjoint(sparse(rand(1:1000, 10), rand([1, 2], 10), 1:10:100, 10000, 2)) |> Base.summarysize
384

I see that scikit-learn's TfidfTransformer, which uses the CSR format, uses n x p. If they have it right, then we ought to use the adjoint of a p x n matrix, since we have the other compression format, no?

I do see that what we return is consistent with TextAnalysis.DocumentTermMatrix, which I guess is the reason we have made this choice.
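The stdlib SparseArrays makes the point concrete: CSC compresses along columns, so storing documents as the columns of a p x n matrix and returning its adjoint gives an n x p object with cheap per-document slices (a sketch; the sizes and entries are arbitrary):

```julia
using SparseArrays, LinearAlgebra

p, n = 100_000, 3              # many tokens, few documents
# p×n CSC matrix: one stored column per document
A = sparse([1, 2, 7], [1, 2, 3], [1.0, 2.0, 3.0], p, n)
X = adjoint(A)                 # n×p view with CSR-like (row-major) access

size(X)             # (3, 100000)
X[2, :] == A[:, 2]  # a document slice reads one stored column
```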

pazzo83 commented 3 years ago


I definitely see the reasoning behind transposing what we return - as you mention, p is often much larger than n (which is why you often see some sort of dimensionality reduction applied after you do TF-IDF). I guess I would just want to understand better how this would work. We get the DTM (document term matrix) from TextAnalysis as n x p - would we flip that and then build the TF-IDF matrix? I can experiment with some different approaches. However, essentially what you are suggesting here is we build the matrix as p x n and then return the adjoint of that?

ablaom commented 3 years ago

I can experiment with some different approaches. However, essentially what you are suggesting here is we build the matrix as p x n and then return the adjoint of that?

That is what I am suggesting. However, I think we should probably find out if there is a reason for the TextAnalysis choice. I personally don't have much hands-on experience in this area, and that's why I ask. Also, if we use TextAnalysis as is but follow my suggestion, there will be a performance hit, as we are effectively changing the axis of index compression. (It would be better to re-implement the TextAnalysis method.) I'll open a query on the natural-language slack channel.

pazzo83 commented 3 years ago


Yeah - as I've been going through the code, I think it would just be easier to re-implement the DTM ourselves, at least for now.

ablaom commented 3 years ago

@pazzo83

Thanks again for this work.

I think my comments above at least cover the MLJ-API related stuff. A last suggestion is that we separate the transformer (and the corresponding test file) into separate files, but it's probably less messy if we do that after merging this PR.

Since @storopoli has offered to help out, and I expect he has more NLP knowledge than I do, I suggest we see if @storopoli can do a formal code review when you're ready. If it's possible, I should like to limit my involvement to the integration side of things.

Are you happy for me to make the repo public to facilitate a review?

pazzo83 commented 3 years ago

@ablaom That sounds great! Thanks for all your help here!

ablaom commented 3 years ago

Done.

pazzo83 commented 3 years ago

Ok @ablaom @storopoli - I made a few additional changes. I updated the deps - is it ok to have MLJModelInterface set to just 1.3? I'll use whatever the convention is there (figured I'd bump it up to at least that since 1.3.2 is the current version). I also implemented the switch in orientation of the actual TF-IDF matrix that we discussed. It does seem to be quite a bit faster - and works just fine in some of the local tests I have.

pazzo83 commented 3 years ago

Also, you all might notice that we keep failing in the Julia 1.3 check. I was doing some testing with different Julia versions (thank you Google Colab), and the issue appears to be how I am specifying types in the build_corpus method. As you see, I've defined more or less a type alias for NGrams as such:

const NGram{N} = NTuple{<:Any,<:AbstractString}

Then I use that type alias for multiple dispatch in a function that constructs the proper TextAnalysis.Corpus object:

_convert_bag_of_words(X::Dict{NGram, Int}) = 
    Dict(join(k, " ") => v for (k, v) in X)

build_corpus(X::Vector{Dict{NGram, Int}}) = 
    build_corpus(_convert_bag_of_words.(X))
build_corpus(X::Vector{Dict{S, Int}}) where {S <: AbstractString} = 
    Corpus(NGramDocument.(X))
build_corpus(X) = Corpus(TokenDocument.(X))

In versions prior to Julia 1.5, the compiler blows up when it tries to deal with a Dict with the NGram type alias as a key. So something like this:

test_dict2 = Dict{NGram, Int}(("a", "b") => 1, ("c", "d", "e") => 2)

causes Julia to crash. However, starting in 1.5, it does just fine. Should we bump up the Julia version in Compat to 1.5?

ablaom commented 3 years ago

I updated the deps - is it ok to have MLJModelInterface set to just 1.3?

Yep, that's fine.

Re your failure, I'm guessing you have some missing <:. So what if you try:

_convert_bag_of_words(X::Dict{<:NGram, <:Integer}) = 
    Dict(join(k, " ") => v for (k, v) in X)

build_corpus(X::Vector{<:Dict{<:NGram, <:Integer}}) = 
    build_corpus(_convert_bag_of_words.(X))
build_corpus(X::Vector{<:Dict{S, <:Integer}}) where {S <: AbstractString} = 
    Corpus(NGramDocument.(X))
build_corpus(X) = Corpus(TokenDocument.(X))
codecov-commenter commented 2 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (dev@7349f5c). The diff coverage is n/a.

@@          Coverage Diff           @@
##             dev       #1   +/-   ##
======================================
  Coverage       ?   80.48%           
======================================
  Files          ?        1           
  Lines          ?       82           
  Branches       ?        0           
======================================
  Hits           ?       66           
  Misses         ?       16           
  Partials       ?        0           
