Closed pazzo83 closed 2 years ago
Since it is a private repo we can only have 1 reviewer per PR. I am glad to be an unofficial reviewer once this PR is up for review.
@pazzo83 Thanks again for this! I wonder if we could just ignore the type issue for a moment (I am working on it) and have something here that runs. Currently I'm not able to load the module. It looks like you forgot to qualify StringDocument
with TextAnalysis.StringDocument
or something.
julia> using MLJText
[ Info: Precompiling MLJText [5e27fcf9-6bac-46ba-8580-b5712f3d6387]
ERROR: LoadError: UndefVarError: StringDocument not defined
Stacktrace:
[1] top-level scope
@ ~/MLJ/MLJText/src/MLJText.jl:79
[2] include
@ ./Base.jl:386 [inlined]
[3] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String},
dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
@ Base ./loading.jl:1213
[4] top-level scope
@ none:1
[5] eval
@ ./boot.jl:360 [inlined]
[6] eval(x::Expr)
@ Base.MainInclude ./client.jl:446
[7] top-level scope
@ none:1
in expression starting at /Users/anthony/MLJ/MLJText/src/MLJText.jl:1
ERROR: Failed to precompile MLJText [5e27fcf9-6bac-46ba-8580-b5712f3d6387] to /Users/anthony/.julia/compiled/v1.6/MLJText/jl_JBCoIn.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::Base.TTY, internal_stdout::Base.TTY)
@ Base ./loading.jl:1360
[3] compilecache(pkg::Base.PkgId, path::String)
@ Base ./loading.jl:1306
[4] _require(pkg::Base.PkgId)
@ Base ./loading.jl:1021
[5] require(uuidkey::Base.PkgId)
@ Base ./loading.jl:914
[6] require(into::Module, mod::Symbol)
@ Base ./loading.jl:901
Thanks for taking a look! I did a little refactoring and I think I fixed that issue with the TextAnalysis-related types. Do you think we need to change anything in Project.toml for some of the latest versions of the dependencies we have (e.g. MLJModelInterface, etc)?
You can dump Distributions from [extras]
(unless you need it). We'll want the latest version of ScientificTypesBase after I add the text types, but it could go to 2 for now.
Alright I updated the Project file - I am going to work on some tests now.
@pazzo83 Following up on your slack comment. I guess, in addition to documents (in the sense of this comment) (or instead??), we should allow bags of ngrams as permissible input to the tfidf transformer?
A "bag of ngrams" could just be a dictionary of integers keyed on abstract strings, which is how TextAnalysis.NGramDocument
represents them. The down side of this is that, without the metadata, we don't know what n
is. I'm not so keen on passing this information around as metadata. It's kind of analogous to passing around "total number of classes" when representing categorical data using integers. It complicates the overall design (and we use CategoricalArrays in MLJ for this reason).
So, better, would be to encode an ngram as an n-tuple, such as ("my", "goodness"). Under the current scitype convention, we guarantee x is an ngram by insisting that scitype(x) = NTuple{n,Textual} (ignoring tagged words for now). So a bag of ngrams is an AbstractDict{<:NTuple{<:Any,<:AbstractString},<:Integer}, corresponding to scitype Multiset{NTuple{n,Textual}}.
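For illustration (the counts are made up), such a bag of bigrams keyed on tuples might look like this in plain Julia:

```julia
# Illustration only: a bag of bigrams represented as a Dict keyed on
# 2-tuples of strings, as proposed above.
bag_of_bigrams = Dict{NTuple{2,String}, Int}(
    ("my", "goodness") => 1,
    ("the", "cat")     => 3,
    ("cat", "sat")     => 2,
)

# n is recoverable from the key type itself, with no extra metadata:
keytype(bag_of_bigrams)  # Tuple{String, String}
```

The point of the tuple representation is visible in the last line: n travels with the type, not as side-channel metadata.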
Presumably converting an NGramDocument
to a dictionary keyed on tuples would not be problematic. We could introduce a coerce(::NGramDocument, Multiset)
method to ScientificTypes, like we do for [images](https://juliaai.github.io/ScientificTypes.jl/dev/#Type-coercion-for-image-data).
Thoughts?
@ablaom That sounds good to me - the only issue is that both the TextAnalysis implementation of NGramDocument and scikit-learn's Tfidf implementation allow multiple values of n for ngrams. In other words, you could have a situation where your bag of words has both unigrams and bigrams. This is done in TextAnalysis with something like NGramDocument(document_text, 1, 2).
Could we not handle it this way: if NGram{N} is an alias for NTuple{N,Textual}, then allow any input X with scitype(X) <: Multiset{Union{Textual,NGram}}? Such X would be a bag of ngrams of possibly inhomogeneous lengths, and allows unigrams to be represented either as AbstractString or as length-one tuples of AbstractString.
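A hypothetical bag of this inhomogeneous kind (with unigrams written as length-one tuples) could be sketched as:

```julia
# Hypothetical bag of ngrams of inhomogeneous lengths: unigrams as
# length-one tuples alongside a bigram, all counted in one Dict.
mixed_bag = Dict{Tuple{Vararg{String}}, Int}(
    ("the",)       => 2,
    ("cat",)       => 1,
    ("the", "cat") => 1,
)

# Keys of different lengths coexist under the common tuple key type.
sort(unique(map(length, collect(keys(mixed_bag)))))  # [1, 2]
```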
@ablaom ok that sounds good - I will get something working with that type of input.
@ablaom if I do something like d = Dict(Tuple(String.(split(k))) => v for (k, v) in ngrams(ngram_doc))
(where ngram_doc
was created via: ngram_doc = NGramDocument(doc, 1, 2)
) to do this conversion, then the type I get back is: Dict{Tuple{String, Vararg{String, N} where N}, Int64}
. Is that Tuple type more or less what we're looking for?
Yes, that's right.
But I see I shall have to update the scitype
implementation. For while I have
julia> scitype(ngrams(ngram_doc))
Multiset{Textual}
I'm getting
julia> scitype(d)
Unknown
Now fixed in ScientificTypes 2.2.2:
@ablaom I am kind of stuck on resolving this.
If I define the NGram
alias like this:
const NGram{N} = Tuple{Vararg{Textual, N} where N}
, then that type doesn't match the type from the variable length bag of words dictionary I referenced above (where the tuple type was Tuple{String, Vararg{String, N} where N}
).
However, if i then define the alias like:
const NGram{N} = Tuple{Textual, Vararg{Textual, N}} where N
, then you actually get an N
+ 1 length tuple (because of the initial Textual
type listed).
Defining the alias like: const NGram{N} = NTuple{N, Textual}
likewise does not match the type from the bag of words dictionary.
Do you have any suggestions as to how to proceed?
@ablaom I've made some progress, but still running into type issues. I have an alias:
const NGram{N} = NTuple{N, <:AbstractString}
Then, I coerce my bag of words dictionary from the NGramDocument
as follows:
d = Dict{NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in ngrams(ngram_doc))
This gives me a scitype of:
ScientificTypesBase.Multiset{Tuple{Vararg{Textual, var"#s13"}} where var"#s13"}
However, when I try to associate this with my transformer (e.g. test = machine(tfidf_transformer, [d])
), I get the following warning:
┌ Warning: The scitype of `X`, in `machine(model, X)` is incompatible with `model=TfidfTransformer @498`:
│ scitype(X) = AbstractVector{ScientificTypesBase.Multiset{Tuple{Vararg{Textual, var"#s13"}} where var"#s13"}}
│ input_scitype(model) = Union{AbstractVector{AbstractVector{Textual}}, AbstractVector{ScientificTypesBase.Multiset{Union{Textual, Tuple{Vararg{var"#s32", N}} where {N, var"#s32"<:AbstractString}}}}}
I'm not quite sure how to resolve this issue.
@pazzo83 Thanks for hanging in there. This must be a little frustrating.
I don't see NTuple
anywhere in your current input_scitype
declaration. The following works for me:
bag_of_words = Dict("cat in" => 1,
"the hat" => 1,
"the" => 2,
"cat" => 1,
"hat" => 1,
"in the" => 1,
"in" => 1,
"the cat" => 1)
bag = Dict(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)
const ScientificNGram{N} = NTuple{<:Any,Textual}
input_scitype = Union{AbstractVector{<:AbstractVector{Textual}},
AbstractVector{<:Multiset{<:ScientificNGram}}}
scitype([bag, ]) <: input_scitype # true
So if you add the above input_scitype
in your metadata declaration, you should be good.
@ablaom I've pushed the changes I've been working on. I can replicate the results of the code you posted, but I still get warnings when doing the following:
using MLJ
bag_of_words = Dict("cat in" => 1,
"the hat" => 1,
"the" => 2,
"cat" => 1,
"hat" => 1,
"in the" => 1,
"in" => 1,
"the cat" => 1
)
bag = Dict{MLJText.NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)
tfidf_transformer = MLJText.TfidfTransformer()
test_machine = machine(tfidf_transformer, [bag]) # Warning
# Warning: The scitype of `X`, in `machine(model, X)` is incompatible with `model=TfidfTransformer @802`:
# │ scitype(X) = AbstractVector{Multiset{Tuple{Vararg{Textual, var"#s2"}} where var"#s2"}}
# │ input_scitype(model) = Union{AbstractVector{var"#s32"} where var"#s32"<:AbstractVector{Textual}, AbstractVector{var"#s31"} where var"#s31"<:(Multiset{var"#s30"} where var"#s30"<:(Tuple{Vararg{var"#s31", var"#s32"}} where {var"#s32", var"#s31"<:AbstractString})), AbstractVector{var"#s3"} where var"#s3"<:Multiset{Textual}}.
fit!(test_machine)
There is some issue with it resolving the parametric type from my NGram
type and how it is resolved from ScientificTypes
😢 I'll look into this on Monday. Thanks for your work here.
@ablaom Let me know how I can assist.
@pazzo83 I've fixed this now. We had NGram where ScientificNGram was needed. I added @test_logs tests that ensure no warnings are thrown.
@ablaom this is great - thank you! I definitely appreciate that we are taking the time to get this right, as I really like this structure/design. It is much more flexible than what you have with scikit-learn's implementation.
Regarding the change, is it OK if we keep my NGram
type alias for the main code of the transformer (used for multiple dispatch purposes with the various methods) and then use the ScientificNGram
alias you have added for purposes of having it resolve to the proper scitype?
is it OK if we keep my NGram type alias for the main code of the transformer (used for multiple dispatch purposes with the various methods) and then use the ScientificNGram alias you have added for purposes of having it resolve to the proper scitype?
Not quite sure I understand the question. I just added the ScientificNGram alias to improve readability and swapped that one NGram out for ScientificNGram to make the code correct. Elsewhere your existing NGram definition is consistent with its use, as far as I can tell.
Are you suggesting exporting one or both aliases?
@ablaom - no, I just wanted to clarify that we were keeping both aliases in the code itself, so that answers my question! Do we think this is ready to go?
Was just testing this out on some documents locally, and I was able to do this:
convert_to_tuple_form(bag_of_words::Dict{<:AbstractString, Int}) =
Dict{MLJText.NGram, Int}(Tuple(String.(split(k))) => v for (k, v) in bag_of_words)
pipe = @pipeline(
X -> ngrams.(X),
X -> convert_to_tuple_form.(X),
MLJText.TfidfTransformer(min_doc_freq = 0.13, max_doc_freq = 0.80),
TSVDTransformer(nvals=500, maxiter=5000),
SVC
)
and it worked beautifully! I suspect we should add some sort of coerce
method - where would it be appropriate to add that?
@pazzo83 I should like to make a proper review of the code, which I have not done in any detail, and will try to get to that early next week at the latest.
A coerce method would live at ScientificTypes here. As an example, you can look at what is offered for images in these docs.
I think we should distinguish conversions that alter the representation of an object, so as to conform to a convention (what scitype is for), from real "transformations" of the data, for which we should have MLJ transformers, which could be Static or Unsupervised. So, for example, I'd classify your method above as a transformer, because of the tokenization (use of split). You may want to use a different kind of tokenisation, which means specifying options, and scitype isn't designed for this. Does this make sense?
With this is mind, I'd like to hear what coerce
methods you see as useful, and if necessary, what scientific type aliases we might want to introduce, to reduce user head-scratching. Perhaps NGram
(the scientific version called ScientificNGram
in the latest code) would be helpful. (That said, we introduced Binary = Finite{2}
as an alias, which I regret, as it's really not needed.)
Perhaps @storopoli cares to chip in here too.
I don't know if this was smart, but unlike Base.convert
the coerce
method works like this: You specify coerce(array, S)
where S
is the scitype you would like the elements of array
to have. This is convenient for the use-cases we initially had in mind. So if you have a vector of Dict{String, Int64} that you want to coerce into unigrams, you might do
coerce(vec_of_bags_of_words, Multiset{NGram{1}})
However, for images we did break this pattern and I'm wondering if it is necessary here as well.
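As a rough Base-only sketch of what such a unigram coercion might do under the hood (the helper name is hypothetical; the real coerce method would live in ScientificTypes and dispatch on scitypes):

```julia
# Hypothetical helper, for illustration only: lift each String key of a
# bag of words into a length-one tuple, the proposed unigram representation.
to_unigram_bag(bag::Dict{String,Int}) =
    Dict{Tuple{String}, Int}((k,) => v for (k, v) in bag)

bag     = Dict("cat" => 1, "hat" => 2)
coerced = to_unigram_bag(bag)
coerced[("cat",)]  # 1
```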
Thanks for the positive feedback and enthusiasm. I'm only sorry I don't have more time to support these enhancements. 😄
Do we have a coerce!
to signal the user that this kind of coercion will have side-effects?
I think that we should at least cover all scikit-learn text feature extraction features:
This should be the target for a simple Text Feature Extraction API
Do we have a coerce! to signal the user that this kind of coercion will have side-effects? I think that we should at least cover all scikit-learn text feature extraction features:
coerce!
was introduced for coercing tabular data for those tables that support in-place replacement of columns. Because the generic Tables.jl API does not require tables to support this, it must be implemented case-by-case, and it is only implemented for DataFrames
at present.
@pazzo83
I'm trying to understand the choice of representation for the sparse matrix returned by this transformer. It seems we are returning an n x p
matrix of type SparseMatrixCSC
, where n
is the number of documents and p
the number of tokens. This shape is consistent with MLJ conventions which is good.
However, is it not typical that p >> n? In that case, I would have thought the adjoint of a p x n SparseMatrixCSC was preferred, for memory reasons, unless we wanted slicing on tokens to be faster than slicing on documents (which sounds strange to me):
julia> a = sparse(rand([1, 2], 10), rand(1:1000, 10), 1:10:100, 2, 10000) |> Base.summarysize
80328
julia> a = adjoint(sparse(rand(1:1000, 10), rand([1, 2], 10), 1:10:100, 10000, 2)) |> Base.summarysize
384
I see that sk-learn's TfidfTransformer, which uses the CSR format, uses n x p. If they have it right, then we ought to use the adjoint of p x n since we have the other compression format, no?
I do see that what we return is consistent with TextAnalysis.DocumentTermMatrix
, which I guess is the reason we have made this choice.
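To make the layout question concrete, here is a sketch (random data, sizes invented) of the suggested alternative: store the matrix as p x n in CSC and present its adjoint, so that slicing out a document reads one compressed column:

```julia
using SparseArrays

n, p = 3, 10_000                 # few documents, many tokens
tfidf_pn = sprand(p, n, 0.001)   # stored as p x n in CSC format
X = adjoint(tfidf_pn)            # presented to MLJ as n x p

# A document slice (a row of X) walks a single compressed column of the
# underlying storage, which is cheap in CSC.
doc = X[1, :]
size(X)  # (3, 10000)
```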
I definitely see the reasoning behind transposing what we return - as you mention, p
is often much larger than n
(which is why you often see some sort of dimensionality reduction applied after you do TF-IDF). I guess I would just want to understand better how this would work. We get the DTM (document term matrix) from TextAnalysis
as n x p
- would we flip that and then build the TF-IDF matrix?
I can experiment with some different approaches. However, essentially what you are suggesting here is we build the matrix as p x n
and then return the adjoint of that?
I can experiment with some different approaches. However, essentially what you are suggesting here is we build the matrix as p x n and then return the adjoint of that?
That is what I am suggesting. However, I think we should probably find out if there is a reason for the TextAnalysis choice. I personally don't have much hands on experience in this area, and that's why I ask. Also, if we use TextAnalysis as is but follow my suggestion, there will be a performance hit as we are effectively changing the axis of index compression. (It would be better to re-implement the TextAnalysis method.) I'll open a query on the natural-language slack channel.
Yeah - as I've been going through the code, I think it would just be easier to re-implement the DTM ourselves, at least for now.
@pazzo83
Thanks again for this work.
I think my comments above at least cover the MLJ-API related stuff. A last suggestion is that we separate the transformer (and corresponding test file) into separate files, but it's probably less messy if we do that after merging this PR.
Since @storopoli has offered to help out, and I expect he has more NLP knowledge than I do, I suggest we see if @storopoli can do a formal code review when you're ready. If it's possible, I should like to limit my involvement to the integration side of things.
Are you happy for me to make the repo public to facilitate a review?
@ablaom That sounds great! Thanks for all your help here!
Done.
Ok @ablaom @storopoli - I made a few additional changes. I updated the deps - is it ok to have MLJModelInterface
set to just 1.3? I'll use whatever the convention is there (figured I'd bump it up to at least that since 1.3.2 is the current version). I also implemented the switch in orientation of the actual TF-IDF matrix that we discussed. It does seem to be quite a bit faster - and works just fine in some of the local tests I have.
Also, you all might notice that we keep failing in the Julia 1.3 check. I was doing some testing with different Julia versions (thank you Google Colab), and the issue appears to be how I am specifying types in the build_corpus
method. As you see, I've defined more or less a type alias for NGrams as such:
const NGram{N} = NTuple{<:Any,<:AbstractString}
Then I use that type alias for multiple dispatch in a function that constructs the proper TextAnalysis.Corpus
object:
_convert_bag_of_words(X::Dict{NGram, Int}) =
Dict(join(k, " ") => v for (k, v) in X)
build_corpus(X::Vector{Dict{NGram, Int}}) =
build_corpus(_convert_bag_of_words.(X))
build_corpus(X::Vector{Dict{S, Int}}) where {S <: AbstractString} =
Corpus(NGramDocument.(X))
build_corpus(X) = Corpus(TokenDocument.(X))
In versions prior to Julia 1.5, the compiler blows up when it tries to deal with a Dict
with the NGram
type alias as a key. So something like this:
test_dict2 = Dict{NGram, Int}(("a", "b") => 1, ("c", "d", "e") => 2)
causes Julia to crash. However, starting in 1.5, it does just fine. Should we bump the Julia version in [compat] to 1.5?
I updated the deps - is it ok to have MLJModelInterface set to just 1.3? Yep, that's fine.
Re your fail, I'm guessing you have some missing <:. So what if you try:
_convert_bag_of_words(X::Dict{<:NGram, <:Integer}) =
Dict(join(k, " ") => v for (k, v) in X)
build_corpus(X::Vector{<:Dict{<:NGram, <:Integer}}) =
build_corpus(_convert_bag_of_words.(X))
build_corpus(X::Vector{<:Dict{S, <:Integer}}) where {S <: AbstractString} =
Corpus(NGramDocument.(X))
build_corpus(X) = Corpus(TokenDocument.(X))
:exclamation: No coverage uploaded for pull request base (dev@7349f5c). Click here to learn what that means. The diff coverage is n/a.
.
@@ Coverage Diff @@
## dev #1 +/- ##
======================================
Coverage ? 80.48%
======================================
Files ? 1
Lines ? 82
Branches ? 0
======================================
Hits ? 66
Misses ? 16
Partials ? 0
Continue to review full report at Codecov.
Powered by Codecov. Last update 7349f5c...108b732. Read the comment docs.
Initial commit - a few notes
...Strings or a vector of StringDocuments from the TextAnalysis package. Then the various fit functions apply the appropriate conversions, create n_grams (if necessary), etc. If you think this isn't the right approach, I am definitely open to other ways.
This is definitely in a more raw state than the TSVD code - so please let me know what you think should be changed/removed/added. Thanks!