Add an experimental implementation of scitype for text analysis

ablaom commented 2 years ago

Following on from this discussion, I propose we add an implementation of scitype that we mark as experimental (changeable without a breaking release).

Initially I thought of implementing something using the TokenDocument from TextAnalysis.jl, but I now think a nicer class of objects is the Document{T} type from CorpusLoaders, which has the added benefit of being much more lightweight than TextAnalysis.jl. It also defines a TaggedWord abstract type, together with a bunch of useful concrete subtypes, which are used in all the corpora you can load from that package.

@pazzo83 I wonder what you think of having your transformer suck in data as some form of CorpusLoaders.Document{T}. This type is defined here. See also this comment of @oxinabox. I don't think it would be hard to convert such documents (with suitably restricted T) to TokenDocument, if that was convenient for you to do internally.

An implementation of scitype along these lines is drafted here. This is from the tests:

    tagged_word = CorpusLoaders.PosTaggedWord("NN", "wheelbarrow")
    tagged_word2 = CorpusLoaders.PosTaggedWord("NN", "soil")
    @test scitype(tagged_word) == Annotated{Textual}
    bag_of_words = Dict("cat"=>1, "dog"=>3)
    @test scitype(bag_of_words) == Multiset{Textual}
    bag_of_tagged_words = Dict(tagged_word => 5)
    @test scitype(bag_of_tagged_words) == Multiset{Annotated{Textual}}
    @test scitype(Document("kadsfkj")) == Unknown
    @test scitype(Document([tagged_word, tagged_word2])) ==
        Annotated{AbstractVector{Annotated{Textual}}}
    nested_tokens = [["dog", "cat"], ["bird", "cat"]]
    @test scitype(Document(nested_tokens)) ==
        Annotated{AbstractVector{AbstractVector{Textual}}}
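To make the dispatch rules implied by these tests concrete, here is a toy, self-contained sketch. It does not use ScientificTypes.jl or CorpusLoaders.jl: every type below and the function `toy_scitype` are hypothetical stand-ins, and the real draft may be organized quite differently. It also assumes containers are non-empty and homogeneous, which the real `scitype` does not require.

```julia
# Stand-ins for the scientific types named in the tests (hypothetical).
abstract type Textual end
abstract type Unknown end
abstract type Annotated{S} end
abstract type Multiset{S} end

# Stand-ins for the CorpusLoaders types used in the tests (hypothetical).
struct PosTaggedWord
    tag::String
    word::String
end
struct Document{T}
    content::T
end

# Fallback: anything without a more specific rule is Unknown.
toy_scitype(::Any) = Unknown
# A plain string is text.
toy_scitype(::AbstractString) = Textual
# A tagged word is text carrying an annotation (the tag).
toy_scitype(::PosTaggedWord) = Annotated{Textual}
# Simplification: take the element scitype from the first element only.
toy_scitype(v::AbstractVector) = AbstractVector{toy_scitype(first(v))}
# A bag of words maps items to integer counts.
toy_scitype(d::Dict{<:Any,<:Integer}) = Multiset{toy_scitype(first(keys(d)))}
# A document is annotated only when its content is a vector;
# Document("raw string") falls through to the Unknown fallback.
toy_scitype(d::Document{<:AbstractVector}) = Annotated{toy_scitype(d.content)}

tagged_word = PosTaggedWord("NN", "wheelbarrow")
nested_tokens = [["dog", "cat"], ["bird", "cat"]]
```

With these rules, an untokenized `Document("kadsfkj")` is `Unknown` because no method matches `Document{String}`, which mirrors the behaviour the tests check for.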

@storopoli

pazzo83 commented 2 years ago

I really like this! I'll update the TFIDF transformer to adopt these types.

pazzo83 commented 2 years ago

Is the idea for an entire Corpus of documents to be represented as one Document object that is comprised of a vector of Strings? For the TfidfTransformer, the ideal input would basically be a vector of strings or vector of some Document type.

ablaom commented 2 years ago

No, I wouldn't have thought so. Copied from my slack response:

Sounds like you're thinking about having your transformer process untokenized data. But perhaps tokenization should be left to a separate transformer? There seem to be a few of these about. In that case your allowed input is a vector of tokenized documents. That is, a “document” is something with machine type CorpusLoaders.Document{<:AbstractVector{<:AbstractString}} (ignoring the possibility of tagged words here, for simplicity) and your transformer is sucking in a vector of these (a corpus). You would then articulate this type requirement by declaring input_scitype(::Type{<:YourTransformer}) = AbstractVector{<:Annotated{AbstractVector{Textual}}}. Yes?
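A minimal sketch of what that declaration expresses, runnable without any packages. The abstract types here are local stand-ins for the ones in ScientificTypes.jl, and `YourTransformer`, `input_scitype` and `accepts` are hypothetical names defined only for this illustration; in an actual MLJ model the method would presumably be added to the model interface's `input_scitype` rather than to a local function.

```julia
# Local stand-ins for the scientific types (hypothetical).
abstract type Textual end
abstract type Annotated{S} end

# Hypothetical transformer type.
struct YourTransformer end

# The type requirement: a vector (corpus) of tokenized documents.
input_scitype(::Type{<:YourTransformer}) =
    AbstractVector{<:Annotated{AbstractVector{Textual}}}

# Scitype of one tokenized, untagged document, as in the tests in this issue.
const DocumentScitype = Annotated{AbstractVector{Textual}}
# Scitype of a corpus of such documents.
const CorpusScitype = AbstractVector{DocumentScitype}

# Hypothetical helper: does a given scitype satisfy the model's requirement?
accepts(model_type, S) = S <: input_scitype(model_type)
```

Here `accepts(YourTransformer, CorpusScitype)` holds, while a single document (`DocumentScitype`) or a vector of raw strings (`AbstractVector{Textual}`) does not satisfy the requirement, which is the point of leaving tokenization to a separate transformer.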

ablaom commented 2 years ago

Closed by https://github.com/JuliaAI/ScientificTypes.jl/pull/153. See also #158.

JuliaAI / ScientificTypes.jl

Add an experimental implementation of scitype for text analysis #154