ablaom closed this 2 years ago
I really like this! I'll update the TFIDF transformer to adopt these types.
Is the idea for an entire corpus of documents to be represented as one `Document` object comprising a vector of `String`s? For the `TfidfTransformer`, the ideal input would basically be a vector of strings or a vector of some `Document` type.
No, I wouldn't have thought so. Copied from my slack response:
Sounds like you're thinking about having your transformer process untokenized data. But perhaps tokenization should be left to a separate transformer? There seem to be a few of these about. In that case your allowed input is a vector of tokenized documents; that is, a "document" is something with machine type `CorpusLoaders.Document{<:AbstractVector{<:AbstractString}}` (ignoring the possibility of tagged words here, for simplicity) and your transformer is sucking in a vector of these (a corpus). You would then articulate this type requirement by declaring `input_scitype(::Type{<:YourTransformer}) = AbstractVector{<:Annotated{AbstractVector{Textual}}}`. Yes?
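To make the type requirement above concrete, here is a minimal sketch of how such a declaration could gate a transformer's input. It runs without any packages because every type in it is a local stand-in: `Textual`, `Annotated`, `Document`, `scitype`, `input_scitype` and the `accepts` helper are all defined here for illustration only, and `YourTransformer` is a hypothetical placeholder. In real code I'd assume the scitypes come from ScientificTypes.jl, `Document` from CorpusLoaders.jl, and the `input_scitype` trait from the MLJ model interface.

```julia
# Local stand-ins so the sketch is self-contained (assumption: the real
# definitions live in ScientificTypes.jl and CorpusLoaders.jl).
abstract type Textual end
abstract type Annotated{S} end

# Hypothetical minimal document wrapper, parameterized by its content type.
struct Document{T}
    content::T
end

# Stand-in scitype rules, for this sketch only.
scitype(::AbstractString) = Textual
scitype(::Document{<:AbstractVector{<:AbstractString}}) =
    Annotated{AbstractVector{Textual}}
# A vector's scitype wraps the union of its elements' scitypes.
scitype(v::AbstractVector) = AbstractVector{Union{map(scitype, v)...}}

# Hypothetical transformer that only accepts a corpus of tokenized documents.
struct YourTransformer end
input_scitype(::Type{<:YourTransformer}) =
    AbstractVector{<:Annotated{AbstractVector{Textual}}}

# Hypothetical helper: does the data satisfy the model's declared input scitype?
accepts(model, X) = scitype(X) <: input_scitype(typeof(model))

# A corpus is a vector of tokenized documents.
corpus = [Document(["the", "cat", "sat"]), Document(["a", "dog", "barked"])]
# Untokenized strings, which a separate tokenizer would need to handle first.
raw = ["the cat sat", "a dog barked"]

println(accepts(YourTransformer(), corpus))  # true
println(accepts(YourTransformer(), raw))     # false
```

The point of the design is that the untokenized vector of strings is rejected by the scitype check, so tokenization stays the job of an upstream transformer.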
Closed by https://github.com/JuliaAI/ScientificTypes.jl/pull/153. See also #158.
Following on from this discussion I propose we add an implementation of `scitype` that we mark as experimental (changeable without a breaking release).
Initially I thought of implementing something using the `TokenDocument` from TextAnalysis.jl, but I rather think a nicer class of objects is the `Document{T}` type from `CorpusLoaders`, which has the added benefit of being much more lightweight than TextAnalysis.jl. It also defines a `TaggedWord` abstract type, together with a bunch of useful concrete subtypes, which are used in all the corpora you can load from that package.
@pazzo83 I wonder what you think of having your transformer suck in data as some form of `CorpusLoaders.Document{T}`. This type is defined here. See also this comment of @oxinabox. I don't think such documents (with suitably restricted `T`) would be hard to convert to `TokenDocument`, if that was convenient for you to do internally.
An implementation of `scitype` along these lines is drafted here. This is from the tests:
@storopoli