JuliaText / TextAnalysis.jl

Julia package for text analysis

Methods to merge two DocumentTermMatrix instances #243

Closed tanmaykm closed 3 years ago

tanmaykm commented 3 years ago

I am trying to implement incremental updates to a tf_idf matrix built from a corpus. There may also be a need to remove certain documents or terms from the matrix. Working directly on the document term matrix seems more efficient than rebuilding everything from the corpus.

So creating a small incremental document term matrix from the new documents and merging it with the existing full document term matrix would probably be a good approach. Methods to remove documents and terms from the matrix would be useful too.

It will be useful to have such methods available in this package.
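
To make the removal request concrete, here is a minimal sketch that works on a plain sparse count matrix with documents as rows and terms as columns. It uses only the `SparseArrays` standard library; `drop_rows_cols` and its keyword arguments are hypothetical names for illustration, not part of TextAnalysis.jl.

```julia
using SparseArrays

# Hypothetical helper (not a TextAnalysis.jl function): drop documents by row
# index and terms by name from a documents-by-terms matrix.
function drop_rows_cols(M::SparseMatrixCSC, terms::Vector{String};
                        drop_docs=Int[], drop_terms=String[])
    keep_rows = setdiff(1:size(M, 1), drop_docs)          # rows to keep
    keep_cols = findall(t -> !(t in drop_terms), terms)   # columns to keep
    return M[keep_rows, keep_cols], terms[keep_cols]
end

M = sparse([1 0 2; 0 3 0; 4 0 5])        # 3 documents, 3 terms
terms = ["apple", "banana", "cherry"]
smaller, kept = drop_rows_cols(M, terms; drop_docs=[2], drop_terms=["banana"])
```

Indexing a sparse matrix with index vectors returns a new sparse matrix, so the original is left untouched. If the matrix holds tf_idf weights rather than raw counts, the weights would need recomputing after removing documents, because document frequencies change.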

aviks commented 3 years ago

I haven't put much thought into this, but it seems that if we keep the same lexicon, merging the DTMs is a matter of adding the matrices... is that right? There should be a way of generating the DTM with a specified lexicon. If we want to add new words to the lexicon from the incremental corpus, things get more complex.
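
Both cases can be sketched as follows, assuming rows are documents, columns are terms, and the two matrices cover disjoint sets of documents. Under those assumptions a shared lexicon makes the merge a vertical concatenation of rows. With differing lexicons, each matrix first has its columns realigned to the union of terms. `merge_counts` is a hypothetical helper built on the `SparseArrays` standard library, not an existing TextAnalysis.jl method.

```julia
using SparseArrays

# Hypothetical helper (not a TextAnalysis.jl function): merge two
# documents-by-terms count matrices whose columns follow the given term lists.
function merge_counts(A::SparseMatrixCSC, terms_a::Vector{String},
                      B::SparseMatrixCSC, terms_b::Vector{String})
    terms = sort(union(terms_a, terms_b))                  # merged lexicon
    column = Dict(t => j for (j, t) in enumerate(terms))   # term -> new column

    # Move every stored entry to its column in the merged lexicon.
    function realign(M, old_terms)
        rows, cols, vals = findnz(M)
        new_cols = Int[column[old_terms[j]] for j in cols]
        return sparse(rows, new_cols, vals, size(M, 1), length(terms))
    end

    # Stack the documents of B below the documents of A.
    return vcat(realign(A, terms_a), realign(B, terms_b)), terms
end

A = sparse([1 2; 0 1])     # 2 documents over ["apple", "banana"]
B = sparse([3 1])          # 1 new document over ["banana", "cherry"]
merged, terms = merge_counts(A, ["apple", "banana"], B, ["banana", "cherry"])
```

The sketch merges raw counts. For a tf_idf matrix, the weights would be recomputed from the merged counts, since adding documents changes the inverse document frequencies.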