ferristseng / rust-tfidf

Apache License 2.0
18 stars 4 forks source link

a performance question and proposal to expose the `Idf` trait #5

Open btc opened 3 years ago

btc commented 3 years ago

Hi! Thanks for publishing this software. It's quite helpful to potentially be able to use your library instead of re-implementing TFIDF myself. I am grateful for the time and attention you've given to this.

For my use-case, I am working with a large corpus of documents and trying to understand if I can use this library in a way which will have suitable performance.

Examples in this repo show examples of the form:

for term in terms:
  for doc in docs:
    score = compute_tfidf(term, doc) 

where compute_tfidf is either TfIdfDefault::tfidf or MyTfIdfStrategy::tfidf


  1. Is it true that the idf implementations exposed by this crate all require a O(n) linear iteration over the documents/corpus?
  2. Is it possible to use the idf functions on their own, without going through tfidf?

Presented in pseudocode here, I would like to do the following:

for term in terms:
  idf = compute_idf(term, docs)
  for doc in docs:
    score = compute_tf(term, doc) * idf

Concretely, I have tried to use the library in the following way, but ran into an error that I don't quite understand yet:

use tfidf::idf::{InverseFrequencySmoothIdf};
use tfidf::tf::DoubleHalfNormalizationTf;
use tfidf::{Tf, Idf};

for term in terms {
  let idf = idf::InverseFrequencyIdf::idf(term, docs)
  for doc in docs {
    let tf = tf::DoubleHalfNormalizationTf::tf(term, doc)
    let tfidf = tf * idf
  }
}

Error 1:

Unresolved import `tfidf::Idf`

Error 2:

No function or associated item named `idf` found for struct `InverseFrequencySmoothIdf` in the current scope

Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

by changing:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf,
};

proposed:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf, Idf,
};
ferristseng commented 3 years ago

Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

Hi! That would be fine to submit that patch. I'm surprised that that trait isn't exposed already!

Is it true that the idf implementations exposed by this crate all require a O(n) linear iteration over the documents/corpus?

Just an FYI, I'm only passively maintaining this project. I'm not sure if I'm able to answer your question, since I haven't look at this code in a long time.

btc commented 3 years ago

Hi! That would be fine to submit that patch. I'm surprised that that trait isn't exposed already!

Here you go:

https://github.com/ferristseng/rust-tfidf/pull/6