Marcuccio / cvm

sampling-based space-efficient algorithm to estimate the number of distinct elements in streams

Data streams #1

Open BrandonDyer64 opened 4 months ago

BrandonDyer64 commented 4 months ago

I see the estimate() function is limited to a vector of String, but the CVM algorithm is a lot more powerful than that.

Why not a function signature like this?

fn estimate<T>(source: impl IntoIterator<Item = T>, delta: f64, epsilon: f64) -> usize
where
    T: PartialOrd + PartialEq,

Or a structure?

let mut estimator = Estimator::<String>::new();

estimator.push("hello".to_string());
estimator.push("bye".to_string());
estimator.push("hello".to_string());

let num_words = estimator.estimate();

assert_eq!(num_words, 2);
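
For illustration, a structure like the one above could be sketched roughly as follows. This is only a sketch under assumptions, not the crate's implementation: the `with_capacity` constructor, the `Eq + Hash` bound (used here instead of `PartialOrd + PartialEq` so a `HashSet` can serve as the buffer), and the tiny xorshift generator (a stand-in for a real RNG crate, used to avoid dependencies) are all hypothetical choices.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Minimal xorshift64 generator, used only so the sketch needs no crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical streaming estimator; names follow the proposal in this
/// issue, not an existing API.
pub struct Estimator<T> {
    buffer: HashSet<T>,
    capacity: usize, // buffer size (the threshold in the CVM algorithm)
    p: f64,          // current sampling probability
    rng: XorShift64,
}

impl<T: Eq + Hash> Estimator<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            buffer: HashSet::with_capacity(capacity),
            capacity,
            p: 1.0,
            rng: XorShift64(0x9E37_79B9_7F4A_7C15), // fixed non-zero seed
        }
    }

    pub fn push(&mut self, item: T) {
        // Forget any earlier decision about this item, then re-sample it.
        self.buffer.remove(&item);
        if self.rng.next_f64() < self.p {
            self.buffer.insert(item);
        }
        // When the buffer is full, keep each element with probability 1/2
        // and halve the sampling probability. The original algorithm reports
        // failure if the buffer is still full afterwards; this sketch simply
        // tries again on the next push.
        if self.buffer.len() >= self.capacity {
            let rng = &mut self.rng;
            self.buffer.retain(|_| rng.next_u64() >> 63 == 0);
            self.p /= 2.0;
        }
    }

    /// Estimated number of distinct items seen so far.
    pub fn estimate(&self) -> usize {
        (self.buffer.len() as f64 / self.p).round() as usize
    }
}
```

As long as the capacity exceeds the number of distinct items, the sampling probability stays at 1 and the estimate is exact, so the three pushes in the example above would give 2.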

Marcuccio commented 4 months ago

That's a great idea. I also like the idea of using a structure to manage the estimation process.

I’ll look into implementing these enhancements. If you have any additional thoughts or would like to contribute directly, I’d be more than happy to collaborate. Your feedback is invaluable, and I appreciate the time you've taken to provide it.

Thanks!

Marcuccio commented 4 months ago

@BrandonDyer64 I implemented the generic signature for the estimate function as you suggested.

I was wondering whether it would make sense to do the same for estimate_from_many as well. Right now it takes a &[PathBuf].

see: 1-data-streams

BrandonDyer64 commented 3 months ago

@Marcuccio if you make it into a Pull Request, I can take a look at it

Marcuccio commented 3 months ago

@BrandonDyer64 the estimate_from_many function has not been modified so far; it still takes a &[PathBuf] and is used by the main function.

But maybe it would be nice to accept generics there too. What do you think?
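
To make the question concrete, one possible shape for a generic estimate_from_many is sketched below. It is an assumption, not the code on the branch: the `estimate` here is a placeholder that counts exactly with a `HashSet` (and omits the delta and epsilon parameters) so the example compiles on its own, and the `Eq + Hash` bound is likewise hypothetical.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Placeholder for the crate's generic `estimate`: it counts distinct items
/// exactly so this sketch is self-contained. The real function would run CVM.
fn estimate<T: Eq + Hash>(source: impl IntoIterator<Item = T>) -> usize {
    source.into_iter().collect::<HashSet<T>>().len()
}

/// Hypothetical generic variant: accepts any collection of sources and
/// treats them as a single concatenated stream.
fn estimate_from_many<S, T>(sources: impl IntoIterator<Item = S>) -> usize
where
    S: IntoIterator<Item = T>,
    T: Eq + Hash,
{
    estimate(sources.into_iter().flatten())
}
```

With a signature like this, the main function could still start from a &[PathBuf] by mapping each path to an iterator over its words before calling estimate_from_many, while other callers could pass in-memory collections directly.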