JaredSchwartz / RuleMiner.jl

Association Rule Mining in Julia
MIT License
6 stars 0 forks source link

Implement sampling for APriori #9

Open Tortar opened 1 week ago

Tortar commented 1 week ago

When the dataset doesn't fit in main memory it could be useful to allow to just analyze a single sample instead of all the dataset. This would require to adjust the support to p*s where p is the sample proportion and s is the support required. This would be approximate, but it is very simple to implement as you see (for a reference look at 6.4.1 of http://infolab.stanford.edu/~ullman/mmds/ch6.pdf)

JaredSchwartz commented 1 week ago

I wonder if it would be better to write this as a function which operates on the Transactions object that samples down the transactions directly.

This would have the advantage of being able to use the the sampled transactions with other algorithms, but it would also open up the the user to error for not adjusting their support parameter.

Tortar commented 1 week ago

Seems even better to me!

Tortar commented 1 week ago

Ah yes sorry re-evaluating after your last sentence. Yes, that could be a problem, I don't know if there is a way to avoid that correctness trap

Tortar commented 1 week ago

Maybe allowing both things could be useful, internally we will use the sampling methods on the transactions. But a method to sample transactions is also exposed publicly

JaredSchwartz commented 1 week ago

Yeah, I just think that you lose some of the memory usage advantages if you are sampling in the function instead of sampling the Transactions beforehand.

Obviously, the actual algorithm runs faster, but if you are worried about fitting data into memory, this doesn't really solve the issue and actually increases memory usage if you have both the original and the sampled copy in-memory.

I also feel like the correctness issue is on the user a little bit. If we adjust things behind the scenes, the outputs won't line up with what they input unless we also scale the outputs, right? Like if they put in a support parameter of 0.5 and a sampling proportion of 0.1, the output would show all rules which have a support >= 0.05, which may also not be intuitive for people who don't understand how sampling affects the results.