biolab / orange3-associate

🍊 📦 Frequent itemsets and association rules mining for Orange 3.

Feature Request #25

Closed Sanjoth closed 5 years ago

Sanjoth commented 5 years ago

Hi.

In the orange3-associate Python module, could the frequent_itemsets function take a max-length parameter that limits the length of the itemsets it generates?

For example, {1,2} is an itemset of length 2 and {1,2,3} is an itemset of length 3.

If I could limit the itemset length so that itemsets above a threshold aren't even generated, that would be really helpful. Currently I am working with big data, and from what I can see, itemsets of all lengths are generated, which takes a large chunk of memory and time when the support threshold is low.

kernc commented 5 years ago

frequent_itemsets() is a generator function, yielding itemsets as it goes. You can limit it to the itemsets of interest by filtering its output:

from orangecontrib.associate.fpgrowth import frequent_itemsets

# T is your database of transactions
for itemset, support in frequent_itemsets(T, .05):
    if 2 <= len(itemset) <= 3:
        print("Match:", itemset)
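
If you need this in more than one place, the same filter can be wrapped into a reusable generator (a sketch; itemsets_up_to is a hypothetical name). Note it only limits what you keep: the miner still explores longer itemsets internally.

from orangecontrib.associate.fpgrowth import frequent_itemsets

def itemsets_up_to(transactions, min_support, max_len):
    # Hypothetical helper: filters the generator's output by length.
    # This saves storage for the caller, but not mining time, since
    # longer itemsets are still generated internally.
    for itemset, support in frequent_itemsets(transactions, min_support):
        if len(itemset) <= max_len:
            yield itemset, support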
Sanjoth commented 5 years ago

Yes, but would it be possible for the function to skip itemsets above that length internally, so that those combinations are never processed?

To give some context, I was previously using apriori from mlxtend, which has an additional max_len parameter. API documentation: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/#API It ignores any itemsets above max_len internally, which saves a great deal of space and time, especially given that it's apriori.
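
For reference, a minimal sketch of that mlxtend usage (the toy transactions here are made up; apriori expects a one-hot encoded DataFrame):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c", "d"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)
# max_len=2: candidate itemsets longer than 2 are never generated
itemsets = apriori(df, min_support=0.25, max_len=2, use_colnames=True)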

The data I'm working with contains itemsets of up to 5-6 items, and it is huge. With the current fpgrowth implementation and filtering logic similar to what you wrote, it hogs more than 12 GB of memory.

My only option is to increase the support threshold, but that might remove some interesting patterns. I am not sure whether this is possible algorithmically, but if it is, it would be great for my use case.

kernc commented 5 years ago

The source code is open, tidy, fairly well documented, and it follows the paper to the letter (the relevant §sections are marked). Patches with improvements will be reviewed favorably.
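
To illustrate where such a cutoff could live, here is a naive level-wise (Apriori-style) sketch, not the package's FP-growth, with all names hypothetical. In a level-wise miner the bound is simply a loop limit; an FP-growth implementation could analogously stop recursing into conditional trees beyond max_len items.

from collections import defaultdict
from itertools import combinations

def frequent_itemsets_maxlen(transactions, min_support, max_len):
    """Naive level-wise miner; yields (itemset, count) up to max_len items."""
    min_count = min_support * len(transactions)
    frequent_prev = None
    for k in range(1, max_len + 1):  # the cutoff: no level beyond max_len
        counts = defaultdict(int)
        for t in transactions:
            for cand in combinations(sorted(t), k):
                # Apriori pruning: every (k-1)-subset must be frequent
                if frequent_prev is None or all(
                        frozenset(sub) in frequent_prev
                        for sub in combinations(cand, k - 1)):
                    counts[frozenset(cand)] += 1
        frequent_prev = {s: c for s, c in counts.items() if c >= min_count}
        if not frequent_prev:
            return
        yield from frequent_prev.items()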

For your large data, perhaps a more lightweight and optimized (i.e. non-Python) solution would work better.

Sanjoth commented 5 years ago

Okay, thank you.