Closed JaredSchwartz closed 3 weeks ago
Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.
:information_source: You can also turn on project coverage checks and project coverage reporting on Pull Request comment
Thanks for integrating Codecov - We've got you covered :open_umbrella:
Reworked both the A Priori Algorithm and ECLAT algorithms to operate faster. Depending on the dataset and parameters, I'm seeing:
A Priori Changes:
Added transaction database subsetting
In the old implementation, each child node search was also calculating support for all columns that don't meet the minimum support threshold. With this change, the algorithm now only calculates support for the base nodes in each iteration.
Added hashing and deduplication of rules during generation
A Priori tends to create redundant rules where the LHS will have the same items in a different order. This can be solved with sets or bitsets, but converting to and from these to vectors to slice the transactions matrix is very slow.
The solution used here keeps the vector architecture for fast matrix slicing, but removes redundant rules at each level of calculation. This prevents further rule generation based on the redundant rules, and it provides greater resource savings the greater maximum length of the rules being generated.
ECLAT Changes
Enabled multithreading
Enabling multithreading in ECLAT is tricky because of its recursive nature. This approach multi-threads at the first equivalence class, keeping the threads entirely independent.
This speeds up ECLAT significantly, but the amount of computation each thread does is very uneven due to the search pattern in ECLAT. Adding additional cores will see diminishing returns on speed gains.
Testing Changes
Sorted itemsets generated by ECLAT
Sorting the itemsets is now necessary with the introduction of multithreading.