JaredSchwartz / RuleMiner.jl

Association Rule Mining in Julia
MIT License

Implement SON algorithm #8

Open Tortar opened 3 weeks ago

Tortar commented 3 weeks ago

It is described, for example, in http://infolab.stanford.edu/~ullman/mmds/ch6.pdf, section 6.4.3. It should not be too difficult because, in the end, it amounts to applying Apriori to the transactions split into slices. It would help with mining much larger datasets if the slices are analyzed sequentially.

JaredSchwartz commented 3 weeks ago

This is super interesting!

I don't think I'll personally be able to implement this since I don't have the expertise in distributed computing in Julia or a cluster to test on. I'm happy to support you if you want to have a go at implementing it though!

I'm sure I could build the scaffolding for this, like adding an n_lines kwarg to load_transactions() that, along with the existing skiplines, would allow chunking the transactions from a file (probably a good idea regardless).
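
A minimal sketch of that chunking idea, assuming a basket-format file with one transaction per line and comma-separated items. `read_chunk` here is a hypothetical helper, not the real `load_transactions()`, and `n_lines` is the proposed keyword alongside the existing `skiplines`:

```julia
# Hypothetical helper (not RuleMiner.jl's load_transactions()): read one chunk
# of a basket-format file, skipping `skiplines` lines and then reading at most
# `n_lines` transactions. Note that skipping still scans past the earlier
# lines, so every chunk read walks the file from the top.
function read_chunk(path::String; skiplines::Int=0, n_lines::Int=typemax(Int), delim=',')
    chunk = Vector{Vector{String}}()
    open(path, "r") do io
        for _ in 1:skiplines
            eof(io) && break
            readline(io)
        end
        for _ in 1:n_lines
            eof(io) && break
            line = strip(readline(io))
            isempty(line) || push!(chunk, String.(split(line, delim)))
        end
    end
    return chunk
end

# Usage: walk a file in chunks of 100_000 transactions, keeping only one chunk
# in memory at a time.
# offset = 0
# while true
#     chunk = read_chunk("transactions.csv"; skiplines=offset, n_lines=100_000)
#     isempty(chunk) && break
#     # ... mine this chunk ...
#     offset += 100_000
# end
```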

Tortar commented 3 weeks ago

Yes, indeed, I think having something like n_lines would be useful. I would actually like to try it out. It would not necessarily require expertise in distributed computing, because I think it should be fine to express it, at least initially, as a sequential run through all the chunks, such that each chunk fits in main memory.

JaredSchwartz commented 3 weeks ago

Oh, I see. So implementing it on a single compute node, but chunking the input and calculating support using the method in that section. The paper's focus in that section seemed to be on parallel and distributed compute, so that's why I got nervous haha.

I'll make a separate issue and work on building chunking into the load_transactions() function (hopefully sometime this week).

I can also try implementing this SON algorithm in that single-node configuration, and we can see how well it works.

The only thing I think we'll run into is that performance will be much slower due to all of the file I/O that has to happen for each round of candidate set generation. That makes this really only suitable for datasets that are extremely large relative to the available system memory.

Tortar commented 3 weeks ago

Surely it will have some I/O overhead, but when you say "at each round of candidate set generation" I'm not sure it matches what I have in mind: once a chunk is loaded into RAM, we can go as far as we want with the size of the candidate sets, I think, so we only need one I/O read per chunk.

Tortar commented 3 weeks ago

> The paper's focus in that section seemed to be on parallel and distributed compute, so that's why I got nervous haha.

Yes, indeed; I have now linked to the previous section, which is less focused on distributed computing :-)

JaredSchwartz commented 3 weeks ago

> Surely it will have some I/O overhead, but when you say "at each round of candidate set generation" I'm not sure it matches what I have in mind: once a chunk is loaded into RAM, we can go as far as we want with the size of the candidate sets, I think, so we only need one I/O read per chunk.

So I was writing an explanation, but section 6.2.6 of the chapter you linked actually does a pretty good job of explaining what I'm referring to. If you only ever want one chunk loaded at a time, you have to reload the chunks for each filtering step after candidate generation, because you need to have read all chunks and calculated all supports before you can move on to larger itemsets.

In my code, the loop is called "level". The book refers to it as the sets of size $k$.
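
A rough, self-contained sketch of that loop structure, assuming in-memory chunks standing in for the chunk-by-chunk file reads and assuming the Combinatorics.jl package is available for candidate generation (`levelwise_frequent` is an illustrative name, not a RuleMiner.jl function). The point is that filtering the size-$k$ candidates needs their supports summed over all chunks, so every level forces another full pass over the data:

```julia
using Combinatorics  # combinations()

function levelwise_frequent(chunks, min_count::Int, max_k::Int)
    # Level 1: one full pass over every chunk to count single items.
    item_counts = Dict{String,Int}()
    for chunk in chunks, txn in chunk, item in txn
        item_counts[item] = get(item_counts, item, 0) + 1
    end
    frequent = Dict{Int,Vector{Vector{String}}}(
        1 => [[i] for (i, c) in item_counts if c >= min_count])
    for k in 2:max_k
        isempty(frequent[k-1]) && break
        # Candidate generation: k-subsets of the surviving items whose
        # (k-1)-subsets were all frequent at the previous level.
        prev = Set(sort.(frequent[k-1]))
        items = sort(unique(reduce(vcat, frequent[k-1])))
        candidates = [c for c in combinations(items, k)
                      if all(s -> s in prev, combinations(c, k-1))]
        isempty(candidates) && break
        # Filtering: another full pass over every chunk to sum exact supports.
        counts = Dict(c => 0 for c in candidates)
        for chunk in chunks, txn in chunk, c in candidates
            issubset(c, txn) && (counts[c] += 1)
        end
        frequent[k] = [c for (c, n) in counts if n >= min_count]
    end
    return frequent
end
```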

Tortar commented 3 weeks ago

Mmmh, I think that in the context of the SON algorithm it is possible to load each chunk only once. If you look at section 6.4.4, you can see that the book describes the first Map task as applying Apriori to each chunk with a reduced support threshold. So if we go sequentially, we only need to keep in main memory the frequent itemsets found in each chunk. After that, we remove the false positives with a second pass.

Tortar commented 3 weeks ago

In practice, to my understanding, you can avoid filtering at each set size because you allow false positives, and you remove them at the end with a second pass.
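
A minimal sketch of that flow, again assuming in-memory chunks in place of file reads, with Combinatorics.jl assumed available and a brute-force `chunk_frequent` standing in for "Apriori run on one chunk" (both function names are illustrative, not part of RuleMiner.jl):

```julia
using Combinatorics  # combinations()

# Brute-force stand-in for running Apriori on a single chunk: enumerate every
# itemset of size <= max_k in the chunk and keep the locally frequent ones.
function chunk_frequent(chunk, min_count::Int, max_k::Int)
    counts = Dict{Vector{String},Int}()
    for txn in chunk, k in 1:min(max_k, length(txn)), s in combinations(sort(txn), k)
        counts[s] = get(counts, s, 0) + 1
    end
    return Set(s for (s, c) in counts if c >= min_count)
end

function son_frequent(chunks, min_support::Float64, max_k::Int)
    total = sum(length, chunks)
    # Pass 1: union of the per-chunk frequent itemsets, mined with the support
    # threshold scaled down to each chunk's size. No false negatives, but
    # possible false positives.
    candidates = Set{Vector{String}}()
    for chunk in chunks
        local_min = ceil(Int, min_support * length(chunk))
        union!(candidates, chunk_frequent(chunk, local_min, max_k))
    end
    # Pass 2: one more read of every chunk to sum exact global supports and
    # drop the false positives.
    counts = Dict(c => 0 for c in candidates)
    for chunk in chunks, txn in chunk, c in candidates
        issubset(c, txn) && (counts[c] += 1)
    end
    return Dict(c => n for (c, n) in counts if n >= min_support * total)
end
```

In this sketch each chunk is read twice in total, once for the local mining pass and once for the final verification pass, regardless of how large the itemsets grow.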

JaredSchwartz commented 3 weeks ago

The way I understand that section, you make a first pass by calculating supports chunk-wise, which produces no false negatives, and then a second pass where the supports for the potential candidates are summed across the entire dataset to ensure there are no false positives. This is their proposed method for the candidate generation/filtering step in Apriori, and it produces the actual candidates with no false positives or false negatives. You would need (number of chunks × max rule length × 2) I/O reads.

What I understand you to be saying is that you skip the second pass at each level and do it once at the end of the process. Then you would only need (number of chunks) I/O reads. I suppose this could work, but it's risky.

I want to be clear: I think this is doable the right way, with all the I/O at each iteration. I don't think the I/O reads are going to make the thing absolutely crawl, especially because my transactions reader is really fast :) It's just going to be noticeably slower than loading everything into memory for the vast majority of desktop-sized datasets (<5M rows) out there.
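
For a rough sense of scale (illustrative numbers only): with 10 chunks and itemsets up to size 5, filtering at every level means on the order of 10 × 5 × 2 = 100 chunk reads, while deferring the false-positive check to a single final pass keeps it to roughly 10-20 reads in total.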

Tortar commented 3 weeks ago

Yes, then we agree :+1: I don't think it will always explode in terms of supersets of false positives, so a flag that decides whether to filter out false positives at each step could be a good idea. It would reduce the number of I/O reads at the cost of more memory used by the false-positive supersets.

JaredSchwartz commented 3 weeks ago

Sure, that just adds additional complexity to the function, which is more to design and build and test 🫠

Tortar commented 3 weeks ago

Yes, it's always a matter of how much time one has to spend, and I know something about that haha. Very cool package nonetheless; I think Julia was definitely missing something like this.