Feature Request: get attributes associated with mined pattern

chilligerchief commented 1 year ago

First of all, let me thank you for this amazing library. The pattern mining works like a charm.

Right now I am looking for a possibility to use the patterns in the context of the original sequence. The easiest option IMHO would be to view, in which sequences the mined pattern occured.

As far as I understood the algorithms, this information is stored in the MDD, alongside the attributes for each item in the sequence, which I am also interested in. Would it be possible for you to include an option, that returns these informations with the mined patterns, i.e. return a pattern in the following form:

[{'item':'A', time:[1,1,1,2,3]}, {'item':'B', 'time':[2,3,2,4,5]}, [1,2,5,8,12]]

Here, in addition to the items of a pattern, the associated attribute list is attatched (in this case only the 'time' attribute). And instead of (or possibly in addition to) the count of the occurences of the pattern, a list with the index of the input sequence is attatched. In this case the pattern occured in sequences 1,2,5,8 and 12.

This is only a suggestion for the output format, feel free to adapt it to the current internal data structure. I am not sure how complex it is to add this information. I think for starters, the sequence IDs would be really helpful.

Please let me know if this is something you would inlcude in the library.

skadio commented 1 year ago

Hi @chilligerchief, thank you for the very positive comments! Glad you enjoy the library --and please star it if it is useful in your applications.

Before jumping into how to make this work, I want to make sure to understand the behavior.

Given a set of sequences, Seq2Pat returns frequent patterns (subject to constraints). Am I correct that, on top of that, you would like to know which sequences in the original dataset exhibit the found patters. For example, found_pattern_1 is a sub-sequence of original_sequence_5, found_pattern_2 is a sub-sequence of original_sequence_10 and original_sequence_15, and so on.

Is that correct?

chilligerchief commented 1 year ago

Yes, that is perfectly correct. Maybe that is already possible and I am not finding the right feature ;)

Additionally, it would be super helpful, if the associated attributes can somehow be accessed, for example found_pattern_2 in origianal_sequence_10 had the following time attributes attributes_2_10 and in original_sequnce_15 attributes attributes_2_15.

But we can take it one step after the other.

EDIT: Both would be useful to find and label the patterns in the original dataset, where additional information is stored, so that the pattern information can be used in further processing steps.

skadio commented 1 year ago

well.. I have good news for you. What you are after might already be available. Essentially, one step is to go from sequences to patterns, Seq2Pat, and the other step is to go from patterns to (one-hot) features, Pat2Feat.

Check out the Dichotomic Pattern Mining example: https://github.com/fidelity/seq2pat#dichotomic-pattern-mining

This might already cover what you want. Specifically, see the one-hot feature generation part

pat2feat = Pat2Feat()
encodings = pat2feat.get_features(sequences, patterns)

Here sequences are the original input and patterns are the output of pattern mining for patterns found.

The output encodings are one-hot encoding for each sequence, with a 1 if the pattern can be found in that sequence.

Regarding the second part about attributes, given the one-hot encoding table, you can do a look-up.

Hope this helps,

chilligerchief commented 1 year ago

Thank you, that definitely helps. However, it seems to be horribly slow. Mining the patterns takes ~2sec on my test dataset, and generating the features takes ~30sec. (edit: it should be noted that i have very strong constraints which definitely makes pattern mining much faster, maybe they can also be used when finding the features?)

Wouldn't it be more efficient to generate the features while mining the patterns?

About the lookup, I can easily look up which patter occurs in which sequence. Is there also a way to find where the pattern begins, or even better at which indices of the sequence the pattern occurs? e.g the pattern 1 is found in sequence 5 at index 2,4,7.

skadio commented 1 year ago

Agree on generating features would be more efficient while mining (but changing our backend would be a major update, and is out of scope for now). That's why it takes longer to find them afterwards.

On that note, I challenge you on your comment :) "I can easily look up which patter occurs in which sequence"

It is not that easy, and that's what Pat2Feat is trying to do. I would be very curious to learn if you can easily generate these one-hot feature vectors.

chilligerchief commented 1 year ago

I am most likely doing something wrong here, so let me just check that with you: I have an input list of sequences sequences and a list of mined patterns patterns, where i dropped the last item (the count). Then I run encodings = pat2feat.get_features(uiellist, patterns, drop_pattern_frequency=False) and get one-hot encoded matrix of pattern occuring in my dataset. It seems to label a lot of false positives, since it does not take the constraints into account. Is there a way to specify the constraints? If I write a similar algorithm (pure python with numpy, no optimizations whatsoever) like this

for i, sequence in enumerate(sequences):
    seq = np.array(sequence)
    for j, pattern in enumerate(patterns):
        currentIndex = -1
        patterninside = True
        for pat in pattern:
            indices = np.where(seq == pat)[0]
            if np.all(indices <= currentIndex):
                patterninside = False
                break
            else:
                for index in indices:
                    if index > currentIndex:
                        currentIndex = index
                        break
        if patterninside:
            onehotmatrix[i, j] = 1

I get better performance (it runs in 2/3 of the time) and it actually labels more sequences, which seemed like an error, but I checked them, and without applying constraints, they exhibit the labeled pattern.

Can you check my algorithm and verify? I am probably misunderstanding the problem, so it would be great if you could clear things up.

chilligerchief commented 1 year ago

I created a small test setup, where you can see my findings:

https://github.com/chilligerchief/pat2feat_test

I used a smaller dataset, so it runs much quicker, but the relative difference stays roughly the same, and you can check some examples of the patterns that are labelled diffrently.

skadio commented 1 year ago

Have you looked at the API documentation of get_features() function?

Two comments:

First, see that there is a max_span parameter that defines the rolling window of where to search for the found patterns within sequences. I think the default is 10 or sth like that.. So if you run pat2feat() vs. find things brute force manually, the two might not give the same results.. Btw, you can try the max_span to None to search globally, but it will be slower than local search.

Second, see that there is a constraints parameter. Whether you just want to search the pattern within the sequence OR you want that + enforcing the attributes depend on the constraints.

Please see the usage example in the dichotomic mining notebook. cc: @takojunior

chilligerchief commented 1 year ago

Thanks for your answer. I missed the max_span parameter, that seemed to be the reason why brute force got different results. I still find it kind of odd, that the brute force solution is faster, but of course, your algorithm has to also work with constraints.

Also thanks for claryfying, that constraints have to be added as a parameter in the function call. Somehow, I expected to add constraints the same way I did with seq2pat...

I really appreciate your help! I will mark this issue as closed.

skadio commented 1 year ago

Thanks for the fun brainstorming!

I still think that you have a valid point and your idea of speeding this up is in the right direction.

If you find a way to speed up pat2feat() function —we would love to see your PR on that.

Thank you again for using the library, and plz star it if you find it useful

All the best, Serdar

fidelity / seq2pat

Feature Request: get attributes associated with mined pattern #38