Hello, your dataset does not seem that big; can you email it to me (you can anonymize the data if needed)? I will give it a look.
Usually, this comes from the size of the alphabet. I will try to debug the outputs using `find_similar_sequences` and `retrieve_sequence`.
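For reference, here is a rough sketch of how these helpers might be used to inspect what CPT matches against (the exact signatures are assumptions and may differ between versions):

```python
from cpt.cpt import Cpt

# Toy training data; 'SODL', 'TCU', 'SODR' are item names taken from this thread
training_set = [['SODL', 'TCU', 'SODR'],
                ['TCU', 'SODR', 'SODL']]

model = Cpt()
model.fit(training_set)

# Sequences CPT considers similar to a given prefix (signature assumed)
similar = model.find_similar_sequences(['SODL', 'TCU'])

# The training sequence stored at a given index (signature assumed)
first_sequence = model.retrieve_sequence(0)
```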
There are several possible solutions, which I can suggest depending on your dataset.
If I had to guess, I would say `SODL` is the first item of your first sequence. By definition, CPT cannot predict an item already present in the sequence. The length of the sequences can significantly impact the quality of the predictions.
For a benchmark, you can check: http://www.philippe-fournier-viger.com/ADMA2013_Compact_Prediction_trees.pdf
Especially this paragraph:

> CPT takes advantage of a longer prefix by finding more precise (longer) patterns in its prediction tree, to yield a higher accuracy. The accuracy of CPT gets higher as the prefix length is raised. But after the prefix reaches a length of 8 (specific to the dataset), the accuracy decreases. This is because the algorithm may not be able to match a given prefix to any branches in the prediction tree. It means that this parameter should be finely tuned for each dataset to maximize the accuracy.
A simple check is to build the test set by taking, for each of your sequences, only the 5 items before the final one as the prefix to predict from:

```python
prediction_sequence_length = 5
# For each sequence, keep the 5 items preceding the last one as the query prefix
prediction_set = [row[-prediction_sequence_length - 1:-1] for row in training_set]
```
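To run that check end to end, here is a minimal sketch (assuming the `cpt` package's `Cpt` class with its `fit`/`predict` API) that holds out the last item of each sequence as the target and measures accuracy on the 5-item prefixes:

```python
from cpt.cpt import Cpt

# training_set stands for your own data: a list of sequences (lists of items),
# each at least prediction_sequence_length + 1 items long
prediction_sequence_length = 5
prediction_set = [row[-prediction_sequence_length - 1:-1] for row in training_set]
targets = [row[-1] for row in training_set]  # held-out last items

model = Cpt()
model.fit(training_set)

predictions = model.predict(prediction_set)
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
print(f"accuracy with {prediction_sequence_length}-item prefixes: {accuracy:.2f}")
```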
@bluesheeptoken Thank you for the help and response.
"By definition CPT cannot predict an item already present in the sequence": so does this mean I should not be getting `SODL`, since it is already in the sequence, and that there is something wrong with my usage?
Also, I got the point about the length of the sequences and keeping it to 5, but I didn't get why this is linked to the first item of the first sequence.
Can you please share your email? I will send the anonymized data.
The first item of the first sequence is linked to the fact that if CPT does not know what to predict, it predicts the first letter of your alphabet (here, the first item of your first sequence).
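For illustration, a tiny sketch of this fallback (hypothetical items; the exact tie-breaking may differ between versions):

```python
from cpt.cpt import Cpt

model = Cpt()
model.fit([['SODL', 'TCU', 'SODR'],
           ['TCU', 'SODR']])

# This prefix matches no branch of the prediction tree, so per the
# explanation above the model should fall back to the first item it
# has seen, i.e. 'SODL'
print(model.predict([['UNSEEN_A', 'UNSEEN_B']]))
```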
My email address can be found here: https://github.com/bluesheeptoken/CPT/blob/master/setup.py#L27, louis.fruleux1[at]gmail.com
Okay, thank you.
I have sent you the CSV file. Thanks again for the help.
I have sent back some ideas to improve the quality of the predictions; can you tell me if you find them useful?
@bluesheeptoken Thank you for your prompt response and your guidance through email; it was definitely very helpful.
Thanks for the awesome work.
Hello,
There is either something extremely wrong with the implementation, or I haven't understood it right. For the following sequences it predicts the same item, which is not the case in the actual data. At testing time, for each of the 24 sequences I provide the first n-1 items (removing the last one).
For instance, here are two sequences from the 24-sequence training data.
Across all 24, it produces:
```
'SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODR,TCU,TCU'
```
Which is quite strange, as the sequences are quite different.
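For reference, a minimal sketch of the evaluation being described (assuming the `cpt` package's API; `training_set` stands for the 24 sequences):

```python
from cpt.cpt import Cpt

model = Cpt()
model.fit(training_set)  # the 24 training sequences

# Predict from each sequence with its last item removed (the n-1 prefix)
test_prefixes = [row[:-1] for row in training_set]
predictions = model.predict(test_prefixes)
print(','.join(predictions))
```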