bluesheeptoken / CPT

Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction (cython implementation)
https://cpt.readthedocs.io/en/latest/
MIT License
41 stars 7 forks source link

Predicts same item for every sequence #80

Closed abdulbasitds closed 4 years ago

abdulbasitds commented 4 years ago

Hello,

There is either something extremely wrong with the implementation or I havent understood it right. for the follwoing sequences it predicts same item (which is not the case in actual). so on all 24 sequences, at time of testing ( if i provide n-1 items, removing last one

for instance, here is two sequence from 24 training seq data

['SODL', 'TCU', 'IPC', 'TCM', 'IPMB', 'GWM', 'GWM', 'TCM', 'IPC', 'SODL', 'SODR', 'CHCM', 'GWM', 'PSCM', 'TCU', 'ABS', 'ABS', 'IPMA', 'IPMA', 'SODR', 'RCM', 'PSCM', 'IPMA']
['TCM', 'IPC', 'CHCM', 'TCM', 'ABS', 'ABS', 'ABS', 'ABS', 'IPMB', 'PSCM', 'PSCM', 'TCU', 'ABS', 'ABS', 'ABS', 'ABS', 'TCM', 'RCM', 'PSCM', 'IPMA', 'IPMA']

out of 24, it produces

'SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODL,SODR,TCU,TCU'

Which is quite strange as sequences are quite different

bluesheeptoken commented 4 years ago

Hello, your dataset does not seem that big, can you email it to me (you can anonymize the data if needed). I will give it a look.

Usually, this comes from the size of the alphabet. I will try to debug the outputs using find_similar_sequences and retrieve_sequence.

There can be several solutions which I might expose depending on your dataset

bluesheeptoken commented 4 years ago

With a random guess, I would say SODL is the first item of your first sequence. By definition CPT cannot predict an item already present in the sequence. The length of the sequences can significantly impact the quality of the predictions.

For a benchmark, you can check: http://www.philippe-fournier-viger.com/ADMA2013_Compact_Prediction_trees.pdf

Especially this paragraph:

. CPT takes advantage of a longer prefix by finding more precise (longer) patterns in its prediction tree, to yield a higher accuracy.The accuracy of CPT gets higher as the prefix length is raised. But after the prefix  reaches  a  length  of  8  (specific  to  the  dataset),  the  accuracy  decreases.This is because the algorithm may not be able to match a given prefix to any branches in the prediction tree. It means that this parameter should be finely tuned for each dataset to maximize the accuracy

A simple check to do can be to split the test set by taking only the last 5 items to predict of each of your sequence:

prediction_sequence_length = 5
prediction_set = [row[-prediction_sequence_length-1:-1] for row in training_set]
abdulbasitds commented 4 years ago

@bluesheeptoken Thank you for the help and response.

"By definition CPT cannot predict an item already present in the sequence" SO does it means that that I should not be getting SODL as it is already in the sequence? and there is something wrong with my usage?

Also, I got the point of length of sequence and keeping it to 5. But didnt get why it is linked with first item of the first sequence.

Can you please share you email, I will send anonymized data.

bluesheeptoken commented 4 years ago

The first item of the first sequence is linked to the fact that if CPT does not know what to predict, it predicts the first letter of your alphabet (here, the first item of the first sequence)

My mail adress can be found here: https://github.com/bluesheeptoken/CPT/blob/master/setup.py#L27, louis.fruleux1[at]gmail.com

abdulbasitds commented 4 years ago

Okay thank you.

I have sent you the CSV file. Thanks again for the help.

bluesheeptoken commented 4 years ago

I have sent back some ideas to improve quality of predictions, can you tell me if you find them useful ?

abdulbasitds commented 4 years ago

@bluesheeptoken Thank you for your prompt response and and your guidance through email, it was definitely very helpful.

Thanks for the awesome work.