bluesheeptoken / subseq

Succinct BWT based sequence prediction: a lossless model for sequence prediction
MIT License

probability #17

Closed. localcreator closed this issue 1 year ago

localcreator commented 1 year ago

Hi! Is there any way I can see the probability? Something like:

```python
model = Subseq(1)
model.probability()
```

bluesheeptoken commented 1 year ago

Hey @localcreator Sorry for the late response; I don't get notifications, and I don't know why...

What would be the probability here? The % of each "word" in the "corpus"? Or maybe you meant the probability of a prediction given a sequence?

localcreator commented 1 year ago

Hi! Simple probabilities in %, just as implemented in sklearn. This fits better: "Or maybe you meant the probability of a prediction given a sequence?"

bluesheeptoken commented 1 year ago

I am not familiar with probabilities in sklearn. Are they relative probabilities across the different outcomes, or absolute probabilities?

For instance, if the algorithm sees the two outcomes A and B, and A is 3 times more likely, does it return A: 75%, B: 25%, or something more like A: 6%, B: 2%, others: 92%?

Currently, neither is implemented. But I guess we could easily expose the relative probabilities, as an indicator rather than an absolute truth.
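To make the distinction concrete, here is a purely illustrative sketch (not part of the subseq API) of the two options for the A/B example, assuming a hypothetical corpus of 50 events:

```python
# Illustrative only, not part of the subseq API: contrast relative vs.
# absolute probabilities when outcome A is 3 times more likely than B.
counts = {"A": 3, "B": 1}

# Relative: normalize over the observed outcomes only -> A: 0.75, B: 0.25
total_observed = sum(counts.values())
relative = {k: v / total_observed for k, v in counts.items()}

# Absolute: normalize against a hypothetical corpus of 50 events, leaving
# the remaining mass to "others" -> roughly A: 0.06, B: 0.02, others: 0.92
corpus_size = 50
absolute = {k: v / corpus_size for k, v in counts.items()}
absolute["others"] = 1 - sum(absolute.values())

print(relative)
print(absolute)
```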

localcreator commented 1 year ago

This fits better: "if the algorithm sees the two outcomes A and B, and A is 3 times more likely, does it return A: 75%, B: 25%?" model.probability() would simply print these probabilities for each prediction. The result for the next forecasts A, B, C, D might look like [0.7, 0.1, 0.05, 0.15], and then it would be clear that the most likely forecast is A (0.7).

bluesheeptoken commented 1 year ago

It would make sense to have them, to see how much more likely one prediction is compared to another.

I think we could compute this from the weights calculated by FrequencyArray, which is what get_k_best_letter already uses to pick the predictions.

If you would like to give it a shot, I would be happy to help. Otherwise, I will try to schedule it on my side, but unfortunately it won't be done this week.
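As a conceptual sketch only (not the library's actual FrequencyArray or get_k_best_letter code), selecting the k best candidates from a mapping of weights boils down to ranking by weight:

```python
import heapq

# Conceptual sketch: given candidate weights (however they are computed
# internally), keep the k candidates with the highest weight.
def k_best(weights, k):
    return dict(heapq.nlargest(k, weights.items(), key=lambda item: item[1]))

print(k_best({'A': 6.0, 'B': 2.0, 'C': 4.5}, 2))  # {'A': 6.0, 'C': 4.5}
```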

localcreator commented 1 year ago

Unfortunately, I have little experience editing someone else's code. For now, my only hope is that someday you will have time to do it.

bluesheeptoken commented 1 year ago

Alright :+1:

Thanks for opening the issue, that's a good idea.

I will try to free up some time soon :)

localcreator commented 1 year ago

Hi! Has there been any progress on this issue?

bluesheeptoken commented 1 year ago

Hello, I was going to give my pet projects a bit of love next week. I was also going to ask if you still need it. Apparently yes, so I will try to release the next version by next week :)

localcreator commented 1 year ago

Yes, of course, I have been eagerly waiting for this update for more than a year, and I am still waiting! I have heard about "one week" somewhere before, I think it was last year)))

bluesheeptoken commented 1 year ago

Hello, please kindly note that this is an unsponsored open-source project. I am working on it in my free time and take no financial benefit from it. To be honest, I have no technical interest of my own in this improvement; it was only to help you. Now I am not even sure I will prioritize it.

localcreator commented 1 year ago

I understand, sorry for the unfortunate joke, sir. In any case, regardless of whether you wish to continue working on this project, I am sincerely grateful for the work you have done!

bluesheeptoken commented 1 year ago

Hello, I am having a look at this.

Would the weights be enough for your use case? Something like {'A': 0.04, 'B': 0.01}? Then on the client side you could compute {'A': 80%, 'B': 20%}.

I could also display the total_weights, but I am afraid they would confuse more than they would help in this kind of case. What do you think?

I try not to take any mathematical liberties with a research paper I don't own. And I feel like showing percentages such as {'A': 0.75} would be a bit misleading: people would think subseq is 75% sure that A is the prediction, which is wrong.
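A minimal sketch of the suggested client-side conversion, assuming the library hands back a plain dict of candidate weights; the helper name weights_to_relative is hypothetical, not part of subseq:

```python
def weights_to_relative(weights):
    """Turn raw subseq weights into percentages relative to the returned
    candidates only; these are not absolute probabilities of being right."""
    total = sum(weights.values())
    return {letter: 100 * weight / total for letter, weight in weights.items()}

print(weights_to_relative({'A': 0.04, 'B': 0.01}))  # {'A': 80.0, 'B': 20.0}
```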

bluesheeptoken commented 1 year ago

I have drafted a small PR: https://github.com/bluesheeptoken/subseq/pull/20/files#diff-9e237807a107b7849785e2833d8866c65e7ac4c052836d17d08ddef7ea0558a7R28-R37

Would something like this fit your needs well? For instance, in the linked test, C got a weight of 8 and B a weight of 4. This is because C is present in 2 sentences and B in only 1. (Note that this is not the only factor that impacts prediction; distance from the query also matters. TL;DR: if the text contains ADC and AB, B will have a higher weight than C, even though each appears in one sentence.)

localcreator commented 1 year ago

I don't know, it's worth a try. It is not yet clear to me why, if C is present in 2 sentences, its weight is 8 and not, for example, 2 or 4. I leave that to your discretion, whichever you consider more correct.

bluesheeptoken commented 1 year ago

Because the internal weights are a bit more subtle than just counting the number of sentences in which the letter appears. The full code is there for reference, but maybe the research paper is clearer.

If that's okay with you, I will release this version tomorrow, along with a release for Python 3.11.

localcreator commented 1 year ago

Thanks! Yes, of course, that would be great!

bluesheeptoken commented 1 year ago

That's published as version 1.0.4! I also released for Python 3.11, so you shouldn't have issues anymore.

Can you test it and tell me if it works for your use case? If you enjoy the project, don't hesitate to star it :)
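For reference, a minimal usage sketch against the 1.0.4 release: only Subseq(1) and predict_k_with_weights(sequence, k) appear in this thread, while the toy corpus and the fit() call are assumptions to be checked against the project README.

```python
from subseq import Subseq

# Hypothetical toy corpus; replace with your own sequences. The fit() call
# is an assumption based on the project README; adapt if the API differs.
sequences = [['a0', 'd1', 'a1'], ['a0', 'd1', 'd0'], ['b1', 'a1', 'd0']]

model = Subseq(1)
model.fit(sequences)

# New in 1.0.4: the k best predictions together with their internal weights.
# The weights are relative indicators, not absolute probabilities.
print(model.predict_k_with_weights(['a0', 'd1'], 2))
```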

localcreator commented 1 year ago

```python
model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 3)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5}
model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 4)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0}
model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 5)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0, 'c0': 19.0}
model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 4)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0}
model.predict_k_with_weights(['d1', 'd0', 'c1', 'b1'], 4)
{'a1': 86.5, 'd0': 77.0, 'd1': 63.0, 'c0': 42.5}
model.predict_k_with_weights(['d1', 'd0', 'c1'], 4)
{'a0': 617.0, 'a1': 612.5, 'd1': 534.0, 'd0': 511.5}
model.predict_k_with_weights(['d1', 'd0', 'c1', 'b1'], 4)
{'a1': 86.5, 'd0': 77.0, 'd1': 63.0, 'c0': 42.5}
model.predict_k_with_weights(['d1', 'b1', 'a0', 'a0'], 4)
{'d1': 189.0, 'd0': 119.0, 'a0': 97.0, 'a1': 79.5}
model.predict_k_with_weights(['b1', 'a0', 'a0', 'd1'], 4)
{'d1': 167.5, 'd0': 131.0, 'a1': 112.0, 'a0': 107.5}
model.predict_k_with_weights(['d1', 'a0', 'c0', 'd1'], 4)
{'a1': 343.0, 'a0': 267.5, 'd0': 238.0, 'd1': 219.5}
model.predict_k_with_weights(['a1', 'd0', 'b1', 'd0'], 4)
{'a1': 260.0, 'a0': 198.5, 'd0': 153.0, 'd1': 146.5}
model.predict_k_with_weights(['d0', 'c1', 'b1', 'c0', 'c1'], 4)
{'a1': 19.0, 'b1': 12.0, 'a0': 11.0, 'd1': 7.0}
```

The display of weights also works correctly. I think this issue can be closed. Thank you for your help, sir!

bluesheeptoken commented 1 year ago

Glad to hear that solves your issues :tada:

Sorry for the delay, and sorry for my bad reaction. I misunderstood your joke; I guess written communication does not help.

Happy to help again in the future, don't hesitate to ping me on issues, I don't always receive an email for that.

Thanks for the star :)