Hi! Is there any way I can see the probability? Something like:

```python
model = Subseq(1)
model.probability()
```
Hey @localcreator, sorry for the late response; I don't get notifications and I don't know why...
What would be the probability here? The % of each "word" in the "corpus"? Or maybe you meant the probability of a prediction given a sequence?
Hi! Simple probability in %, just as it is implemented in sklearn. This fits better: "the probability of a prediction given a sequence".
I am not familiar with probabilities in sklearn. Are they relative probabilities between the different outcomes, or absolute probabilities?
For instance, if the algorithm sees the two outcomes A and B, and A is 3 times more likely, does it return A: 75%, B: 25%, or rather something like A: 6%, B: 2%, Others: 92%?
Currently, neither is actually implemented. But I guess we could easily do something about the relative probabilities, as an indicator rather than an absolute truth.
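To make the difference concrete, here is the arithmetic behind those two readings (a toy sketch; the corpus size of 50 is assumed just to reproduce the numbers above):

```python
# Toy illustration of relative vs absolute probabilities (assumed numbers).
counts = {'A': 3, 'B': 1}   # A observed 3 times more often than B
corpus_size = 50            # assumed total number of observations

relative = {k: c / sum(counts.values()) for k, c in counts.items()}
absolute = {k: c / corpus_size for k, c in counts.items()}

print(relative)  # {'A': 0.75, 'B': 0.25}
print(absolute)  # {'A': 0.06, 'B': 0.02} -> 'Others': 0.92
```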
This fits better: "For instance, if the algorithm sees the two outcomes A and B, and A is 3 times more likely, does it return A: 75%, B: 25%, or rather something like A: 6%, B: 2%, Others: 92%?" `model.probability()` would simply print these probabilities for each prediction. The result for the next forecast of A, B, C, D may look like [0.7, 0.1, 0.05, 0.15], and then it is clear that the most likely forecast is A [0.7].
It would make sense to have them, to see how much more likely one prediction is compared to another. I think we could compute this from the weights calculated in `FrequencyArray`, and this is what `get_k_best_letter` does to get the predictions.
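As a rough sketch (not the library's actual implementation), deriving the k best predictions from per-letter weights could look like this:

```python
# Rough sketch only, not subseq's real code: keep the k letters with the
# highest weights, as a FrequencyArray-style structure might expose them.
import heapq

def get_k_best(weights, k):
    best = heapq.nlargest(k, weights.items(), key=lambda kv: kv[1])
    return dict(best)

print(get_k_best({'C': 8.0, 'B': 4.0, 'D': 1.5}, 2))  # {'C': 8.0, 'B': 4.0}
```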
If you would like to give it a shot, I would be happy to help. Otherwise, I will try to schedule this on my side, but this won't be done this week, unfortunately.
Unfortunately, I have little experience editing someone else's code. For now, my only hope is that someday you will have time to do this.
Alright :+1:
Thanks for opening the issue, that's a good idea.
I will try to free up some time soon :)
Hi! May I ask if there is any progress on this issue?
Hello, I was going to give my pet projects a bit of love next week. I was also going to ask you if you still need it. Apparently yes; I will try to release the next version by next week :)
Yes, of course, I have been eagerly waiting for this update for more than a year, and I still continue to wait! I have already heard about "one week" somewhere; it seems it was last year)))
Hello, please kindly note that this is an unsponsored open-source project. I am working on this in my free time and take no financial benefit from it. To be honest, there is no technical interest on my side in making this improvement; it was only to help you. Now I am not even sure I will prioritize this.
I understand, sorry for the unfortunate joke, sir. In any case, regardless of whether you wish to continue working on this project, I am sincerely grateful for the work you have done!
Hello, I am having a look at this.
Would the weights be enough for your use case? Would something like `{'A': 0.04, 'B': 0.01}` be enough?
Then, on the client side, you could compute `{'A': 80%, 'B': 20%}`.
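For instance, a minimal sketch of that client-side normalization (illustrative only, not part of subseq's API):

```python
# Client-side normalization of returned weights into relative percentages;
# the shape of the weights dict is assumed from the example above.
weights = {'A': 0.04, 'B': 0.01}
total = sum(weights.values())
percentages = {k: round(100 * w / total, 1) for k, w in weights.items()}
print(percentages)  # {'A': 80.0, 'B': 20.0}
```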
I could also display the `total_weights`, but I am afraid they would confuse more than they would help in this kind of case.
What do you think?
I try not to take any mathematical liberties with a research paper I don't own. And I feel like showing percentages such as `{'A': 0.75}` would be a bit misleading: people would think `subseq` is 75% sure that `A` is the prediction, which is wrong.
I have drafted a small PR: https://github.com/bluesheeptoken/subseq/pull/20/files#diff-9e237807a107b7849785e2833d8866c65e7ac4c052836d17d08ddef7ea0558a7R28-R37
Would something like this fit your needs well? For instance, in the test linked, `C` got a weight of 8 and `B` a weight of 4. This is because C is present in 2 sentences, and B in only 1. (Note that this is not the only factor that impacts the weight; distance from the query also matters. TL;DR: if the text contains ADC and AB, B will have a higher weight than C, even though each is present in only one sentence.)
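For illustration, a hypothetical session with the drafted method (the training sentences are invented, and the `fit` call is assumed; only `Subseq(1)` and the method name come from this thread):

```python
# Hypothetical sketch: the training data is invented and fit() is assumed;
# predict_k_with_weights is the method drafted in the PR above.
from subseq import Subseq

model = Subseq(1)
model.fit([['A', 'D', 'C'], ['A', 'B']])  # assumed training sentences
# Expected result shape: a dict mapping candidate letters to weights,
# with B weighted above C here, per the ADC/AB note above.
print(model.predict_k_with_weights(['A'], 2))
```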
I don't know, it's worth a try. It is still not clear to me why, if C is present in 2 sentences, the weight is 8 and not, for example, 2 or 4. But this is at your discretion, whichever you consider more correct.
Because the internal weighting is a bit more subtle than just counting the number of sentences in which you can find the letter. The full code is there for reference, but maybe the research paper is clearer.
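As a toy illustration only (this is not the actual formula, just the idea that count and proximity both contribute, so totals need not be plain sentence counts):

```python
# Toy scoring, NOT subseq's real formula: each occurrence of a letter
# contributes more when it sits closer to the query, so a letter seen
# in 2 sentences can end up with a weight other than 2 or 4.
def toy_weight(occurrence_distances):
    return sum(4.0 / (1 + d) for d in occurrence_distances)

print(toy_weight([0, 1]))  # seen twice, close to the query -> 6.0
print(toy_weight([3]))     # seen once, farther away        -> 1.0
```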
If that's okay with you, I will release this version tomorrow, together with a build for Python 3.11.
Thanks! Yes, of course, that would be great!
That's published under version 1.0.4! I also released for Python 3.11, so you shouldn't have issues anymore.
Can you test and tell me if this works for your use case? If you enjoy the project, don't hesitate to star it :)
```python
>>> model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 3)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5}
>>> model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 4)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0}
>>> model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 5)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0, 'c0': 19.0}
>>> model.predict_k_with_weights(['a0', 'd1', 'a1', 'd0', 'b1'], 4)
{'a1': 60.5, 'd0': 56.0, 'd1': 44.5, 'a0': 38.0}
>>> model.predict_k_with_weights(['d1', 'd0', 'c1', 'b1'], 4)
{'a1': 86.5, 'd0': 77.0, 'd1': 63.0, 'c0': 42.5}
>>> model.predict_k_with_weights(['d1', 'd0', 'c1'], 4)
{'a0': 617.0, 'a1': 612.5, 'd1': 534.0, 'd0': 511.5}
>>> model.predict_k_with_weights(['d1', 'd0', 'c1', 'b1'], 4)
{'a1': 86.5, 'd0': 77.0, 'd1': 63.0, 'c0': 42.5}
>>> model.predict_k_with_weights(['d1', 'b1', 'a0', 'a0'], 4)
{'d1': 189.0, 'd0': 119.0, 'a0': 97.0, 'a1': 79.5}
>>> model.predict_k_with_weights(['b1', 'a0', 'a0', 'd1'], 4)
{'d1': 167.5, 'd0': 131.0, 'a1': 112.0, 'a0': 107.5}
>>> model.predict_k_with_weights(['d1', 'a0', 'c0', 'd1'], 4)
{'a1': 343.0, 'a0': 267.5, 'd0': 238.0, 'd1': 219.5}
>>> model.predict_k_with_weights(['a1', 'd0', 'b1', 'd0'], 4)
{'a1': 260.0, 'a0': 198.5, 'd0': 153.0, 'd1': 146.5}
>>> model.predict_k_with_weights(['d0', 'c1', 'b1', 'c0', 'c1'], 4)
{'a1': 19.0, 'b1': 12.0, 'a0': 11.0, 'd1': 7.0}
```
The display of weights also works correctly. I think this issue can be closed. Thank you for your help, sir!
Glad to hear that solves your issues :tada:
Sorry for the delay, and sorry for my bad reaction. I misunderstood your joke, I guess the written communication does not help.
Happy to help again in the future, don't hesitate to ping me on issues, I don't always receive an email for that.
Thanks for the star :)