alerque / stack-verse-mapper

Index Bible verse references in Stack Exchange data dumps.
https://alerque.github.io/stack-verse-mapper
GNU Lesser General Public License v3.0

Support searching for multiple references #21

Open curiousdannii opened 8 years ago

curiousdannii commented 8 years ago

@scottgit wrote:

How does this algorithm handle multiple references? So let's say a person puts Romans 7:3-6 and Matt 5:32 in a single query (for some study on adultery). Does it rank each FPS and then add them together? Or average the FPS?

@curiousdannii wrote:

It doesn't currently; multiple references were in the too-hard basket. But now that I think more about it, it might not be too hard.

The way the program works now is that for each post, it filters the references to identify the ones which overlap with the query, and then calculates their specificity. We could make it check for either passage, and then continue as before.
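As a rough illustration of "check for either passage", here is a minimal sketch of the overlap filter extended to several query references. The `Ref` type, the `(chapter, verse)` tuples, and the function names are all hypothetical and are not taken from the project's code; it only assumes that a reference is a contiguous verse range within one book.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference: a contiguous verse range within one book."""
    book: str
    start: Tuple[int, int]  # (chapter, verse)
    end: Tuple[int, int]    # (chapter, verse), inclusive


def overlaps(a: Ref, b: Ref) -> bool:
    # Two ranges in the same book overlap when each starts before the other ends.
    return a.book == b.book and a.start <= b.end and b.start <= a.end


def matching_refs(post_refs: List[Ref], query_refs: List[Ref]) -> List[Ref]:
    # Keep every post reference that overlaps at least one query reference.
    return [r for r in post_refs if any(overlaps(r, q) for q in query_refs)]


# Query: Romans 7:3-6 and Matt 5:32
query = [Ref("Romans", (7, 3), (7, 6)), Ref("Matthew", (5, 32), (5, 32))]

# A post citing whole chapters Romans 7 and Matt 5, plus an unrelated verse
romans7 = Ref("Romans", (7, 1), (7, 25))
matt5 = Ref("Matthew", (5, 1), (5, 48))
john316 = Ref("John", (3, 16), (3, 16))

matched = matching_refs([romans7, matt5, john316], query)
```

Here `matched` keeps the Romans 7 and Matt 5 references and drops John 3:16; the specificity calculation would then continue on the matched references as before.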

So for a post which referenced Romans 7 and Matt 5, those two references would be identified, and their individual specificities calculated as 10.5 and 23.5. SPEC would be 10.5 for the Romans 7 passage, while TDRHP would be 2. (TDRHS calculated similarly.) For the tags, would it be worth awarding the bonus for each matching tag? Probably.

Hmm, but that doesn't give any way to distinguish between a post with Rom 7:3-6 and Matt 5:32 vs Romans 7:3-6 and Matt 5 - both would be given a SPEC of 0. I think calculating two full FPS and then averaging them would be hard to program, but we could easily average the SPEC scores for each query reference.

@scottgit wrote:

I actually think that instead of averaging (which penalizes any exact matches of one of the references), when multiple references are given we should calculate the full FPS of each and then add them together.

In this way, any posts with both passages referenced in some way will likely end up higher than 220, while any post with an exact match of one but no match at all of the other will still fall in a "normal" range of 220-3. Theoretically then, if 2 ranges are queried, the FPS range would be 440-3; then if 3 ranges are queried, the range would expand to 660-3, etc.

Would that be too difficult? It allows for "infinite" expansion, yet it will always rank higher the results that match more passages, and then the better the match for each passage, the higher the result ranks as well.
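The additive idea can be sketched in a few lines. This assumes each query reference already has its own FPS for a post (with 0 when the post does not match that reference at all) and takes the 220 maximum from the figures quoted above; the function names and the sample scores are made up for illustration.

```python
MAX_FPS_PER_REFERENCE = 220  # per-reference maximum quoted in this thread


def combined_fps(per_reference_fps):
    """Add up the FPS earned for each query reference (0 for no match)."""
    return sum(per_reference_fps)


def max_combined_fps(query_reference_count):
    """Upper bound of the combined score: 220, 440, 660, ..."""
    return MAX_FPS_PER_REFERENCE * query_reference_count


# Invented per-reference scores for a two-reference query
posts = {
    "matches both, one exactly": [220, 95],
    "exact match of one only": [220, 0],
}

# Highest combined score first
ranked = sorted(posts, key=lambda name: combined_fps(posts[name]), reverse=True)
```

With these invented numbers the post matching both references scores 315 and sorts above the single exact match at 220, and `max_combined_fps(3)` gives the 660 ceiling mentioned for three ranges.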

@curiousdannii wrote:

Were you thinking of the multi-reference searches being AND or OR? Adding makes more sense if they are OR, but if we limit the results to posts which match both references then there wouldn't be any functional difference between adding and averaging. I had been assuming we would make it AND, which is the default search conjunction for SE (and Google and most others). But OR might be useful too.

Multi-reference search will be harder to implement, and I'm not planning on doing that till the other things have progressed more. It might be better to make a new issue for it.

@scottgit wrote:

Hmm, I'm not sure I agree with your thoughts. Adding them gives you both AND and OR in one query. The ANDs would be ranked higher (likely over the 220, 440, 660 thresholds), but the ORs would be lower in the list, since they effectively have one less reference to add in. So a post that matches only one of a query's 2 refs could at best get a 220-point FPS, and it competes against posts that match both refs, which, with at least one good match, would be in the 300+ range.

EDIT: Clarification: I realize a query that returns any results in which one of the references is missing is technically just an OR query (as a true AND-only query would, as you noted, "limit the results to posts which match both references"). But what I am proposing is different from what I would perceive as a pure OR query as well, because with a pure OR, a Post with an exact match of either the Romans ref. OR the Matthew ref. would potentially end up with the exact same FPS value as a Post in which both references were found.

The following table illustrates the differences as I see it:

╔═══════════════════════════════════╦══════════════╦═══════════╦══════════════╗
║                                   ║   Pure AND   ║  Pure OR  ║ AND/OR w/Add ║
╠═══════════════════════════════════╬══════════════╬═══════════╬══════════════╣
║ One Post w/FPS 220 for Mt 5:32    ║ not returned ║ rank 220  ║ rank 220     ║
║ One Post w/FPS 220 for Rom 7:3-6  ║ not returned ║ rank 220  ║ rank 220     ║
║ One Post w/FPS 220 for both refs. ║ rank 220     ║ rank 220* ║ rank 440     ║
╚═══════════════════════════════════╩══════════════╩═══════════╩══════════════╝

* Assuming we are returning the highest of the two values to represent the Post, in this case they are of equal value.
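The three columns of the table can be reproduced with a small sketch. It assumes a post carries one FPS per query reference, with `None` where the reference is not matched; the `score` function and the mode names are hypothetical. Pure OR takes the highest value as in the footnote, and pure AND is assumed to do the same for posts that match every reference (the table does not say how AND combines the values).

```python
from typing import Optional, Sequence


def score(per_ref_fps: Sequence[Optional[float]], mode: str) -> Optional[float]:
    """Return a post's rank value, or None if the post is not returned."""
    matched = [s for s in per_ref_fps if s is not None]
    if not matched:
        return None  # the post matches none of the query references
    if mode == "and":
        # Pure AND: only posts matching every reference are returned.
        # Taking the highest value is an assumption.
        return max(matched) if len(matched) == len(per_ref_fps) else None
    if mode == "or":
        # Pure OR: highest single value represents the post (see footnote).
        return max(matched)
    if mode == "add":
        # AND/OR with addition: sum whatever matched.
        return sum(matched)
    raise ValueError(f"unknown mode: {mode}")


# Rows of the table, as [FPS for Mt 5:32, FPS for Rom 7:3-6]
matthew_only = [220, None]
romans_only = [None, 220]
both = [220, 220]

table = {
    name: [score(row, mode) for mode in ("and", "or", "add")]
    for name, row in (("Mt only", matthew_only),
                      ("Rom only", romans_only),
                      ("both", both))
}
```

Each entry of `table` lists the AND, OR, and Add results for that row, with `None` standing in for "not returned".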

So as a user, I would find the most value in the Addition where Posts with both the references I input returned a higher ranking because both refs. were found in the post, yet I am still "aware" of the Posts with only one reference, because they are lower in the search results (if I want to dig that far).

Am I understanding things correctly in my thought process here?

curiousdannii commented 8 years ago

You've convinced me, let's add the FPS for each reference.