Grouping results by Macomber ID in the Incipit Search

elambrinaki commented 4 years ago

We want to have the search results grouped by story ID. Since only the first ten results are shown, one ID might crowd out all the alternatives (e.g., a story has ten incipits from different manuscripts, and all these incipits appear in the search results). With this feature, the search results will contain ten different IDs.

We expect that all incipits for one ID appear as subsets. Each such group of incipits with the same ID counts as one search result. So, on the search results page we will have ten different IDs (ten groups of incipits, one group per ID).

kmcelwee commented 4 years ago

Preliminary questions

What is an example of crowding out that we can test against?
How might we need to change the UI to accommodate this feature?
Additional params can be fed directly into the get_results function in scripts/server.py. Do we want to put in the additional parameters manually? Or do we want to go back to the parasolr package?

Solr's group function

Result Grouping | Apache Solr Reference Guide 6.6

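Here's the proposed additional param: group=true&group.field=macomber_id_s. So we would change line 55 of server.py to:

results = queryset.get_results(**{"group": True, "group.field": "macomber_id_s"})

(group.field contains a dot, so it can't be written as a Python keyword argument; it has to be passed through an unpacked dict.)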

Notes

A groupby function returns a different object architecture, but by including the param group.main=true you can flatten the return object. We need to address whether and how the UI will change, though, before we decide whether or not to use this.
Sorting between groups is calculated by the highest score within groups.
And Solr recommends looking into Collapsing and Expanding, depending on how we want to approach this.
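For reference, here is a minimal sketch of how the non-flattened grouped response could be unpacked. The nesting (grouped, then the field name, then groups with groupValue and doclist) follows the Solr Result Grouping documentation; the helper name unpack_groups and the sample documents are made up for illustration and are not real PEMM data.

```python
def unpack_groups(response, field="macomber_id_s"):
    """Return (group value, docs) pairs from a grouped Solr response dict."""
    grouped = response["grouped"][field]
    return [(g["groupValue"], g["doclist"]["docs"]) for g in grouped["groups"]]


# Hand-written sample in the shape Solr returns when group=true (not real data)
sample = {
    "grouped": {
        "macomber_id_s": {
            "matches": 3,
            "groups": [
                {
                    "groupValue": "500",
                    "doclist": {
                        "numFound": 2,
                        "start": 0,
                        "docs": [{"id": "a"}, {"id": "b"}],
                    },
                },
                {
                    "groupValue": "100",
                    "doclist": {"numFound": 1, "start": 0, "docs": [{"id": "c"}]},
                },
            ],
        }
    }
}

for macomber_id, docs in unpack_groups(sample):
    print(macomber_id, [d["id"] for d in docs])
```

Note that group.limit defaults to 1, so showing several incipits per Macomber ID would also require raising group.limit.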

rlskoeser commented 4 years ago

@kmcelwee thanks for these great notes and questions!

@elambrinaki @WendyLBelcher please provide a few examples of search text that show the problem documented in this issue.

@kmcelwee I had imagined grouping the results but still displaying multiple entries, e.g. as a nested list. @elambrinaki @WendyLBelcher for the searches where you get multiple matches for a single macomber id, do you still want to see multiple different versions of that macomber story? If so, how many would you want? Is it helpful to know how many versions matched?

rlskoeser commented 4 years ago

@kmcelwee I would like to add the grouping functionality to parasolr, but it could make sense to prototype without that first. I'd imagined a group_by method that takes a field and sets the group and group.field parameters, and then possibly a grouped response object that handles the different result structure (or maybe the collapsed/expanded structure, depending on which route we go). FWIW, as you may remember, PPA uses Solr grouping (to different ends, to match volumes with pages). It isn't using parasolr yet, so having group-by functionality in parasolr would make it easier to migrate. I think PPA actually uses the collapsing and expanding; IDK if that should influence our decision here or not.
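A rough sketch of what such a group_by method might look like, for discussion only. SketchQuerySet is a toy stand-in written for this example, not parasolr's actual class or API; the only things taken from the thread are the method name group_by and the group and group.field parameters.

```python
class SketchQuerySet:
    """Toy stand-in for a queryset; NOT parasolr's actual class."""

    def __init__(self, params=None):
        self.params = dict(params or {})

    def group_by(self, field):
        """Return a copy with Solr's group and group.field parameters set."""
        clone = SketchQuerySet(self.params)
        clone.params.update({"group": "true", "group.field": field})
        return clone

    def query_opts(self):
        """Return the parameters that would be sent to Solr."""
        return dict(self.params)


base = SketchQuerySet({"q": "*:*"})
qs = base.group_by("macomber_id_s")
print(qs.query_opts())
```

Returning a copy rather than mutating in place keeps the method chainable without side effects on the original queryset.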

elambrinaki commented 4 years ago

@WendyLBelcher @rlskoeser @kmcelwee Thank you for working on this feature!

I imagine grouping by Macomber ID only if results appear together (as consecutive search results). One Macomber ID (say, 500) might have three very different story versions (say, X, Y, and Z). Assume a story with Macomber ID 500, version X, appears in five manuscripts in our database, and incipits are very similar. These five manuscripts appear together in the search results when testing ID 500 X from a new manuscript. At the same time, incipits of the same ID 500, but versions Y and Z are very different, and they will be much lower in the search results. For example, the search might produce:

(1) ID 500 manuscript 1 (the one that contains version X)
(2) ID 500 manuscript 2 (the one that contains version X)
(3) ID 500 manuscript 3 (the one that contains version X)
(4) ID 500 manuscript 4 (the one that contains version X)
(5) ID 500 manuscript 5 (the one that contains version X)
(6) ID 100
(7) ID 101
(8) ID 102
(9) ID 103
(10) ID 500 (version Y)

I think we need the first five search results to appear as a nested list, but the 10th search result to be separate from them (despite having the same ID):

(1) ID 500 manuscript 1 (the one that contains version X)
    ID 500 manuscript 2 (the one that contains version X)
    ID 500 manuscript 3 (the one that contains version X)
    ID 500 manuscript 4 (the one that contains version X)
    ID 500 manuscript 5 (the one that contains version X)
(2) ID 100
(3) ID 101
(4) ID 102
(5) ID 103
(6) ID 500 (version Y)
(7) something else that was not visible before
(8)
(9)
(10)
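The consecutive-only clustering described above can be sketched as a post-processing step over the ranked results. This is plain Python on an already-sorted list, not Solr's grouping feature; the function name cluster_consecutive and the dictionary keys are assumptions for illustration, and the data mirrors the hypothetical example above.

```python
from itertools import groupby


def cluster_consecutive(results, key="macomber_id"):
    """Nest adjacent results that share an ID; non-adjacent repeats stay separate."""
    return [
        (value, list(run))
        for value, run in groupby(results, key=lambda r: r[key])
    ]


# Ranked results from the hypothetical example (five version X hits, then others)
ranked = [{"macomber_id": "500", "manuscript": f"manuscript {n}"} for n in range(1, 6)]
ranked += [{"macomber_id": i, "manuscript": "other"} for i in ("100", "101", "102", "103")]
ranked.append({"macomber_id": "500", "manuscript": "version Y"})

clusters = cluster_consecutive(ranked)
for position, (macomber_id, run) in enumerate(clusters, start=1):
    print(position, macomber_id, len(run))
```

Because clustering shrinks ten rows into fewer entries, filling ten slots on the page would mean fetching more than ten rows from Solr first.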

rlskoeser commented 4 years ago

@elambrinaki the grouping functionality needs an identifier to group things — when we discussed this feature before we'd discussed using the Macomber ID. The relevance will be the highest score within the group, as Kevin noted. Unless there is another field to group them on that will do what you're proposing, I'm not sure how we can do that. Is the "version X" of the incipit you mention exactly the same or are there still variations within the group?

elambrinaki commented 4 years ago

@rlskoeser @WendyLBelcher In my hypothetical example above, search results 1-5 (same ID, same version) and 10 (same ID, another version) should not appear as a group. It needs to be a cluster of results 1-5 and a distinct result 10. This would allow me to see whether the story I'm testing is recension X or recension Y.

So I think we need a combined Story ID + recension ID variable to implement the grouping feature. (We don't have it now.)

Is the "version X" of the incipit you mention exactly the same or are there still variations within the group?

The incipit is not exactly the same; there are variations within the group.
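Since Solr's group.field takes a single field name, one way to get a Story ID + recension ID variable would be to store a combined value on each record when indexing. A minimal sketch, assuming a recension ID existed in the data (it does not yet); the function name and the fields recension_id_s and story_recension_s are hypothetical.

```python
def story_recension_key(macomber_id, recension_id=None):
    """Combine story ID and recension ID into one groupable string."""
    if recension_id is None:
        # No recension recorded: fall back to the story ID alone
        return str(macomber_id)
    return f"{macomber_id}-{recension_id}"


# Hypothetical record; recension_id_s is not a field we have today
doc = {"macomber_id_s": "500", "recension_id_s": "X"}
doc["story_recension_s"] = story_recension_key(
    doc["macomber_id_s"], doc["recension_id_s"]
)
print(doc["story_recension_s"])
```

Grouping would then use group.field=story_recension_s instead of macomber_id_s.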

WendyLBelcher commented 4 years ago

To be honest, I think the clustering might be tough. And maybe not needed. I'm not sure. For instance, I did a search on the proper noun ደቅስዮስ. Two miracle stories have this name in them. So, it might be 13 or it might be 474. So, it's correct for the tool to NOT group the ID 13's together, but to drop them lower, exactly as it did.

Screenshot search tool

elambrinaki commented 4 years ago

@WendyLBelcher Wendy, it seems that the "Exclude from ITool" field will solve the problem with overcrowding (all ten search results corresponding to the same ID but from different manuscripts). Should we withdraw this feature request?

kmcelwee commented 4 years ago

@WendyLBelcher @elambrinaki Would you mind confirming whether or not this is an issue I should work on before we talk on Thursday? And if so, would you mind confirming that this is in line with what you would like? (I'll make sure to group only if the version matches.)

Screen Shot 2020-06-02 at 4 09 48 PM

elambrinaki commented 4 years ago

@kmcelwee What does "if the version matches" mean? What version?

kmcelwee commented 4 years ago

@elambrinaki What I meant was I'll make sure to group by "Story ID+recension ID", not just Story id.

elambrinaki commented 4 years ago

@kmcelwee Aha, got it! I have an example that I want to discuss. I'll post it soon. Please wait for me before you begin working on this feature.

elambrinaki commented 4 years ago

@kmcelwee @rlskoeser I am wondering whether grouping results will make them more sensitive to mistakes.

Assume we have the following search results:

Regular view

The first record is a mistake (a cataloger mistyped 13). It is quite obvious in this view. So I will not assign ID 3 to the story I'm testing.

But if we group the results, they will look like this:

Grouping

Will it be as obvious to me as before that my ID is 13, not 3?

elambrinaki commented 4 years ago

Decided not to do it.

Princeton-CDH / pemm-scripts