avalonmediasystem / avalon

Avalon Media System – Samvera Application
http://www.avalonmediasystem.org/
Apache License 2.0

Add transcript text to Solr #5730

Closed (elynema closed 2 months ago)

elynema commented 6 months ago

Description

Based on outcome from investigation in #5726, actually add transcript text to Solr so that it is searched as part of the Avalon repository search. Any media items with result(s) in the transcript should be returned in the search results.

First pass implementation could just index the entire transcript as one field. We may find we want to break the transcripts up and store them at the cue level in the index to better support IIIF content search if we decide to go that route.

Done Looks Like

elynema commented 5 months ago

Still not able to get search results to return when hits are only in the transcript, not in the metadata or sections.

cjcolvar commented 5 months ago

I was able to solve the problem of getting search results when only the transcript matches the query. I came up with two different approaches. As far as I've been able to figure out, both require switching from the edismax parser to the lucene parser, which might change the ranking of results.

Nested documents (requires a reindex to add _root_ and _nest_path_ fields): https://github.com/avalonmediasystem/avalon/commit/051c9318d665a96e9b628754827843830488b745

Independent documents (requires a DB migration for supplemental files): https://github.com/avalonmediasystem/avalon/commit/3106c9d0fffb47055b3dd5bddb5a0c51eccabc7f

cjcolvar commented 5 months ago

Might be able to still use edismax parser by doing something like:

({!edismax}Testing) OR {!join to=id from=isPartOf_ssim}{!join to=id from=isPartOf_ssim }transcript_tsim:IU OR transcript_tsim:East

elynema commented 5 months ago

The proof of concept put the transcript text on the masterfile document as one big field holding all transcripts for that section. That didn't allow hits to be correlated with individual transcripts, so transcripts need to be indexed independently.

Options:

  1. make each transcript its own Solr document and have a field that maintains the relationship (id) with a join query

    Pros:

    • doesn't require any changes to Solr
    • should be able to index transcript documents only, not a full re-index

    Cons:

  2. nested documents. Transcript document is a child of the masterfile document. May make it easier to maintain the relationship between transcript and masterfile.

    Pros:

    Cons:
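
To make the two options concrete, here is a rough sketch of what the indexed documents might look like in each case. It is illustrative only and not taken from the Avalon code: the helper names, the `transcripts` child key, and the sample ids are hypothetical, while `isPartOf_ssim` and `transcript_tsim` are the field names used in the query posted earlier in this thread.

```python
def independent_docs(section_id, transcripts):
    """Option 1: one Solr document per transcript.

    Each transcript document points at its masterfile (section) through
    isPartOf_ssim, so a join query can map transcript hits back to the parent.
    """
    return [
        {
            "id": t["id"],
            "isPartOf_ssim": [section_id],
            "transcript_tsim": [t["text"]],
        }
        for t in transcripts
    ]


def nested_doc(section_id, transcripts):
    """Option 2: transcripts nested as child documents of the masterfile.

    Assumes a schema with _root_ and _nest_path_, which Solr fills in itself
    at index time; the "transcripts" key is a hypothetical child label.
    """
    return {
        "id": section_id,
        "transcripts": [
            {"id": t["id"], "transcript_tsim": [t["text"]]}
            for t in transcripts
        ],
    }


sample = [
    {"id": "transcript-1", "text": "Welcome to IU East"},
    {"id": "transcript-2", "text": "Second transcript"},
]
flat = independent_docs("section-1", sample)
nested = nested_doc("section-1", sample)
```

With option 1 only the transcript documents have to be added to the index; with option 2 the parent document is re-sent together with its children, which is consistent with the reindex noted for the nested approach.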

POC for both scenarios works for retrieving highlighting hits and term counts and for getting search results where the only hit in the item is in the transcript. Chris thinks he has figured out how to work around the parser issue so that we can continue to use edismax in both scenarios: the user-entered query string is wrapped in edismax.
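
One way the "wrap the user query in edismax" idea could look for option 1 is sketched below. This is an untested illustration, not the code from the linked commits: `build_search_query` and the parameter names `user_query`, `section_join`, and `transcript_query` are hypothetical, and passing the user input by `$param` reference instead of inlining it is an assumption made to sidestep escaping. The two joins mirror the query posted earlier in the thread, presumably transcript to section and section to item.

```python
def build_search_query(user_input: str) -> dict:
    """Build Solr request parameters for a metadata-or-transcript search.

    The top-level parser is lucene so that the two clauses can be OR'ed,
    while the user-entered string itself is still parsed by edismax.
    """
    return {
        "defType": "lucene",
        # Parentheses keep the leading {!edismax ...} from being read as
        # local params for the entire query string.
        "q": "({!edismax v=$user_query}) OR "
             "({!join from=isPartOf_ssim to=id v=$section_join})",
        # Inner join: transcript documents -> their parent section.
        "section_join": "{!join from=isPartOf_ssim to=id v=$transcript_query}",
        # The same user input, restricted to the transcript text field.
        "transcript_query": "{!edismax qf=transcript_tsim v=$user_query}",
        # Raw user input, referenced by the clauses above.
        "user_query": user_input,
    }


params = build_search_query("IU East")
```

The resulting dictionary would be sent as ordinary request parameters to the Solr select handler; ranking may still differ from a pure edismax query because the transcript clause is scored separately.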

Not yet sure of performance impacts of these approaches. Seems like impact to size of index should be similar. But not clear how query return time might be impacted as size of index grows. Should consider a future where everything has a transcript.

Requiring a Solr upgrade and another reindex is not particularly desirable, although it sounds like option 2 might be the better long-term solution. Option 2 is supposed to be performant; we're not totally sure about option 1. Recommend proceeding with option 1 as a prototype and testing performance against the mco-staging dataset.

elynema commented 3 months ago

@cjcolvar I think I found an example where this doesn't seem to be working correctly.

See record: https://avalon-dev.dlib.indiana.edu/media_objects/g158bh28p

When I do a repository search for 'california' this record is the first result, even though it shows no hits in the metadata, transcript, or sections. When I search within the first transcript, I get no hits, even though the word 'California' is clearly in that transcript and it is a .vtt file. Any idea what might be going on? Could the weird file name for the transcript file be causing issues? Looks like the file name is displaying improperly in the transcript component.

cjcolvar commented 3 months ago

@elynema When I test that item, I'm also seeing it appear in search results with no hits in metadata/transcript/sections. Searching in the first transcript, I also get no hits, but the content search JSON response from the backend does have the one expected hit. So I think it is being indexed correctly, but there may be a bug in search result hit counting. In ramp, there are some changes around fixing hit counts and result finding in https://github.com/samvera-labs/ramp/pull/532 that will be merged soon.

elynema commented 3 months ago

@cjcolvar Why aren't we seeing any hits in transcript on the Blacklight side, though?

cjcolvar commented 3 months ago

I'm not sure. I'll look into it.

cjcolvar commented 3 months ago

I found the issue. In the subqueries we missed raising the row limit above the default of 10. For that item the first section wasn't getting returned, so its hit wasn't included in the count. I'll make a PR for this fix.
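
For readers following along, the kind of change described above might look like the sketch below. It assumes the subqueries are Solr `[subquery]` document transformers, which is a guess since the thread does not say; `with_subquery`, the `sections` label, and the limit of 1000 are hypothetical and not taken from the actual fix.

```python
def with_subquery(params: dict, name: str, child_field: str, rows: int = 1000) -> dict:
    """Return a copy of params with a [subquery] transformer named `name`.

    Without an explicit <name>.rows, each subquery returns at most 10
    documents, so children beyond the tenth are silently left out.
    """
    out = dict(params)
    fl = out.get("fl", "*")
    out["fl"] = f"{fl},{name}:[subquery]"
    # Fetch children whose child_field matches the id of the current result row.
    out[f"{name}.q"] = f"{{!terms f={child_field} v=$row.id}}"
    # Raise the per-subquery limit above the default.
    out[f"{name}.rows"] = str(rows)
    return out


query = with_subquery({"q": "california"}, "sections", "isPartOf_ssim")
```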

cjcolvar commented 3 months ago

We probably also need to test that content search works for more than 10 transcripts, but that is probably an edge case right now.

joncameron commented 2 months ago

👍