elynema commented 6 months ago

Description

Based on outcome from investigation in #5726, actually add transcript text to Solr so that it is searched as part of the Avalon repository search. Any media items with result(s) in the transcript should be returned in the search results.

First pass implementation could just index the entire transcript as one field. We may find we want to break the transcripts up and store them at the cue level in the index to better support IIIF content search if we decide to go that route.

Done Looks Like

- [x] Media object is returned if there are any matches across any of the transcripts tied to any section
- [x] Phrase searches entered with quotation marks are followed (only records with transcripts that have that phrase are returned as search results)
- [x] Captions marked as transcripts are also searched
- [x] Rake or migration task written for taking transcript documents (or captions marked as transcripts) and adding them to the index. This task should also support population of the Has Captions and Has Transcript facets, which will require re-indexing of the masterfile that owns the transcript/caption.

elynema commented 5 months ago

Still not able to get search results to return when hits are only in the transcript, not in the metadata or sections.

cjcolvar commented 5 months ago

I was able to solve the problem of getting search results when only the transcript matches the query. I came up with two different approaches. As far as I've been able to figure out, both require switching from the edismax parser to the lucene parser, which might cause the ranking of results to change.

Nested documents (requires a reindex to add _root_ and _nest_path_ fields): https://github.com/avalonmediasystem/avalon/commit/051c9318d665a96e9b628754827843830488b745

Independent documents (requires a DB migration for supplemental files): https://github.com/avalonmediasystem/avalon/commit/3106c9d0fffb47055b3dd5bddb5a0c51eccabc7f

cjcolvar commented 5 months ago

Might be able to still use edismax parser by doing something like:

({!edismax}Testing) OR {!join to=id from=isPartOf_ssim}{!join to=id from=isPartOf_ssim }transcript_tsim:IU OR transcript_tsim:East
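A minimal sketch of how a query like the one above could be assembled with Solr parameter dereferencing (`v=$name`), so that the user's input is never spliced into the query string itself. The field names `isPartOf_ssim` and `transcript_tsim` come from this thread; the parameter names `user_q`, `section_q`, and `transcript_q`, and the choice to run the transcript clause through edismax with `qf=transcript_tsim`, are assumptions for illustration and have not been tested against Avalon's Solr configuration.

```python
from urllib.parse import urlencode


def build_transcript_search_params(user_query: str) -> dict[str, str]:
    """Build Solr params that OR an edismax query with a two-level transcript join."""
    return {
        # The top-level parser is lucene so the {!...} local params below are honored.
        "defType": "lucene",
        # Outer join: matching section documents -> their parent media objects.
        "q": "({!edismax v=$user_q}) OR ({!join to=id from=isPartOf_ssim v=$section_q})",
        # Inner join: matching transcript documents -> their parent sections.
        "section_q": "{!join to=id from=isPartOf_ssim v=$transcript_q}",
        # Assumption: reuse edismax for the transcript clause, limited to transcript_tsim.
        "transcript_q": "{!edismax qf=transcript_tsim v=$user_q}",
        # The raw user input is passed by reference only.
        "user_q": user_query,
    }


params = build_transcript_search_params('"East Asian"')
query_string = urlencode(params)
```

Keeping the user input in its own parameter means quotation marks in a phrase search do not need to be escaped inside `q`.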

elynema commented 5 months ago

The proof of concept put the transcript text on the masterfile as one big field holding all transcripts for that section. That didn't allow hits to be correlated to individual transcripts, so transcripts need to be indexed independently.

Options:

Option 1: make each transcript its own Solr document, with a field that maintains the relationship (id) to its parent, and search it with a join query.

Pros:
- doesn't require any changes to Solr
- should be able to index transcript documents only, not a full re-index

Cons:
- requires a new db field in supplemental files with the parent ID, so this adds a small database migration. Would have to go back and backfill, similar to how we did for captions in the 7.7.0 release.
- complex Solr queries with multiple joins

Option 2: nested documents. The transcript document is a child of the masterfile document. May make it easier to maintain the relationship between transcript and masterfile.

Pros:
- simplifies Solr queries: no multiple joins, which are very gnarly to write correctly
- Solr is building up support for child documents

Cons:
- Solr needs new fields added to every document, _root_ and _nest_path_. This would require a full re-index.
- Nested documents might require Solr 8/9. We have not had a hard requirement for Solr 9 yet.

The POC for both scenarios works for retrieving highlighting hits and term counts, and for getting search results where the only hit in the item is in the transcript. Chris thinks he has figured out how to work around the parser issue so that we can continue to use edismax in both scenarios: the user-entered query string is wrapped in edismax.

Not yet sure of performance impacts of these approaches. Seems like impact to size of index should be similar. But not clear how query return time might be impacted as size of index grows. Should consider a future where everything has a transcript.

Requiring a Solr upgrade and another reindex is not particularly desirable, although it sounds like option 2 might be the better long-term solution. Option 2 is supposed to be performant; we're not totally sure about option 1. Recommend proceeding with option 1 as a prototype and testing performance against the mco-staging dataset.
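To make the difference between the two options concrete, here is a rough sketch of the document shapes each one would send to Solr. The field names `isPartOf_ssim` and `transcript_tsim` come from this thread; the child field name `transcripts`, the function names, and the example ids are hypothetical, and the real Avalon documents carry many more fields.

```python
def independent_docs(section_id: str, transcript_id: str, text: str) -> list[dict]:
    """Independent documents: two flat documents linked by a parent-id field."""
    return [
        {"id": section_id},
        # The transcript points at its parent, so it can be found with a join.
        {"id": transcript_id, "isPartOf_ssim": [section_id], "transcript_tsim": [text]},
    ]


def nested_doc(section_id: str, transcript_id: str, text: str) -> dict:
    """Nested documents: the section document carries the transcript as a child."""
    return {
        "id": section_id,
        # Hypothetical child field name. Solr populates _root_ and _nest_path_
        # at index time when the schema defines them (Solr 8 or later).
        "transcripts": [{"id": transcript_id, "transcript_tsim": [text]}],
    }


flat = independent_docs("section-1", "transcript-1", "Hello from California")
nested = nested_doc("section-1", "transcript-1", "Hello from California")
```

In the flat shape, a transcript can be re-indexed on its own; in the nested shape, the parent and its children are indexed together as one block.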

elynema commented 3 months ago

@cjcolvar I think I found an example where this doesn't seem to be working correctly.

See record: https://avalon-dev.dlib.indiana.edu/media_objects/g158bh28p

When I do a repository search for 'california' this record is the first result, even though it shows no hits in the metadata, transcript, or sections. When I search within the first transcript, I get no hits, even though the word 'California' is clearly in that transcript and it is a .vtt file. Any idea what might be going on? Could the weird file name for the transcript file be causing issues? Looks like the file name is displaying improperly in the transcript component.

cjcolvar commented 3 months ago

@elynema When I test that item, I'm also seeing it appear in search results with no hits in metadata/transcript/sections. Searching in the first transcript, I also get no hits, but the content search JSON response from the backend does have the one expected hit. So I think it is being indexed correctly, but there may be a bug in search result hit counting. In ramp, there are some changes around fixing hit counts and result finding in https://github.com/samvera-labs/ramp/pull/532 that will be merged soon.

elynema commented 3 months ago

@cjcolvar Why aren't we seeing any hits in transcript on the Blacklight side, though?

cjcolvar commented 3 months ago

I'm not sure. I'll look into it.

cjcolvar commented 3 months ago

I found the issue. In the subqueries we missed raising the row limit above the default of 10. For that item the first section wasn't getting returned, so its hit wasn't included in the count. I'll make a PR for this fix.
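A sketch of the kind of fix described here, assuming the subqueries are issued through Solr's `[subquery]` document transformer, where each named subquery takes its own `<name>.rows` parameter. The subquery name `sections`, the helper name, and the limit of 1000 are made up for illustration; the actual PR may do this differently.

```python
def with_subquery_rows(
    params: dict[str, str], subquery_names: list[str], rows: int = 1000
) -> dict[str, str]:
    """Return a copy of params with an explicit rows limit for each named subquery."""
    updated = dict(params)
    for name in subquery_names:
        # Without "<name>.rows", a subquery returns at most 10 documents.
        updated.setdefault(f"{name}.rows", str(rows))
    return updated


base = {
    "q": "california",
    "fl": "id,sections:[subquery]",
    # Fetch the documents whose parent id is the current result row's id.
    "sections.q": "{!terms f=isPartOf_ssim v=$row.id}",
}
fixed = with_subquery_rows(base, ["sections"])
```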

cjcolvar commented 3 months ago

We probably also need to test that content search works for more than 10 transcripts, but that is probably an edge case right now.

joncameron commented 2 months ago

👍

avalonmediasystem / avalon

Add transcript text to Solr #5730
