avalonmediasystem / avalon

Avalon Media System – Samvera Application
http://www.avalonmediasystem.org/
Apache License 2.0

[BUG] Content search error when vtt timestamps not properly formed #5886

Open elynema opened 2 weeks ago

elynema commented 2 weeks ago

Describe the bug

In this particular example, when I search for the term 'galaxy' within a caption that is marked to be treated as a transcript, I get a 500 response from the search service with a parsing error. I don't get the same error when I search for 'galaxy' in other records, which don't necessarily have any hits. I'm not sure whether the problem is with the search term or is related to marking a caption to be treated as a transcript.

image.png

To Reproduce

Steps to reproduce the behavior, including the results:

  1. Go to record: https://avalon-dev.dlib.indiana.edu/media_objects/fq977t794
  2. Make sure that the caption for the third section (amp-demo-amia-2023) is marked to treat as transcript
  3. Load the media object page, select the third section, and type 'galaxy' into the search box
  4. 500 error is returned (see screenshot above)

Expected behavior

This caption does contain the text 'galaxy', so the search service should return at least one result.

Environment (please complete the following information):

Release: 7.8

Done Looks Like:

elynema commented 2 weeks ago

Just updated this ticket as I'm seeing a similar result when I search for another word that is found in the caption ('workflow'). I don't seem to get an error if I search for a word not found in the caption.

elynema commented 1 week ago

The failure is on this line of code: the previous line performs a regex match, and that match is not getting any hits.

masaball commented 1 week ago

I looked at the caption file and it is malformed VTT. The cue timestamps in the file use the format h:mm:ss.ttt, while the spec defines them as hh:mm:ss.ttt. Following the spec, the regex we match against requires at least 2 digits in the hours place when hours are present, so that seems to be why this file is failing.
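
To illustrate the mismatch, here is a minimal Python sketch. It is not Avalon's actual code, and `STRICT_TIMESTAMP` and `is_strict_timestamp` are hypothetical names. The pattern is modelled on the WebVTT timestamp grammar, where hours are optional but must have at least two digits when present.

```python
import re

# Hypothetical pattern modelled on the WebVTT timestamp grammar:
# optional hours (two or more digits), then mm:ss.ttt.
STRICT_TIMESTAMP = re.compile(r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}")


def is_strict_timestamp(text: str) -> bool:
    """Return True only if the whole string is a spec-shaped timestamp."""
    return STRICT_TIMESTAMP.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("00:00:05.000", "00:05.000", "0:00:05.000"):
        print(sample, is_strict_timestamp(sample))
```

With a pattern of this shape, the one-digit-hour form is the only one of the three samples that is rejected, which matches the behavior described above.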

Probably the easiest way forward would be to add time cue normalization to the transcript indexing code. We already do time cue normalization when converting SRT to VTT, so we should just need to copy and slightly adjust that handling.
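
A rough sketch of what that normalization could look like, written in Python for illustration only. It is not the existing SRT-to-VTT handling, and `LOOSE_TIMESTAMP`, `normalize_timestamp`, and `normalize_cue_timing` are hypothetical names. It assumes minutes and seconds are already two digits and only the hours field needs padding or adding.

```python
import re

# Assumed loose shape: optional hours of any length, then mm:ss.ttt.
# The lookarounds keep the match from starting or ending mid-number.
LOOSE_TIMESTAMP = re.compile(r"(?<![\d:])(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})(?!\d)")


def normalize_timestamp(match: "re.Match[str]") -> str:
    """Rebuild one timestamp as hh:mm:ss.ttt, padding or adding the hours."""
    hours, minutes, seconds, millis = match.groups()
    return f"{int(hours or 0):02d}:{minutes}:{seconds}.{millis}"


def normalize_cue_timing(line: str) -> str:
    """Normalize timestamps on cue timing lines; leave other lines untouched."""
    if "-->" not in line:
        return line
    return LOOSE_TIMESTAMP.sub(normalize_timestamp, line)


if __name__ == "__main__":
    print(normalize_cue_timing("0:00:05.000 --> 0:00:07.500"))
```

Running each line of a transcript through `normalize_cue_timing` before indexing would leave cue text alone and rewrite only the lines that contain `-->`.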

We may also want to add better error handling to the search function, but maybe that should be a separate ticket?
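
As a sketch of the kind of error handling meant here (Python, illustrative only; `CUE_TIMING`, `parse_cue_times`, and `search_hits` are hypothetical and do not reflect the real search function), the idea is to treat a failed regex match as a skippable cue rather than letting it raise:

```python
import re
from typing import List, Optional, Tuple

# Hypothetical strict cue timing pattern: hours optional, two or more digits.
CUE_TIMING = re.compile(
    r"^\s*((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


def parse_cue_times(line: str) -> Optional[Tuple[str, str]]:
    """Return (start, end), or None when the timing line does not match."""
    match = CUE_TIMING.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def search_hits(cues: List[Tuple[str, str]], term: str):
    """Collect hits for term; record unparseable cues instead of raising."""
    hits, skipped = [], []
    for timing_line, text in cues:
        if term.lower() not in text.lower():
            continue
        times = parse_cue_times(timing_line)
        if times is None:
            skipped.append(timing_line)
            continue
        hits.append({"start": times[0], "end": times[1], "text": text})
    return hits, skipped
```

In this sketch the timing line is only parsed for cues that contain the search term, so a term that does not appear in the caption never reaches the malformed timestamps.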

elynema commented 1 week ago

@masaball Were you able to get this to work? I updated that vtt file so that it has 00 for the hour in each line and added it back as a caption, and I'm still having the same issue with content search.

Also note that the vtt spec does not require the hours digits if the time is <60 minutes. Does this code accommodate that?

masaball commented 1 week ago

The regex does treat the hours place as optional. I had only looked at the VTT and seen that it was malformed; I had not looked deeper or done any experimenting yet. I'll take a look at what is going on, because it's weird that fixing the formatting had no effect.

masaball commented 1 week ago

@elynema I downloaded the new file and uploaded it locally, and it worked on my local instance. I then ran the Solr search query directly through the Avalon-dev console. It looks like the original chunked file is still in the index, which causes the error to keep triggering. We need to add handling to remove transcripts from the index when they are deleted.

elynema commented 1 week ago

@masaball Want me to create a new ticket for removing transcripts from the index if deleted?

masaball commented 1 week ago

Yes, thank you. That makes sense to have as its own issue.

elynema commented 1 week ago

When we upload caption VTTs, we only check that the file is a VTT file; for transcripts we don't do any validation. The transcript component validates the VTT and shows a message if it is invalid.

Future work to pursue might be to validate these file formats once on upload and let the user know if there is a problem.
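
A minimal sketch of what upload-time validation could report, in Python for illustration. `validate_vtt` is a hypothetical helper, not an existing Avalon or Ramp function, and it checks only the `WEBVTT` header and the shape of cue timing lines, not the full spec.

```python
import re

# Spec-shaped timestamp: optional hours (two or more digits), then mm:ss.ttt.
TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
# A timing line is two timestamps around "-->", optionally followed by settings.
TIMING_LINE = re.compile(
    r"^" + TIMESTAMP + r"[ \t]+-->[ \t]+" + TIMESTAMP + r"(?:[ \t].*)?$"
)


def validate_vtt(text: str):
    """Return a list of (line_number, message) problems; empty means none found."""
    problems = []
    lines = text.splitlines()
    if not lines or not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        problems.append((1, "missing WEBVTT header"))
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if "-->" in stripped and not TIMING_LINE.match(stripped):
            problems.append((number, "malformed cue timing: " + stripped))
    return problems
```

Returning line numbers with each problem would make it possible to tell the user exactly which cues need fixing.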

elynema commented 1 week ago

Ramp #537 will update validation in Ramp so that transcripts with timestamps like this are not displayed to the user. Enhance the regex to be less strict so that users can still search this file via content search.
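
One possible shape for a less strict pattern, sketched in Python. This is an assumption about what "less strict" could mean, not the change that was made, and `LENIENT_TIMESTAMP` and `timestamp_to_seconds` are hypothetical names. It accepts an hours field of one or more digits while still requiring two-digit minutes and seconds.

```python
import re
from typing import Optional

# Assumed lenient pattern: hours optional and of any length (1+ digits).
LENIENT_TIMESTAMP = re.compile(r"(?:(\d+):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def timestamp_to_seconds(text: str) -> Optional[float]:
    """Convert h:mm:ss.ttt, hh:mm:ss.ttt, or mm:ss.ttt to seconds, else None."""
    match = LENIENT_TIMESTAMP.fullmatch(text.strip())
    if match is None:
        return None
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
```

Under this pattern `0:01:05.500`, `00:01:05.500`, and `01:05.500` all resolve to the same offset, so a file with one-digit hours would remain searchable.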