Open elynema opened 2 weeks ago
Just updated this ticket as I'm seeing a similar result when I search for another word that is found in the caption ('workflow'). I don't seem to get an error if I search for a word not found in the caption.
It is failing on this line of code because the previous line performs a regex match, and that match is not getting any hits.
I looked at the caption file and it is malformed VTT. The cues in the file use the format h:mm:ss.ttt, while the spec defines cue timestamps as hh:mm:ss.ttt. Following the spec, the regex being matched against requires at least 2 digits in the hours place when hours are present, so that seems to be why this file is failing.
Probably the easiest way forward would be to add time cue normalization to the transcript indexing code. We already do time cue normalization when converting SRT to VTT, so we should just need to copy and slightly adjust that handling.
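For illustration, the kind of normalization proposed here could look roughly like the sketch below. It is written in Python purely as a language-neutral example and is not the existing SRT-to-VTT handling; the name `normalize_cue_timing` and the regex are hypothetical. It assumes that only cue timing lines contain `-->` and that the only defect to repair is a single-digit hours field.

```python
import re

# Hypothetical pattern: optional hours (one or more digits), then mm:ss.ttt.
_TIMESTAMP = re.compile(r"\b(?:(\d+):)?([0-5]\d):([0-5]\d)\.(\d{3})\b")


def normalize_cue_timing(line):
    """Zero-pad the hours field to two digits on cue timing lines.

    Lines without '-->' (cue text, headers, blank lines) are returned unchanged.
    """
    if "-->" not in line:
        return line

    def pad_hours(match):
        hours = match.group(1)
        if hours is None:
            # mm:ss.ttt is already valid VTT, so leave it alone.
            return match.group(0)
        return f"{int(hours):02d}:{match.group(2)}:{match.group(3)}.{match.group(4)}"

    return _TIMESTAMP.sub(pad_hours, line)


print(normalize_cue_timing("0:00:01.000 --> 0:00:04.500"))
```

Running the normalization at indexing time, rather than rewriting the stored file, would leave the uploaded caption untouched; whether that is the right trade-off is a design decision for the ticket.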
We may also want to add better error handling to the search function but maybe that should be a separate ticket?
@masaball Were you able to get this to work? I updated that vtt file so that it has 00 for the hour in each line and added it back as a caption, and I'm still having the same issue with content search.
Also note that the vtt spec does not require the hours digits if the time is <60 minutes. Does this code accommodate that?
The regex does treat the hours place as optional. I had only looked at the VTT and seen that it was malformed; I had not looked deeper at it or done any experimenting yet. But I'll take a look at what is going on, because it's weird that fixing the formatting had no effect.
@elynema I downloaded the new file and uploaded it locally and it worked on my local instance. I then ran the solr search query directly through the Avalon-dev console and it looks like what is happening is that the original chunked file is still in the index which is causing the error to still trigger. Looks like we need to add in handling to remove transcripts from the index if they are deleted.
@masaball Want me to create a new ticket for removing transcripts from the index if deleted?
Yes, thank you. That makes sense to have as its own issue.
When we upload caption VTTs, we only check that the file is a VTT file; for transcripts we don't do any validation. The transcript component validates the VTT and will show a message if it is invalid.
Future work to pursue might be to validate these file formats once on upload and let user know if there is a problem.
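As a rough idea of what upload-time validation might check, here is a small Python sketch. It is an assumption-laden illustration, not Avalon's or Ramp's validator: `find_vtt_problems` is a hypothetical name, the header check is simplified to a prefix test, and only cue timing lines are inspected.

```python
import re

# Strict timestamp per the VTT spec: hours optional, but at least two digits if present.
STRICT_TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"
TIMING_LINE = re.compile(
    rf"^{STRICT_TIMESTAMP}[ \t]+-->[ \t]+{STRICT_TIMESTAMP}(?:[ \t].*)?$"
)


def find_vtt_problems(text):
    """Return a list of human-readable problems found in a VTT document."""
    lines = text.splitlines()
    problems = []
    # Simplified header check: the first line should start with WEBVTT.
    if not lines or not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        problems.append("line 1: missing WEBVTT header")
    for number, line in enumerate(lines, start=1):
        # Any line containing '-->' is treated as a cue timing line.
        if "-->" in line and not TIMING_LINE.match(line):
            problems.append(f"line {number}: malformed cue timing: {line.strip()}")
    return problems
```

The returned messages could be surfaced to the user at upload so that a file like the one in this ticket is flagged before it reaches the index.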
Ramp #537 will update validation in Ramp so that transcripts with timestamps like this will not display to user. Enhance regex to be less strict so that users searching via content search can still search this file.
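A less strict match might look something like the following Python sketch. This is not the regex used in the search code; `LENIENT_TIMESTAMP` and `timestamp_to_seconds` are hypothetical names. It assumes the only relaxation needed is accepting one or more hour digits, and it returns `None` on a non-match so the caller can skip the cue instead of raising.

```python
import re

# Hypothetical lenient pattern: hours optional, any number of hour digits accepted.
LENIENT_TIMESTAMP = re.compile(r"^(?:(\d+):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def timestamp_to_seconds(timestamp):
    """Convert h:mm:ss.ttt, hh:mm:ss.ttt, or mm:ss.ttt to seconds.

    Returns None when the timestamp does not match, rather than raising.
    """
    match = LENIENT_TIMESTAMP.match(timestamp.strip())
    if match is None:
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2))
    seconds = int(match.group(3))
    millis = int(match.group(4))
    return hours * 3600 + minutes * 60 + seconds + millis / 1000
```

Checking for `None` before using the result is one way to avoid the 500 response when a timing line cannot be parsed.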
Describe the bug
In this particular example, when I submit a search with the term 'galaxy' within a caption marked to treat as a transcript, I get a 500 response from the search service with a parsing error. I don't seem to get the same error when I search for the term 'galaxy' in other records, which don't necessarily have any hits. I'm not sure if there is a problem with the search term, or if it is related to marking a caption to treat as a transcript.
To Reproduce
Steps to reproduce the behavior, including the results:
Expected behavior
This caption does have the text 'galaxy', so you should get at least one response from the search service.
Environment (please complete the following information):
Release: 7.8
Done Looks Like: