Open evamaxfield opened 3 years ago
Or, create an average cue block vector and an average discussion block vector, then do sequence alignment over the block vectors, using cosine distance as the distance measure.
Get the average feature vector for the intro cue and the average feature vector for the outro cue, and create a vector for the minutes item itself to act as the discussion block instead of just averaging all other non-cue vectors.
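A minimal sketch of the averaged-vector comparison, assuming sentence embeddings are already available as plain lists of floats. The toy 3-d vectors and the names `mean_vector` and `cosine_distance` are illustrative assumptions, not part of any existing code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-d embeddings standing in for real sentence embeddings (assumption).
intro_cue_vectors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
avg_intro_cue = mean_vector(intro_cue_vectors)

# The minutes item's own vector acts as the "discussion" reference.
minutes_item_vector = [0.0, 1.0, 0.0]

# A transcript block to score against both references.
candidate_block = [0.9, 0.1, 0.0]
d_cue = cosine_distance(candidate_block, avg_intro_cue)
d_item = cosine_distance(candidate_block, minutes_item_vector)
```

Here `d_cue` comes out smaller than `d_item`, so this candidate block would be treated as closer to an intro cue than to the discussion of the minutes item.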
We may be able to fine-tune a sentence transformers model: https://www.sbert.net/docs/training/overview.html#loss-functions
Build a classifier that predicts whether a block is a cue block or a discussion block.
We can test different block sizes (a moving window of 1, 2, 3, 4, 5, 10 sentences, etc.). After finding the block size whose classifier performs best, use that trained classifier to generate a signal that is `0` for discussion blocks and `1` for cue blocks. So a transcript's generated sequence from the classifier may look something like the bottom sequence in the image.
`(1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1)`
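The moving-window step could be sketched roughly as follows. The trained classifier is not specified here, so `keyword_classifier` is a purely hypothetical stand-in, and the example sentences are made up for illustration.

```python
def sliding_blocks(sentences, block_size):
    """Return every run of `block_size` consecutive sentences (stride 1)."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    last_start = len(sentences) - block_size
    return [sentences[i:i + block_size] for i in range(last_start + 1)]


def cue_signal(sentences, block_size, is_cue_block):
    """Map each block to 1 (cue block) or 0 (discussion block)."""
    blocks = sliding_blocks(sentences, block_size)
    return [1 if is_cue_block(block) else 0 for block in blocks]


def keyword_classifier(block):
    """Hypothetical stand-in for the trained cue/discussion classifier."""
    text = " ".join(block).lower()
    return "agenda item" in text or "motion carries" in text


# Made-up transcript sentences (assumption, for illustration only).
transcript = [
    "We now move to agenda item one.",
    "I have concerns about the budget.",
    "The budget covers parks.",
    "The motion carries.",
    "Next is agenda item two.",
]

signal_size_1 = cue_signal(transcript, 1, keyword_classifier)
signal_size_2 = cue_signal(transcript, 2, keyword_classifier)
```

Swapping `keyword_classifier` for a real model and looping over the candidate block sizes would give one signal per block size to compare.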
The top sequence is created by assuming that there will always be an "intro cue", some discussion, and then an "outro cue". So generate this sequence as `(1, 0, 1) * M`, where `M` is the number of minutes items. I.e., for three minutes items the generated sequence is `(1, 0, 1, 1, 0, 1, 1, 0, 1)`.
Finally, perform dynamic time warping / sequence alignment on these two sequences to find the best path.
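As a rough sketch of the alignment step, here is a textbook dynamic time warping implementation in plain Python, using absolute difference as the local cost. The function names are assumptions, and a real implementation might prefer an existing DTW library or a different step pattern.

```python
def expected_sequence(num_minutes_items):
    """Template signal: intro cue, discussion, outro cue per minutes item."""
    return [1, 0, 1] * num_minutes_items


def dtw_path(a, b):
    """Classic DTW; returns (total cost, list of (index_in_a, index_in_b))."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance only a
                cost[i][j - 1],      # advance only b
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


template = expected_sequence(3)
predicted = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
total_cost, path = dtw_path(template, predicted)
```

Each `(i, j)` pair in `path` maps template position `i` to classifier-signal position `j`, so the positions aligned to a given `(1, 0, 1)` triple indicate where that minutes item starts and ends in the transcript.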
Evaluate overall performance with Pk / WindowDiff.
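For reference, a minimal sketch of both metrics in their common windowed form, assuming each segmentation is encoded as a list where `1` marks a boundary after that sentence. The window size `k` is usually set to about half the average reference segment length; the example boundaries below are made up. NLTK also ships `nltk.metrics.segmentation.pk` and `windowdiff`, which may be preferable in practice.

```python
def _validate(reference, hypothesis, k):
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the sequence length")


def pk(reference, hypothesis, k):
    """Fraction of windows where only one segmentation contains a boundary."""
    _validate(reference, hypothesis, k)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_has_boundary = any(reference[i:i + k])
        hyp_has_boundary = any(hypothesis[i:i + k])
        errors += ref_has_boundary != hyp_has_boundary
    return errors / n_windows


def window_diff(reference, hypothesis, k):
    """Fraction of windows where the boundary counts differ."""
    _validate(reference, hypothesis, k)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        errors += sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    return errors / n_windows


# Made-up boundary indicators for a 10-sentence transcript (assumption).
reference = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
hypothesis = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
pk_score = pk(reference, hypothesis, k=3)
wd_score = window_diff(reference, hypothesis, k=3)
```

Both scores are error rates, so lower is better and `0.0` means the two segmentations agree in every window.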