What
LongFormDataset takes a single track and a single transcription as input. It is intended for long audio tracks that have to be cut into pieces before we can process them with a transformer model, since attention is quadratic in its input length.
Why
Processing long-form audio requires cutting it into smaller chunks that we can feed through the forward pass for inference. We will probably want a single utterance per chunk, no longer than about 10 seconds, and each utterance should ideally be surrounded by some silence in the original recording.
Here's the plan. We will process the data in two passes. In the first pass, we cut the audio into uniformly long, overlapping windows and store the resulting logits for each batch. Once the first pass is complete, we patch all of the segments together into a single logical result for the entire track. One thing to note: Wav2Vec has to fill the context windows of its conv stacks, so the overlap between windows should be exactly long enough that we can throw these boundary artefacts away. The result should then be exactly the same as passing the entire track through the model, if that were possible.
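A minimal sketch of the windowing and stitching step. The window length and context duration here are hypothetical placeholders; the real overlap has to match the receptive field of the conv stack, and 16 kHz is an assumed input rate:

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed input rate for Wav2Vec-style models
WINDOW_S = 30.0       # hypothetical window length in seconds
CONTEXT_S = 1.0       # hypothetical context; must cover the conv receptive field

def cut_windows(audio: np.ndarray) -> list[tuple[int, np.ndarray]]:
    """Cut `audio` into uniformly long windows overlapping by 2 * CONTEXT_S,
    so CONTEXT_S of context can be discarded on each side after inference."""
    window = int(WINDOW_S * SAMPLE_RATE)
    context = int(CONTEXT_S * SAMPLE_RATE)
    stride = window - 2 * context
    return [(start, audio[start:start + window])
            for start in range(0, max(len(audio) - 2 * context, 1), stride)]

def stitch(logit_chunks: list[np.ndarray], frames_per_context: int) -> np.ndarray:
    """Patch per-window logits back into one array for the whole track by
    dropping the context frames that only exist to warm up the conv stack."""
    pieces = []
    for i, chunk in enumerate(logit_chunks):
        lo = 0 if i == 0 else frames_per_context
        hi = len(chunk) if i == len(logit_chunks) - 1 else len(chunk) - frames_per_context
        pieces.append(chunk[lo:hi])
    return np.concatenate(pieces)
```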
Next, we segment the transcript into sentences. Here we assume that sentences are short enough and that transcripts include meaningful punctuation. We then argmax the logits to obtain a target character sequence for the audio, and use a shingling method to match each sentence against this target. This gives us a candidate region for each sentence.
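A rough sketch of the shingling match, assuming the argmax decode has already produced a character string for the whole track. The shingle size, window slack, and Jaccard scoring are all arbitrary choices for illustration:

```python
def shingles(text: str, k: int = 4) -> set[str]:
    """Character k-gram shingles of a string (k=4 is an arbitrary choice)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def best_candidate(sentence: str, decoded: str, slack: float = 1.5) -> tuple[int, int]:
    """Slide a window of roughly the sentence's length over the argmax-decoded
    track text and return the (start, end) character span whose shingle set is
    most similar to the sentence's. `slack` widens the window to allow for
    insertions and deletions in the greedy decode."""
    target = shingles(sentence)
    width = max(int(len(sentence) * slack), 1)
    best, best_score = (0, min(width, len(decoded))), -1.0
    for start in range(0, max(len(decoded) - width, 0) + 1, max(width // 4, 1)):
        window = decoded[start:start + width]
        cand = shingles(window)
        score = len(target & cand) / max(len(target | cand), 1)  # Jaccard similarity
        if score > best_score:
            best, best_score = (start, start + len(window)), score
    return best
```

The winning character span can then be mapped back to frame positions (and hence audio positions) via the frame indices the argmax decode was taken from.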
Now we do a second pass, which runs the actual alignment algorithm for each (sentence, candidate) pair. Instead of doing the forward pass again, we read the precomputed logits from disk. This way, we only do a single forward pass over all of the data.
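A sketch of the second pass, assuming the stitched logits were saved with `np.save` so they can be memory-mapped, and with `align` as a placeholder for the alignment routine, which is not specified here:

```python
import numpy as np

def second_pass(logits_path: str, pairs: list[tuple[str, tuple[int, int]]]):
    """Run alignment per (sentence, candidate) pair against precomputed logits.
    `pairs` holds (sentence, (from_frame, to_frame)); no forward pass happens here."""
    logits = np.load(logits_path, mmap_mode="r")  # memory-map: no full read into RAM
    for sentence, (lo, hi) in pairs:
        yield sentence, align(sentence, np.asarray(logits[lo:hi]))

def align(sentence: str, logit_slice: np.ndarray):
    """Placeholder for the actual alignment algorithm (e.g. CTC forced alignment)."""
    raise NotImplementedError
```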
LongFormDataset is a piece we will need to make this work. It will be used to
a) Cut a long track into pieces for inference
b) Store the (sentence, candidate) data as a dataset so we can run alignment on it
Acceptance Criteria
[ ] It's possible to construct a LongFormDataset with a 30-minute track and a long transcript
[ ] It's possible to pass in a cut-map of [id, (from_audio, to_audio), (from_transcript, to_transcript)] tuples
[ ] Iterating through the dataset yields slices as defined in the cut-map
[ ] It's possible to store the dataset along with a cut-map
[ ] It's possible to read the dataset from disk along with its cut-map
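A minimal sketch of the interface these criteria imply; all names and the on-disk layout are assumptions, not a fixed design:

```python
import json
import numpy as np
from pathlib import Path

class LongFormDataset:
    """One long track plus one transcript, sliced by an explicit cut-map of
    (id, (from_audio, to_audio), (from_transcript, to_transcript)) tuples."""

    def __init__(self, audio: np.ndarray, transcript: str, cut_map: list):
        self.audio, self.transcript, self.cut_map = audio, transcript, cut_map

    def __len__(self):
        return len(self.cut_map)

    def __iter__(self):
        # Yield the slices exactly as defined in the cut-map.
        for id_, (a0, a1), (t0, t1) in self.cut_map:
            yield id_, self.audio[a0:a1], self.transcript[t0:t1]

    def save(self, path: str):
        root = Path(path)
        root.mkdir(parents=True, exist_ok=True)
        np.save(root / "audio.npy", self.audio)
        (root / "transcript.txt").write_text(self.transcript)
        (root / "cut_map.json").write_text(json.dumps(self.cut_map))

    @classmethod
    def load(cls, path: str) -> "LongFormDataset":
        root = Path(path)
        return cls(
            np.load(root / "audio.npy"),
            (root / "transcript.txt").read_text(),
            [(i, tuple(a), tuple(t))
             for i, a, t in json.loads((root / "cut_map.json").read_text())],
        )
```

With this shape, one class covers both uses above: the first-pass cut-map comes from the uniform windowing, and the second-pass cut-map from the (sentence, candidate) matching.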