Hi Jerome!
Let me try to address your questions:
Let me know if I can be of further help :)
Hi, thanks for the pointer. I later found that, when obtaining the raw frame boundaries, I had forgotten to account for the fact that each discovered segment x in the second level corresponds to a frame length of y. Accounting for that gives segmentation results that at least spread across the full 128 frames of a 20480-sample raw audio segment.
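Concretely, the mapping I mean is roughly the following (a sketch with placeholder names, not the exact code from the gist linked below): the frame-level end of each second-level segment is just the running sum of the per-segment frame counts.

```python
import torch

def segment_ends_in_frames(segment_lens: torch.Tensor) -> torch.Tensor:
    # segment_lens[k] = number of 10 ms frames covered by the k-th
    # mean-reduced (second-level) segment of one chunk/utterance.
    # The cumulative sum gives the frame index at which each segment ends;
    # the last entry should equal 128 for a 20480-sample chunk at 16 kHz.
    return torch.cumsum(segment_lens, dim=0)

# e.g. segments of 30, 50, 28 and 20 frames -> boundaries at frames 30, 80, 108, 128
print(segment_ends_in_frames(torch.tensor([30, 50, 28, 20])))
```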
But unfortunately, I find that the word boundary precision and recall scores are only around 0.2 on the forced-aligned LibriSpeech dev set (https://github.com/CorentinJ/librispeech-alignments) as well as on the Buckeye test set, with an mACPC model trained for a total of 50 epochs on ls100. The boundary scores were calculated somewhat differently: I read each audio file individually (instead of using the data loader, which reads in all the audio files and then cuts them into chunks), split it into a series of 20480-sample chunks, and fed each chunk into the model to obtain the boundaries. However, for each chunk I only took the boundaries returned by the model and did not prepend 0 or append 128 to them. In other words, I only obtained the "internal" boundaries of each chunk, concatenated them together, and evaluated them against the ground-truth boundaries (also without manually adding start/end boundaries or between-chunk boundaries).
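For clarity, the scoring step looks roughly like this (a sketch with made-up helper names and a hypothetical ±2-frame, i.e. 20 ms, tolerance; the gist linked below has the actual code):

```python
import numpy as np

def chunk_boundaries_to_utterance(per_chunk_bounds, frames_per_chunk=128):
    # Concatenate the "internal" boundaries of consecutive 20480-sample chunks
    # into utterance-level frame indices (no 0/128 endpoints are added).
    all_bounds = []
    for i, bounds in enumerate(per_chunk_bounds):
        all_bounds.extend(b + i * frames_per_chunk for b in bounds)
    return np.asarray(all_bounds)

def boundary_precision_recall(pred, ref, tol=2):
    # Greedy one-to-one matching of predicted vs. reference boundaries within
    # `tol` frames (2 frames = 20 ms at a 10 ms hop). Returns (precision, recall).
    unmatched_ref = list(ref)
    hits = 0
    for b in pred:
        match = next((r for r in unmatched_ref if abs(r - b) <= tol), None)
        if match is not None:
            hits += 1
            unmatched_ref.remove(match)
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(unmatched_ref) + hits, 1)
    return precision, recall
```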
I tried the following ways to obtain the second-level (between-segment) boundaries:
I usually see (2) perform better, but it still cannot achieve precision or recall scores anywhere above 0.25 for the internal boundaries.
Edit: The word segmentation code is here: https://gist.github.com/JeromeNi/13c316d50e73686e20ec97ed8bb382c1
I have modified the class `MultiLevelModel(nn.Module)` in `models.py`. Basically, I uncommented

```python
segmentLens = compressMatrices.sum(-1)
```

and returned this variable so that I know how many frames each mean-reduced segment corresponds to.
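To make sure I am reading this right, here is a tiny illustration (made-up tensors, not the actual models.py code) of why summing the compress matrices over the frame axis gives per-segment frame counts, assuming they hold 0/1 membership indicators:

```python
import torch

num_frames, dim = 8, 4
features = torch.randn(num_frames, dim)          # frame-level encoder outputs (made up)
segment_ends = [3, 8]                            # two segments: frames 0-2 and 3-7

compressMatrices = torch.zeros(len(segment_ends), num_frames)
start = 0
for row, end in enumerate(segment_ends):
    compressMatrices[row, start:end] = 1.0       # 1 where the frame belongs to the segment
    start = end

segmentLens = compressMatrices.sum(-1)           # tensor([3., 5.]) -> frames per segment
compressed = compressMatrices @ features / segmentLens.unsqueeze(-1)  # mean-reduced segments
print(segmentLens, compressed.shape)
```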
We didn't test it on the LibriSpeech dev set, but the results on Buckeye should agree. Some things that come to mind that might explain the discrepancy:
I haven't looked at your code yet, but the procedure you describe sounds solid. I'll take a look as soon as I have some time.
Hello, I am trying to use this repo to replicate the word segmentation results here (https://arxiv.org/pdf/2110.15909.pdf). I have some questions about this:
Thanks!