chorowski-lab / hCPC

Implementation of multi-level Contrastive Predictive Coding (CPC) methods
MIT License

Word segmentation with mACPC? #1

Open JeromeNi opened 2 years ago

JeromeNi commented 2 years ago

Hello, I am trying to use this repo to replicate the word segmentation results in https://arxiv.org/pdf/2110.15909.pdf. I have some questions:

  1. It seems that the mACPC model uses a simple cosine-dissimilarity segmenter at both the frame and segment levels, and in the code the prominence threshold of that segmenter is fixed to 0.05. Was that the value used when training the mACPC for 2110.15909? (My reading of the segmenter is sketched after this list.)
  2. Is there a code file for word segmentation? I tried to modify segmentation.py, but for some reason a trained mACPC model only outputs word boundaries that are heavily concentrated in the first 20% of the input segment, and I cannot figure out what is wrong.
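
For reference, my current reading of the segmenter in question 1 is roughly the following. This is a minimal sketch, not the repo's code: scipy's find_peaks and the exact dissimilarity computation are my assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def cosine_dissimilarity_boundaries(feats, prominence=0.05):
    """Sketch of a prominence-based segmenter: place a boundary
    wherever the cosine dissimilarity between consecutive frame
    features forms a sufficiently prominent peak.

    feats: (T, D) array of frame-level representations.
    Returns frame indices hypothesized to start a new segment.
    """
    a, b = feats[:-1], feats[1:]
    cos_sim = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    dissim = 1.0 - cos_sim  # high where the representation changes abruptly
    peaks, _ = find_peaks(dissim, prominence=prominence)
    return peaks + 1  # the boundary falls between frames t and t+1
```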

Thanks!

tiagoCuervo commented 2 years ago

Hi Jerome!

Let me try to address your questions:

  1. Yes, 0.05 was used for training
  2. This code was refactored for the new paper (https://arxiv.org/abs/2206.02211) and doesn't include the word segmentation eval, but there is such code here: https://github.com/chorowski-lab/CPC_audio/blob/speaker-normalized/cpc/eval/segmentation.py

Let me know if I can be of further help :)

JeromeNi commented 2 years ago

Hi, thanks for the pointer. I later found that, when obtaining the raw frame boundaries, I had forgotten to account for the fact that each discovered segment at the second level corresponds to a variable number of frames. Accounting for that gives segmentation results that at least spread across all 128 frames of a 20480-sample raw audio segment.
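
Concretely, the fix amounts to expanding segment-level boundary indices into frame indices via the per-segment frame counts. A minimal sketch, assuming boundaries indexed by segment and a vector of per-segment lengths (segment_lens here plays the role of segmentLens below):

```python
import numpy as np

def segment_to_frame_boundaries(seg_boundaries, segment_lens):
    """Map boundaries given as second-level segment indices to frame
    indices, using the number of frames each segment spans.

    seg_boundaries: indices into the sequence of segments.
    segment_lens:   frames covered by each segment.
    """
    # Frame index at which each segment starts: cumulative sum of
    # lengths, shifted right by one.
    seg_starts = np.concatenate([[0], np.cumsum(segment_lens)[:-1]])
    return seg_starts[np.asarray(seg_boundaries)]
```

With this, a boundary predicted at, say, segment index 10 lands at the frame where that segment actually starts, instead of at frame 10.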

But unfortunately, I find that the word boundary precision and recall scores are only around 0.2 on the forced-aligned dev set of LibriSpeech (https://github.com/CorentinJ/librispeech-alignments) as well as on the Buckeye test set, with an mACPC model trained for a total of 50 epochs on ls100.

The boundary scores were computed somewhat differently from the repo: I read each audio file individually (instead of using the data loader, which reads all the audio files and then cuts them into chunks), split it into a series of length-20480 chunks, and fed each chunk into the model to obtain boundaries. For each chunk I took only the boundaries the model produced, without prepending 0 or appending 128. In other words, I collected only the "internal" boundaries of each chunk, concatenated them, and evaluated them against the ground-truth boundaries (also without manually adding start/end or between-chunk boundaries). The scoring itself is sketched below.
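
For completeness, the scoring works roughly like this. A sketch, not the repo's evaluation code; the tolerance value and the greedy matching are my own choices:

```python
def boundary_precision_recall(pred, ref, tolerance=2):
    """Greedy one-to-one matching of predicted and reference frame
    boundaries: a prediction counts as a hit if it lies within
    `tolerance` frames of a not-yet-matched reference boundary.
    """
    pred, ref = sorted(pred), sorted(ref)
    matched = set()
    hits = 0
    for p in pred:
        for j, r in enumerate(ref):
            if j not in matched and abs(p - r) <= tolerance:
                matched.add(j)
                hits += 1
                break
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(ref), 1)
    return precision, recall
```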

I tried the following ways of obtaining the second-level (between-segment) boundaries:

  1. Using the segment encoder's output
  2. Using the autoregressor's output
  3. As in segmentation.py, obtaining a one-step-in-the-future prediction from the autoregressor's output and computing the similarity between that prediction and the autoregressor's output one step ahead.

I usually see (2) perform better, but even then precision and recall stay below 0.25 for the internal boundaries. (A sketch of how I implement (3) is included below.)
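
For (3), what I do is essentially the following. A sketch with PyTorch; the predictor argument stands in for the actual one-step prediction head of the mACPC model, whose name I'm leaving as a placeholder:

```python
import torch.nn.functional as F

def one_step_prediction_dissimilarity(contexts, predictor):
    """Sketch of option (3): predict the next context vector from the
    current one, then measure how poorly the prediction matches the
    actual next context. Peaks of the dissimilarity suggest boundaries.

    contexts:  (T, D) tensor of autoregressor outputs.
    predictor: one-step-ahead prediction head (placeholder name).
    """
    preds = predictor(contexts[:-1])   # predictions for steps 1..T-1
    targets = contexts[1:]             # actual contexts at steps 1..T-1
    sim = F.cosine_similarity(preds, targets, dim=-1)
    return 1.0 - sim  # high where prediction fails, i.e. near boundaries
```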

Edit: The word segmentation code is here: https://gist.github.com/JeromeNi/13c316d50e73686e20ec97ed8bb382c1

I have modified the MultiLevelModel(nn.Module) class in models.py. Basically, I uncommented

```python
segmentLens = compressMatrices.sum(-1)
```

and returned this variable so that I know how many frames each mean-reduced segment corresponds to.
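
For anyone else trying this: compressMatrices encodes the frame-to-segment assignment, so summing over the frame dimension gives the number of frames in each segment. A toy illustration; the binary row-per-segment layout is my assumption from reading models.py:

```python
import torch

# Toy example: 3 segments over 6 frames; row s marks which frames
# belong to segment s (assumed binary membership).
compressMatrices = torch.tensor([
    [1., 1., 0., 0., 0., 0.],  # segment 0 covers frames 0-1
    [0., 0., 1., 1., 1., 0.],  # segment 1 covers frames 2-4
    [0., 0., 0., 0., 0., 1.],  # segment 2 covers frame 5
])

segmentLens = compressMatrices.sum(-1)  # tensor([2., 3., 1.])
```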

tiagoCuervo commented 2 years ago

We didn't test on the LibriSpeech dev set, but the results on Buckeye should agree. A few things come to mind that might explain the discrepancy:

  1. We used longer chunks for the word-segmentation evaluations in order to have more words per sample. If I recall correctly, they were 4 times as long, i.e. 81920 samples (I'll double-check that as soon as I can). This might affect the peak detector.
  2. Data pre-processing. We trimmed non-speech events in Buckeye to at most 20 ms, and we also removed incorrectly labeled words; for instance, there were cases in which a whole sentence had a single word label/boundary. I'll contact the person who did Buckeye's pre-processing, since as of now I don't have that code at hand, but the trimming is sketched schematically below.
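
Schematically, the trimming amounts to something like this. An illustrative sketch only, not the actual pre-processing code; the interval format and labels are assumptions:

```python
def trim_nonspeech(intervals, max_gap=0.020):
    """Cap every non-speech interval at `max_gap` seconds (20 ms) and
    shift all later timestamps left accordingly.

    intervals: list of (start, end, label) tuples in seconds, with
               labels like '' or 'sil' marking non-speech (assumed
               format, not Buckeye's actual annotation scheme).
    """
    out, shift = [], 0.0
    for start, end, label in intervals:
        start, end = start - shift, end - shift
        if label in ("", "sil", "noise") and end - start > max_gap:
            cut = (end - start) - max_gap
            end -= cut       # shorten this silence to max_gap
            shift += cut     # everything after moves earlier
        out.append((start, end, label))
    return out
```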

I haven't looked at your code yet, but the procedure you describe sounds solid. I'll take a look as soon as I have some time.