fastaudio / fastai_audio

[DEPRECATED] 🔊️ Audio with fastaiv1
MIT License

Certain Remove_Silences will result in a longer audio clip than the original #25

Closed kevinbird15 closed 5 years ago

kevinbird15 commented 5 years ago

https://github.com/mogwai/fastai_audio/blob/e8499b72876292b5013f2382a7eef7317aa674e8/audio/transform.py#L154-L167

I'm still debugging this issue, but wanted to point it out in case anybody else has fixed it. I'll update once I figure it out. I suspect the problem is the `b+padding`; it should probably just be `b`, but I'm still trying to work out what the `+padding` is supposed to do and whether there is another way to achieve it.

kevinbird15 commented 5 years ago

[screenshot]

kevinbird15 commented 5 years ago

This is the code I'm investigating at the moment:

    elif remove_type == "all":
        return [torch.cat([actual[(max(a-padding,0)):(min(b+padding,len(actual)))] for (a, b) in splits])]

kevinbird15 commented 5 years ago

This appears to fix the issue. I'm not sure it's the best way to do it, but I do think I understand the problem better than I did initially.

    elif remove_type == "all":
        return [torch.cat([actual[int(max(a-padding/2,0)):int(min(b+padding/2,len(actual)))] for (a, b) in splits])]

The initial problem was that, when using "all", it concatenates all of the individual clips, and each clip includes the pad at its beginning and the pad at its end. If there are a lot of splits, adjacent pads can overlap by up to a full padding length. My proposed fix is to halve the padding when combining all of the clips, so that adjacent padded splits at most touch.
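To make the overlap concrete, here is a small sketch with made-up numbers (`clip_len`, `padding`, and `splits` are illustrative, not values from the library): padding each split on both sides can push the concatenated length past the original clip, while halving the padding keeps it within bounds.

```python
# Hypothetical numbers chosen to illustrate the overlap; not from the library.
clip_len = 1000
padding = 100
splits = [(0, 400), (520, 1000)]  # non-silent regions, 120-sample gap between them

def kept_samples(splits, pad, clip_len):
    # Total samples kept when each split is padded independently and then
    # concatenated, as the "all" branch does.
    return sum(min(b + pad, clip_len) - max(a - pad, 0) for a, b in splits)

full = kept_samples(splits, padding, clip_len)        # pads overlap in the gap
half = kept_samples(splits, padding // 2, clip_len)   # proposed halved padding

print(full, half)  # 1080 980 -- full padding exceeds the original 1000 samples
```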

rbracco commented 5 years ago

Hey, thanks for the detailed bug report and PR. We have a Telegram group with a few people working on the project, as well as a thread here: fastai audio thread, where there's an active discussion. You seem like you'd be a great addition if you'd like to join either.

What about a function that takes the splits from librosa.effects.split and merges any splits that would overlap in a way that causes this problem? For example, with sr = 16000 and padding at 100ms (1600 samples), splits of (0, 12000) and (14000, 30000) would trigger it: once padded they become (0, 13600) and (12400, 31600), so the silence from 12400 to 13600 would be included twice. We could instead merge those two splits into (0, 30000).
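The arithmetic above can be checked directly (using the example values from this comment, not library defaults):

```python
sr = 16000
pad = int(0.100 * sr)      # 100 ms of padding = 1600 samples
splits = [(0, 12000), (14000, 30000)]

# Pad each split independently, as the "all" branch effectively does.
padded = [(max(a - pad, 0), b + pad) for a, b in splits]
print(padded)              # [(0, 13600), (12400, 31600)]

# The first padded split ends after the second begins, so samples
# 12400-13600 would appear twice in the concatenated output.
overlap = padded[0][1] - padded[1][0]
print(overlap)             # 1200 duplicated samples
```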

Below is some (probably suboptimal) code to do that. So if the remove_silence type is 'all' we would just add splits = merge_splits(splits, padding). Let me know your thoughts and if you like the idea and have a more elegant way to do it. Thanks again for your contribution.

import numpy as np

def merge_splits(splits, pad):
    # Merge any splits whose padded ranges would overlap, so that no
    # stretch of the clip ends up in the output twice.
    clip_end = splits[-1][1]
    merged = []
    i = 0
    while i < len(splits):
        start = splits[i][0]
        # Absorb every following split whose padded start would overlap
        # the padded end of the current run.
        while splits[i][1] < clip_end and splits[i][1] + pad > splits[i + 1][0] - pad:
            i += 1
        end = splits[i][1]
        # Pad the merged run once, clamped to stay inside the clip.
        merged.append(np.array([max(start - pad, 0), min(end + pad, clip_end)]))
        i += 1
    return np.stack(merged)

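As a quick standalone check (with the merging logic inlined on plain tuples so it runs on its own, and assuming the padded end is meant to be clamped to the final split's end), the example splits from the comment above collapse into a single range:

```python
def merge_overlapping(splits, pad):
    # Sketch of the merging idea: absorb consecutive splits whose padded
    # ranges would overlap, then pad and clamp each merged run once.
    clip_end = splits[-1][1]
    merged, i = [], 0
    while i < len(splits):
        start = splits[i][0]
        # Keep absorbing the next split while the padded ranges overlap.
        while splits[i][1] < clip_end and splits[i][1] + pad > splits[i + 1][0] - pad:
            i += 1
        merged.append((max(start - pad, 0), min(splits[i][1] + pad, clip_end)))
        i += 1
    return merged

result = merge_overlapping([(0, 12000), (14000, 30000)], pad=1600)
print(result)  # [(0, 30000)] -- the two padded splits merge into one
```

With a wider gap, say splits of (0, 5000) and (10000, 30000), the padded ranges don't overlap and both splits survive as separate entries.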
kevinbird15 commented 5 years ago

Thanks @rbracco! I'm on the fastai forums as kevinb. I haven't had anything to contribute to the thread yet, but I'm definitely interested in helping. I don't currently have Telegram, but if there are important discussions happening there, I could create an account to join in.

kevinbird15 commented 5 years ago

resolved by #42