How do you construct the training pairs? It looks like they are built in the SegmentMixer class. Do you draw both the "mixture" and the "target audio" from sources within the same minibatch?
There might be one issue with in-batch mixing:

- source1: male speech (s1)
- source2: another male speech (s2)

If mixture = s1 + s2, both captions are "male speech", so the text query cannot tell the model which speaker is the target. Won't that confuse the training?
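To make the concern concrete, here is a minimal sketch of what I mean by in-batch pairing. The function name and the roll-by-one pairing scheme are my assumptions, not your implementation; the point is only that identical captions can end up on both sides of a mixture:

```python
import torch

def mix_in_batch(sources: torch.Tensor, captions: list[str]):
    """Hypothetical in-batch mixer: each source is mixed with the
    previous source in the minibatch (roll by one along the batch dim)."""
    interference = torch.roll(sources, shifts=1, dims=0)
    mixtures = sources + interference
    # Flag pairs whose captions collide, e.g. "male speech" + "male speech":
    # the text query then cannot disambiguate the target within the mixture.
    n = len(captions)
    collisions = [captions[i] == captions[(i - 1) % n] for i in range(n)]
    return mixtures, sources, collisions

# Two "male speech" clips landing in the same batch triggers the collision.
batch = torch.randn(2, 16000)
caps = ["male speech", "male speech"]
mixtures, targets, clash = mix_in_batch(batch, caps)
# clash == [True, True] -> every pair in this batch is ambiguous
```

A simple mitigation might be to resample the pairing (or skip the pair) whenever the two captions match, but I'm curious whether you handle this in training.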
Nice work!