Closed aodonnell closed 5 years ago
Looking into it a bit, it seems we can eliminate more than 3/4 of our frequency bins and still preserve enough meaningful information about the track vocals. This is HUGE, since it significantly reduces the input dimensionality of our BiRNN Encoder 😄 -> 🕶 -> 😎. I've already implemented this in https://github.com/dawg/models/tree/feature/separation-optimizers, so we can close this issue once it is merged.
Expected Behaviour
We need to make sure that we can load data quickly enough to avoid a bottleneck. This needs to be done ASAP, since I would like us to start training next week.
Current Behaviour
With preservation of both phase and magnitude information, our samples are now HUGE. This means that loading and unloading a single sample is an expensive task.
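To put a rough number on "HUGE", here is a back-of-the-envelope sketch of how large one sample gets when both magnitude and phase are stored. The function name `spectrogram_bytes` and every parameter value (FFT size, hop length, stereo, float32) are assumptions for illustration, not the settings used in the repo.

```python
def spectrogram_bytes(duration_s, sample_rate=44100, n_fft=2048, hop=512,
                      channels=2, components=2, bytes_per_value=4,
                      keep_fraction=1.0):
    """Estimate the in-memory size of one STFT sample.

    components=2 means magnitude and phase are both stored.
    keep_fraction is the share of frequency bins that is kept.
    All defaults are illustrative assumptions.
    """
    n_samples = int(duration_s * sample_rate)
    # Number of full frames without padding.
    n_frames = 1 + (n_samples - n_fft) // hop
    # A real-input FFT yields n_fft // 2 + 1 non-redundant bins.
    n_bins = int((n_fft // 2 + 1) * keep_fraction)
    return n_bins * n_frames * channels * components * bytes_per_value
```

Under these assumptions a three-minute stereo track comes out at a few hundred megabytes, and the size scales linearly with the number of bins kept, so dropping bins translates directly into less data to load.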
Suggested Fix
Someone could research whether it's common to split an audio file into multiple smaller samples, or whether there are other industry techniques for throttling frequency information. Another thing we can consider is ditching some of the frequency bins. Looking at the logarithmic spectral density below, the majority of the signal is contained within roughly half of the spectrum we compute the STFT for. Dropping the rest will sacrifice some timbre information contained in the upper harmonics, but I don't really think it will affect the overall sound enough to be noticeable.
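Both ideas above (dropping the upper frequency bins and splitting a track into smaller samples) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code on the feature branch: the helper names `stft`, `crop_bins`, `energy_retained`, and `split_frames`, as well as the FFT size and hop length, are all hypothetical.

```python
import numpy as np


def stft(signal, n_fft=1024, hop=256):
    """Naive STFT. Returns a complex array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)
    ])
    # rfft keeps only the non-redundant bins of a real-valued signal.
    return np.fft.rfft(frames, axis=1).T


def crop_bins(spec, keep_fraction=0.5):
    """Keep only the lowest `keep_fraction` of the frequency bins."""
    n_keep = max(1, int(spec.shape[0] * keep_fraction))
    return spec[:n_keep]


def energy_retained(spec, cropped):
    """Fraction of the total spectral energy that survives cropping."""
    total = np.sum(np.abs(spec) ** 2)
    return float(np.sum(np.abs(cropped) ** 2) / total)


def split_frames(spec, frames_per_chunk):
    """Split a spectrogram along time into equal chunks, dropping the remainder."""
    n_chunks = spec.shape[1] // frames_per_chunk
    return [
        spec[:, i * frames_per_chunk: (i + 1) * frames_per_chunk]
        for i in range(n_chunks)
    ]
```

Running `energy_retained` over a handful of real tracks for a few values of `keep_fraction` would be one way to check how much signal is actually lost before committing to a cutoff. Note that it only measures energy, so it says nothing about how audible the missing harmonics are.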