Open lamipaul opened 5 months ago
Hi! Thanks a lot for your interest in our work! Yeah, I get your point. In my specific case I only worked on long enough files with a sample rate of 16 kHz, so the encoder input dimension is not completely fixed but determined automatically as the maximal size for this setup, depending on the number of bins per semitone you're using. To be honest, I'm not sure there is a completely general formula one could derive directly from the other parameters of the config, because I guess it would depend, among other things, on the sampling rate, which I want to keep inferred dynamically. Maybe I'm missing something, though.
Otherwise, I thought about instantiating the layers lazily, but that's not super well supported, so I gave up on it.
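For illustration, here is a rough sketch of how the number of CQT bins (and hence the encoder input size) falls out of `fmin`, `fmax` and the bins-per-semitone setting. The helper name and the exact rounding are my assumptions, not the repo's actual code, and the real value also depends on the sampling rate and cropping, which is exactly why it's hard to infer from the config alone:

```python
import math

def approx_cqt_bins(fmin: float, fmax: float, bins_per_semitone: int) -> int:
    """Rough estimate of the number of CQT bins between fmin and fmax.

    Purely illustrative: 12 semitones per octave, rounded to the
    nearest integer. The actual encoder input dimension additionally
    depends on the sampling rate and on any cropping of the frames.
    """
    n_octaves = math.log2(fmax / fmin)
    return round(12 * bins_per_semitone * n_octaves)

# e.g. 3 bins per semitone over a 6-octave range gives the 216 bins
# mentioned below as the expected frame size:
print(approx_cqt_bins(32.7, 32.7 * 2**6, 3))  # 216
```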
Hi aRI0U,
I understand there is no easy way to infer the encoder input dimension; I came to the same realisation while thinking about how to fix it. No worries, it's not too much pain to wait for the first crash and fix the config manually.
I thought I'd take the opportunity of this thread to report a few other small bugs I encountered when using my model trained with this repo to run predictions with the "main" one:
- `model.py` L.141: I had to add an extra dimension with `x.unsqueeze(1)` before the layernorm
- `loader.py` L.47: `hparam['reduction']` is not defined in the saved weights (I replaced it by `"awa"`)
- `core.py` L.31: I added a squeeze of the first remaining dim of size 1 so predictions have the same shape as timesteps
- `core.py` L.36: I added a division by 1000 to return timesteps in seconds, since the hop size is in milliseconds
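To make the last two fixes concrete, here is a small NumPy stand-in for what they do (the names `preds` and `hop_size` are mine, and the shapes are made up; the actual repo code uses torch tensors):

```python
import numpy as np

# Hypothetical model output with a leftover batch dim of size 1,
# and a hop size given in milliseconds, as in the thread above.
preds = np.zeros((1, 500))
hop_size = 10.0  # milliseconds

# Fix 1: squeeze the size-1 leading dim so predictions and
# timesteps end up with the same shape.
preds = preds.squeeze(0)  # shape (500,)

# Fix 2: divide by 1000 so timesteps come out in seconds.
timesteps = np.arange(len(preds)) * hop_size / 1000

print(preds.shape, timesteps[-1])  # (500,) 4.99
```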
I sincerely apologise if any or all of this is irrelevant. Thanks again for the great work!
Hi, and thanks a lot for this really cool repo!
I am trying to train pesto on my own data, and I had to change fmin and fmax (my audio files are relatively short, and otherwise the CQT kernels would not fit). As a result, my CQT frames have a different size than expected (157 and 125 bins before and after cropping respectively, instead of 216). This causes a crash when entering the layernorm of the ResNet encoder, because its input dimension seems to be fixed. I could bypass this by manually setting
`model.encoder.n_bins_in = 125`, but I guess it would be nice to have it adapt automatically. Really sorry if I misunderstood something and this is not a bug.