chrisdonahue / ddc_onset

Music onset detector from Dance Dance Convolution packaged as a lightweight PyTorch module
MIT License

Questions about the training procedure used - and if this actually reproduces the 2017 paper. #1

Closed stet-stet closed 1 year ago

stet-stet commented 1 year ago

Hello! Thank you for posting the model and the code.

I am writing to inquire about the following:

  1. Which of the pretrained DDC step placement models do these weights correspond to, i.e., which row of Table 2 / which training dataset?
  2. Are there any differences between this port and the original 2017 implementation, e.g., in the spectrogram preprocessing?
  3. Does this repo also port the C-LSTM placement model?

Here is some context on why I am asking, just for your information.

I have been studying a way to improve upon DDC, using an alternate formulation to eliminate the binary class imbalance discussed in your paper. I am almost done with a viable demo and evaluations, so I am hoping to submit my work to the ISMIR LBD track if possible. I have gathered another dataset to train and evaluate on - this set is the only thing that enables my model to actually converge. So far, I have fine-tuned this pretrained model on Fraxtil/ITG and then compared the F1 metrics with the numbers reported in your paper. While this does give superior metrics, I still wish to ascertain whether this is entirely because I used more data in pretraining, or whether my alternate problem formulation actually did help.
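For concreteness, here is a minimal sketch of the kind of F1 comparison I mean, assuming predicted onsets are matched greedily to reference onsets within a tolerance window (the 20 ms value is my assumption, not necessarily the paper's exact evaluation setting):

```python
def onset_f1(pred_times, true_times, tol=0.02):
    """F1 between predicted and reference onset times (in seconds), matching each
    prediction to at most one reference onset within +/- tol (assumed 20 ms here)."""
    pred, true = sorted(pred_times), sorted(true_times)
    matched = i = j = 0
    while i < len(pred) and j < len(true):
        if abs(pred[i] - true[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif pred[i] < true[j]:
            i += 1
        else:
            j += 1
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(true) if true else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```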

Now, until yesterday I had no way to investigate this, since your past code throws CUDA- or cuBLAS-related errors on every piece of hardware I could lay my hands on, and I failed to troubleshoot this after a few days of attempts. I was daunted by the prospect of having to re-implement and validate the whole model and the eval pipeline in TensorFlow 1.0 - a framework whose documentation has seemingly been scrubbed off the face of the Earth by its makers.

However, this repo opens up quite a range of possibilities for me. Now I can actually work with a piece of code that you, the author, wrote and possibly validated, and which apparently does the same thing as the code published back in 2017. This is why I am led to ask the questions above.

I would be delighted to hear back from you. Thank you!

chrisdonahue commented 1 year ago

Thanks for the questions.

  1. This is a direct port of one of the pre-trained step placement models from DDC (specifically, these weights). It is either the 3rd or 7th row of Table 2 in the paper. I'm fairly confident it's the 3rd row (CNN / Fraxtil), though it's possible it was trained on ITG - I've lost track over the years but could probably figure out which it is if this is important. I recycled the pretrained model for Beat Sage as it worked surprisingly well. I went to considerable lengths to ensure that this ported model produces results that are identical (within some epsilon of numerical precision) to the reference TF 1.0 implementation (a small parity-check sketch follows this list).
  2. No difference. This is the same model. I simply ported the essentia spectrogram preprocessing to PyTorch for convenience. My functional test checks against outputs from essentia for correctness.
  3. I did not port the C-LSTM. The original codebase is still the right reference for that. For Beat Sage, the ever-so-slightly higher performance of the C-LSTM was not worth the additional implementation complexity / compute overhead.
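A minimal sketch of what such a parity check can look like, assuming the module exposes a spectrogram extractor and the ported CNN (the names `SpectrogramExtractor` and `PlacementCNN`, the file paths, and the tolerance below are illustrative assumptions, not necessarily the actual ddc_onset API):

```python
import numpy as np
import torch

# Illustrative entry points - the actual ddc_onset names may differ.
from ddc_onset import SpectrogramExtractor, PlacementCNN  # assumption

# Assumed inputs: a 44.1 kHz mono clip and per-frame scores saved from the TF 1.0 model.
audio = torch.from_numpy(np.load("clip_44khz_mono.npy")).float()
ref_scores = np.load("clip_scores_tf1.npy")

extractor = SpectrogramExtractor()  # ports the essentia spectrogram preprocessing
model = PlacementCNN()              # assuming this loads the ported pretrained weights
extractor.eval()
model.eval()

with torch.no_grad():
    feats = extractor(audio.unsqueeze(0))      # log-mel spectrograms, batch dim added
    scores = model(feats).squeeze(0).numpy()   # per-frame onset scores

# "Identical within some epsilon": compare to the reference with a small absolute tolerance.
assert np.allclose(scores, ref_scores, atol=1e-4), "port diverges from the TF 1.0 reference"
```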

Excited to hear you're iterating on this research direction! Happy to provide additional info to aid your investigation.

Re: more data vs. new formulation - why can't you simply train your model from scratch on Fraxtil / ITG and compare performance with your fine-tuned model? Then you would compare: (1) my model trained from scratch, (2) your proposed model trained from scratch on the same data, and (3) your proposed model pre-trained and then fine-tuned on the same data. Shouldn't this properly ablate the effects of pre-training with your new formulation?
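A tiny sketch of that comparison grid, purely for illustration (the dataset and model identifiers are placeholders, and `train_and_eval` is a hypothetical helper, not part of ddc_onset):

```python
# Placeholder experiment grid for the three-way comparison above; all identifiers
# are illustrative, and train_and_eval is a hypothetical helper, not a real API.
runs = {
    "ddc_cnn_scratch":    {"model": "ddc_cnn",  "init": "random",     "train_data": "fraxtil_itg"},
    "proposed_scratch":   {"model": "proposed", "init": "random",     "train_data": "fraxtil_itg"},
    "proposed_finetuned": {"model": "proposed", "init": "pretrained", "train_data": "fraxtil_itg"},
}

for name, cfg in runs.items():
    print(name, cfg)  # train_and_eval(cfg) would report F1 on a shared held-out split
```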

Unfortunately, I never bothered porting the training code to PyTorch. The TF 1.0 codebase is still the right reference for that. At this point it probably would only work in Docker.

stet-stet commented 1 year ago

Thank you for the confirmations! I understand how a low-overhead model may be desirable for a real-world service such as Beat Sage, and how a <0.01-point difference in F1 may not be so meaningful.

Regarding your suggestion: thank you! I see that comparing (2) and (3) would help illustrate the effect of having more data under our alternate formulation. However, my problem was that the metrics for (2) came out quite low compared to (1) or (3). I was afraid that our proposed method might be seen as less worthwhile than it really is, since I did find our model to perform much, much better on a larger dataset.

So what I actually want to demonstrate is that, given access to enough data, our alternate approach/formulation helps. Although comparing (2) and (3) may certainly support this claim, I thought it might not be enough, since I would also expect the baselines to do better with more data. So I wrote my own training code (in PyTorch) to compare the models on a larger dataset. The test is running right now.

Again, thank you very much for the information, and thank you for this awesome repo!

chrisdonahue commented 1 year ago

Gotcha. This makes sense! Yes, I suspect that the tiny CNN we used for onset placement may not benefit as much from additional training data compared to a more modern approach. Looking forward to reading about your findings!

stet-stet commented 12 months ago

Hello! If you're still interested, here is my demo page, which includes the link to the extended abstract & code used.