Yujia-Yan / Transkun

A simple yet effective Audio-to-MIDI Automatic Piano Transcription system
MIT License

Some params have grad=None during training #17

Closed: xavriley closed this issue 9 months ago

xavriley commented 9 months ago

Hi,

Thank you very much for this repo - I'm trying to train this model from scratch on some saxophone recordings.

Firstly, I was getting weird errors for

It might be worth mentioning these in the README for people who want to train on something other than Maestro.

The error I'm now encountering occurs during the first epoch:

epoch:0 progress:0.000 step:0  loss:5907.2900 gradNorm:12.11 clipValue:28.85 time:0.39
epoch:0 progress:0.000 step:0  loss:5911.5234 gradNorm:12.17 clipValue:23.27 time:0.38
Warning: detected parameter with no gradient that requires gradient:
torch.Size([90, 256])
pitchEmbedding.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512, 1792])
velocityPredictor.0.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512])
velocityPredictor.0.bias
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512, 512])
velocityPredictor.3.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512])
velocityPredictor.3.bias
Warning: detected parameter with no gradient that requires gradient:
torch.Size([128, 512])
velocityPredictor.6.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([128])
velocityPredictor.6.bias
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512, 1792])
refinedOFPredictor.0.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([512])
refinedOFPredictor.0.bias
Warning: detected parameter with no gradient that requires gradient:
torch.Size([128, 512])
refinedOFPredictor.3.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([128])
refinedOFPredictor.3.bias
Warning: detected parameter with no gradient that requires gradient:
torch.Size([2, 128])
refinedOFPredictor.6.weight
Warning: detected parameter with no gradient that requires gradient:
torch.Size([2])
refinedOFPredictor.6.bias
Traceback (most recent call last):
  File "/import/linux/python/3.8.2/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/import/linux/python/3.8.2/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/import/research_c4dm/jxr01/Skipping-The-Frame-Level/transkun/train.py", line 364, in <module>
    train(0, 1, saved_filename, int(time.time()), args)
  File "/import/research_c4dm/jxr01/Skipping-The-Frame-Level/transkun/train.py", line 199, in train
    average_gradients(model, totalLen, parallel)
  File "/import/research_c4dm/jxr01/Skipping-The-Frame-Level/transkun/TrainUtil.py", line 45, in average_gradients
    param.grad.data /= c
AttributeError: 'NoneType' object has no attribute 'data'

It looks like many of the parameters never have their gradients set. This is strange because, at this point in the run, a backward pass has already completed, so I expected all the gradients to be populated. I'm using the following settings to train:

python3 -m transkun.train --nProcess 1 --batchSize 1 --hopSize 5 --chunkSize 10 --datasetPath "/import/research_c4dm/jxr01/bytedance_piano_transcription/filosax_train/" --datasetMetaFile_train "filosax_data/train.pickle" --datasetMetaFile_val "filosax_data/val.pickle" --augment checkpoint/filosax_model

Can you give me any tips on what to try next?
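
In the meantime, one workaround that avoids the crash (a sketch only, assuming average_gradients in TrainUtil.py loops over model.parameters() as the traceback suggests) is to skip parameters whose grad is None before dividing:

def average_gradients(model, totalLen, parallel):
    # Sketch only: the distributed (parallel) branch is omitted here.
    # Divide each accumulated gradient by totalLen, but skip parameters
    # that took no part in the loss this step (grad is None) instead of
    # crashing on param.grad.data.
    for param in model.parameters():
        if param.grad is None:
            continue
        param.grad.data /= totalLen

This only masks the symptom, though: the parameters with no gradient still wouldn't be trained, so the underlying cause needs finding.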

xavriley commented 9 months ago

I've solved this after a good night's sleep 😅

In my case this was a data issue. I was using a chunk size of 10 seconds, and most of my training data has long notes held towards the end of the piece. With the notesStrictlyContained setting, such a note was removed in some cases, leaving lots of frame activity with no associated note, which caused the gradients to blow up.
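
To illustrate the edge case with a toy sketch (hypothetical names, not transkun's actual chunking code): under strict containment a note must both start and end inside the chunk to be kept, so a note held across the chunk boundary is dropped even though its audio frames remain.

def notes_in_chunk(notes, chunk_start, chunk_end, strictly_contained=True):
    # notes: list of (onset_sec, offset_sec, pitch) tuples
    kept = []
    for onset, offset, pitch in notes:
        if strictly_contained:
            # keep only notes that lie entirely inside the chunk
            if onset >= chunk_start and offset <= chunk_end:
                kept.append((onset, offset, pitch))
        elif onset < chunk_end and offset > chunk_start:
            # otherwise keep any note that overlaps the chunk at all
            kept.append((onset, offset, pitch))
    return kept

# A 10-second chunk near the end of a piece, with a note held from
# 18 s to 35 s: the note is dropped, but its frames are still audible.
print(notes_in_chunk([(18.0, 35.0, 60)], 20.0, 30.0))  # -> []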

The fix in my case was to take 15 seconds off the duration value when building the dataset, which avoids these edge cases in my data. Leaving this here in case it helps others.
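
For reference, the trim looks roughly like this (purely illustrative: the real layout of the metadata pickle may differ, so the duration field name here is an assumption):

import pickle

TRIM_SECONDS = 15.0

with open("filosax_data/train.pickle", "rb") as f:
    meta = pickle.load(f)

# Clip each piece's usable duration so the final chunk never straddles
# a long note held to the end of the recording.
for entry in meta:
    entry["duration"] = max(0.0, entry["duration"] - TRIM_SECONDS)

with open("filosax_data/train_trimmed.pickle", "wb") as f:
    pickle.dump(meta, f)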