Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶
https://miditok.readthedocs.io/
MIT License
689 stars, 82 forks

KeyError in midi_tokenizer.py line 1712 #197

Closed kroll-software closed 2 weeks ago

kroll-software commented 1 month ago

Hi, I'm processing the huge bread-midi-dataset.

Training the tokenizer throws a KeyError in midi_tokenizer.py line 1712: ids = [self.vocab[token] for token in tokens]

Maybe line 1709 needs a fix as well.

I made a quick and dirty workaround by catching these very rare errors, but maybe there's a better solution.
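A minimal sketch of the kind of workaround described above, assuming the vocabulary behaves like a plain dict mapping token strings to ids. The function name `tokens_to_ids_lenient` and the toy vocabulary are hypothetical and not part of MidiTok. It only illustrates skipping unknown tokens while recording them, instead of letting the list comprehension raise.

```python
from typing import Dict, List, Tuple


def tokens_to_ids_lenient(
    vocab: Dict[str, int], tokens: List[str]
) -> Tuple[List[int], List[str]]:
    """Map tokens to ids, collecting unknown tokens instead of raising KeyError."""
    ids: List[int] = []
    missing: List[str] = []
    for token in tokens:
        try:
            ids.append(vocab[token])
        except KeyError:
            # Unknown token: remember it so the offending file can be inspected later.
            missing.append(token)
    return ids, missing


# Hypothetical toy vocabulary, for illustration only.
vocab = {"Bar_None": 0, "Pitch_60": 1, "Velocity_100": 2}
ids, missing = tokens_to_ids_lenient(vocab, ["Bar_None", "Pitch_60", "Pitch_999"])
print(ids, missing)
```

Logging the missing tokens rather than silently dropping them makes it easier to trace which files produce tokens outside the vocabulary.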

Hope it helps.

Natooz commented 1 month ago

Hi, ty for the report! Would you be able to provide a short reproducible example in order to fix the cause of the issue?

kroll-software commented 1 month ago

> Would you be able to provide a short reproducible example in order to fix the cause of the issue?

Dataset: https://huggingface.co/datasets/breadlicker45/bread-midi-dataset

It took 9 hours to train the MidiTokenizer (REMI) with use_programs = True

In the last step (I think it was "Merge words"), it tried to allocate more than 200 GiB of RAM, which is more than I have ;)

So I was only able to train the tokenizer on a sub-set of the data.
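One way to pick such a subset reproducibly is to sample file paths with a fixed seed. This is only a sketch: `sample_subset`, the file names, and the subset size are hypothetical, and the resulting list would then be passed to whatever training call is in use.

```python
import random
from pathlib import Path
from typing import List, Sequence


def sample_subset(paths: Sequence[Path], max_files: int, seed: int = 0) -> List[Path]:
    """Return at most max_files paths, chosen reproducibly with a seeded RNG."""
    if max_files >= len(paths):
        return list(paths)
    rng = random.Random(seed)
    return rng.sample(list(paths), max_files)


# Hypothetical file list; in practice this could come from Path(...).glob("**/*.mid").
all_paths = [Path(f"song_{i}.mid") for i in range(1000)]
subset = sample_subset(all_paths, 100)
print(len(subset))
```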

Later, I split the dataset into chunks with content_length = 2048 and then created a tokenized JSON dataset with tokenizer.tokenize_dataset().

The JSONDataset is much faster than the MIDIDataset: iterating through it with a DataCollator takes 2 days less ;)
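The chunk-then-cache idea described above could look roughly like the following. This is a sketch in plain Python rather than MidiTok's own implementation: `split_into_chunks`, `save_chunks`, and the `{"ids": [...]}` file layout are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def split_into_chunks(ids: List[int], chunk_len: int) -> List[List[int]]:
    """Cut a token id sequence into consecutive chunks of at most chunk_len ids."""
    return [ids[i : i + chunk_len] for i in range(0, len(ids), chunk_len)]


def save_chunks(chunks: List[List[int]], out_dir: Path, stem: str) -> List[Path]:
    """Write each chunk to its own JSON file so it can be loaded without re-tokenizing."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, chunk in enumerate(chunks):
        path = out_dir / f"{stem}_{n}.json"
        path.write_text(json.dumps({"ids": chunk}))
        paths.append(path)
    return paths


# Stand-in for the token ids of one tokenized file.
ids = list(range(5000))
chunks = split_into_chunks(ids, 2048)

with tempfile.TemporaryDirectory() as tmp:
    written = save_chunks(chunks, Path(tmp), "example")
    reloaded = json.loads(written[-1].read_text())["ids"]

print([len(c) for c in chunks])  # [2048, 2048, 904]
```

Tokenizing once and reading the cached ids afterwards avoids parsing every MIDI file again on each pass, which is the likely source of the speed difference reported above.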

These are my experiences. I'd suggest that you also try these steps with the huge bread-midi-dataset.

I really like and appreciate the MidiTok project, and after some workarounds I got it to run. Let's see what comes out.

Best regards,

Natooz commented 1 month ago

Thank you for the feedback!

I think the original issue occurs in the data collator when all the elements of the batch are None. In this case I’m not sure how to resolve it, as it is unclear how models will handle empty batches in the training loop. The only sure way to prevent it is for the user to curate the data and remove invalid entries.
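To illustrate the all-None case, here is a minimal collator sketch in plain Python. It is not MidiTok's DataCollator: the name `collate_skip_none`, the `input_ids` key, and the choice to return None for an empty batch are assumptions, and the training loop has to decide what to do with that None.

```python
from typing import Dict, List, Optional

Sample = Optional[Dict[str, List[int]]]
Batch = Dict[str, List[List[int]]]


def collate_skip_none(batch: List[Sample], pad_id: int = 0) -> Optional[Batch]:
    """Drop None samples, then right-pad the rest; return None if nothing is left."""
    valid = [sample for sample in batch if sample is not None]
    if not valid:
        # Every element was invalid: signal this explicitly instead of failing later.
        return None
    max_len = max(len(sample["input_ids"]) for sample in valid)
    input_ids, attention_mask = [], []
    for sample in valid:
        seq = sample["input_ids"]
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


mixed = collate_skip_none([{"input_ids": [5, 6, 7]}, None, {"input_ids": [8]}])
empty = collate_skip_none([None, None])
print(mixed, empty)
```

A training loop using this sketch would skip the step when the collator returns None (for example `if batch is None: continue`). Filtering invalid entries out of the dataset beforehand, as suggested above, avoids the situation altogether.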

I’ll investigate the runtimes of the Dataset methods. Do you have a code snippet showing how you used the DatasetMIDI class?

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.