Closed: kroll-software closed this issue 1 month ago
Hi, thank you for the report! Would you be able to provide a short reproducible example so we can find and fix the cause of the issue?
Dataset: https://huggingface.co/datasets/breadlicker45/bread-midi-dataset
It took 9 hours to train the REMI tokenizer with use_programs = True.
In the last step, "Merge words" I believe, it tried to allocate more than 200 GiB of RAM, more than I have ;)
So I was only able to train the tokenizer on a subset of the data.
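For reference, my training setup was roughly the following (a sketch using MidiTok 3.x names; the vocabulary size and paths are placeholders, not my exact values):

```python
from pathlib import Path

from miditok import REMI, TokenizerConfig

# Build a REMI tokenizer with program tokens enabled.
config = TokenizerConfig(use_programs=True)
tokenizer = REMI(config)

# Collect the MIDI files of the dataset (placeholder path).
midi_paths = list(Path("bread-midi-dataset").glob("**/*.mid"))

# This is the step that ran for ~9 hours and finally tried to allocate
# >200 GiB in its last phase; training on a subset of the paths avoided
# the out-of-memory.
tokenizer.train(vocab_size=30000, files_paths=midi_paths)
```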
Later I split the dataset into chunks with content_length = 2048 and then created a tokenized JSON dataset with tokenizer.tokenize_dataset().
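The chunking step looked roughly like this (a sketch; the helper and argument names follow MidiTok 3.x, so content_length above corresponds to max_seq_len here):

```python
from pathlib import Path

from miditok.utils import split_files_for_training

# Split each MIDI into chunks whose token sequences fit the context
# length (2048 here).
split_paths = split_files_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("midi_chunks"),
    max_seq_len=2048,
)

# Tokenize every chunk once, writing one JSON file of token ids per chunk.
tokenizer.tokenize_dataset(split_paths, Path("tokens_json"))
```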
The DatasetJSON is much faster than the DatasetMIDI: iterating through it with a DataCollator takes 2 days less ;)
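Continuing the sketch above, the iteration follows the usual PyTorch pattern (the batch size and the special-token lookups are placeholders):

```python
from pathlib import Path

from torch.utils.data import DataLoader
from miditok.pytorch_data import DataCollator, DatasetJSON

# DatasetJSON reads the saved token ids directly, so no MIDI parsing or
# tokenization happens per sample, hence the large speed-up.
dataset = DatasetJSON(
    files_paths=list(Path("tokens_json").glob("**/*.json")),
    max_seq_len=2048,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(tokenizer.pad_token_id, copy_inputs_as_labels=True)
loader = DataLoader(dataset, batch_size=16, collate_fn=collator)

for batch in loader:
    ...  # feed the padded batch to the model
```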
These are my experiences. I'd suggest that you also try these steps with the huge bread-midi-dataset.
I really like and appreciate the MidiTok project; after some workarounds I got it to run. Let's see what comes out.
Best regards,
Thank you for the feedback!
I think the original issue occurs in the data collator when all the elements of the batch are None. In that case I'm not sure how to resolve it, as it is unclear how models would handle empty batches in the training loop. The only sure way to prevent it is for the user to curate the data and remove invalid entries.
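If it helps in the meantime, invalid samples can be filtered out before collating with a small wrapper (a hypothetical helper, not part of MidiTok):

```python
def safe_collate(batch, collator):
    """Hypothetical wrapper: drop None samples before collating.

    If every sample in the batch is invalid there is nothing sensible to
    feed the model, so fail loudly instead of building an empty batch.
    """
    batch = [sample for sample in batch if sample is not None]
    if not batch:
        raise ValueError("All samples in this batch were invalid (None).")
    return collator(batch)
```

It could then be passed to the DataLoader via collate_fn=lambda batch: safe_collate(batch, collator).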
I’ll investigate the runtimes of the Dataset methods. Do you have a code snippet showing how you used the DatasetMIDI class?
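For reference, the pattern I would expect is along these lines (a sketch with placeholder paths; argument names as in MidiTok 3.x):

```python
from pathlib import Path

from miditok.pytorch_data import DatasetMIDI

# DatasetMIDI loads and tokenizes each MIDI file on access, which is the
# per-sample cost that DatasetJSON avoids.
dataset = DatasetMIDI(
    files_paths=list(Path("midi_chunks").glob("**/*.mid")),
    tokenizer=tokenizer,
    max_seq_len=2048,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
```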
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi, I'm processing the huge bread-midi-dataset.
Training the tokenizer throws a KeyError in midi_tokenizer.py line 1712: ids = [self.vocab[token] for token in tokens]
Maybe line 1709 needs a fix as well.
I made a quick and dirty workaround by catching these very rare errors; maybe there's a better solution.
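The workaround boils down to this (a sketch of my patch, not the actual MidiTok code; the function name is made up):

```python
def tokens_to_ids(vocab: dict, tokens: list) -> list:
    """Sketch of the workaround: skip tokens missing from the vocabulary
    instead of letting the KeyError from vocab[token] abort the run."""
    ids = []
    for token in tokens:
        try:
            ids.append(vocab[token])
        except KeyError:
            # Very rare: a token produced during tokenization is absent
            # from the learned vocabulary; drop it rather than crashing.
            continue
    return ids
```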
Hope it helps.