Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶
https://miditok.readthedocs.io/
MIT License

Special token question #140

Closed: oiabtt closed this issue 7 months ago

oiabtt commented 7 months ago

I'm looking at the documentation and code, and I noticed that the special tokens ["PAD", "BOS", "EOS", "MASK"] are included in the configuration by default (https://miditok.readthedocs.io/en/v3.0.0/bases.html#special-tokens). However, in my experiments, the generated tokens do not include BOS.

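from miditok import Structured, TokenizerConfig
from miditoolkit import MidiFile  # assumption: the post does not show where MidiFile comes from

config = TokenizerConfig()
tokenizer = Structured(config)
midi = MidiFile(midi_path)  # midi_path: path to a MIDI file on disk
tokens = tokenizer(midi)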

tokens
[['TimeShift_2.1.8',  # No BOS
  'Pitch_71',
...

1. Should I manually add BOS to the tokens? Or is there a way to let the tokenizer automatically add it through the configuration (I couldn't find such a feature in the documentation)?
2. If manual addition of BOS is required, what is the best practice for doing so?

Natooz commented 7 months ago

Indeed, adding the BOS / EOS / PAD tokens is left to the user. As in NLP, the usual practice is to use a data collator that adds them when building the batches fed to the model. If you are using PyTorch, you can use miditok.pytorch_data.DataCollator directly.
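
To make the collator idea concrete, here is a minimal pure-Python sketch of what such a collator does: wrap each token-id sequence in BOS/EOS, then pad the batch to a common length. It is not MidiTok's implementation. The function name collate_with_special_tokens and the ids 0, 1, 2 are hypothetical and chosen only for illustration; in practice you would take the real ids from your tokenizer's vocabulary, or simply use the library's own collator.

```python
from typing import Dict, List

# Hypothetical special-token ids, for illustration only. In practice,
# look them up in your tokenizer's vocabulary.
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2


def collate_with_special_tokens(
    sequences: List[List[int]],
    pad_id: int = PAD_ID,
    bos_id: int = BOS_ID,
    eos_id: int = EOS_ID,
) -> Dict[str, List[List[int]]]:
    """Wrap each sequence in BOS/EOS, then right-pad to the longest one."""
    wrapped = [[bos_id, *seq, eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]
    # 1 marks a real token (including BOS/EOS), 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in wrapped
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of token ids with different lengths.
batch = collate_with_special_tokens([[10, 11, 12], [20]])
print(batch["input_ids"])  # [[1, 10, 11, 12, 2], [1, 20, 2, 0, 0]]
```

Doing this at batch time rather than at tokenization time keeps the stored token sequences free of special tokens, so the same data can be chunked, augmented, or padded differently without re-tokenizing.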