Open MHZ9825 opened 4 months ago
Thank you for your message!
I am not sure I understand exactly what you are trying to accomplish.
Is Orchestra a specific dataset? Or do you mean you are using multiple instruments/tracks?
In any case, it is possible to use byte pair encoding (or Unigram or WordPiece) with multiple instruments, whether you tokenize all tracks at once (`config.one_token_stream_for_programs` enabled) or separately. I invite you to take a look at MidiTok's documentation for more details, as this repository is no longer up to date :)
Hi again! Thanks for the reply~ I have checked the documentation and used MidiTok to get the MIDI token ids. However, I still haven't found out how to compress the token sequence using BPE. May I ask how this step should be handled? Can I assume that I need to build my own dataset and then train a tokenizer of my own using BPE?
This is done automatically once the tokenizer has been trained. :)
You can see the documentation of the `encode` method (to tokenize files) and the `encode_token_ids` method (to apply BPE/Unigram to token ids and reduce the sequence). The latter is automatically called by `encode` unless you specify otherwise.
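To illustrate what "reducing the sequence" means, here is a minimal pure-Python sketch of BPE applied to integer token ids. It is a toy illustration, not MidiTok's actual implementation: the helper names `learn_merges`, `merge_pair` and `apply_merges` are hypothetical, and the ids are made up. Training learns which adjacent id pairs occur most often and assigns each a new id; encoding then replaces those pairs, so the sequence gets shorter.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` merges from a corpus of token id sequences."""
    seqs = [list(s) for s in sequences]
    next_id = max(t for s in seqs for t in s) + 1  # first unused id
    merges = {}  # (left_id, right_id) -> merged id, in learning order
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))  # count adjacent pairs
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing worth merging anymore
            break
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges


def apply_merges(seq, merges):
    """Apply learned merges, in learning order, to a new sequence of ids."""
    for pair, new_id in merges.items():
        seq = merge_pair(seq, pair, new_id)
    return seq


# Made-up "base" token ids standing in for tokenized MIDI files.
corpus = [[1, 2, 3, 1, 2, 3, 1, 2], [1, 2, 3, 4]]
merges = learn_merges(corpus, num_merges=10)
compressed = apply_merges([1, 2, 3, 1, 2, 3, 1, 2], merges)
print(merges)      # {(1, 2): 5, (5, 3): 6}
print(compressed)  # [6, 6, 5]
```

Here the 8-token sequence shrinks to 3 tokens. With MidiTok you do not need to write any of this yourself: once the tokenizer has been trained on your own files, the merging step happens as part of encoding.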
Thanks! I'll try it~
Thanks for your brilliant work! However, I'm trying to use it on Orchestra, which means I have to get the hidden features from multi-instrument music. Is there a way to get these hidden features? It seems I can only get the tokens without the instrument information.