Hi @envilk, thanks for your comment and for this bug report ! I'll look into it in the next few days to fix it.
My guess is that the decoded token is not of type TimeSignature (3.6.8 looks like a Duration token). A check might solve it.
Also for the Octuple, CP Word and MuMIDI tokenizations, I will soon give an update so that each tokenizer has several vocabularies, one for each token type. This makes it easier to create Embedding layers of appropriate sizes, and lets a model return several sequences of logits of the associated sizes.
Nathan
Thank you for your fast reply!
I'll be looking forward to the next update :)
I was also struggling with the vocabulary recently, because I didn't know how to calculate the overall size (maybe that's simply because I'm missing something). What I'm doing now is taking the max integer in the token list and adding one, and using that as my vocabulary size.
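That is, something like this (just to illustrate what I currently do; token_seqs here stands for my tokenized dataset, a list of token lists):
# current workaround: vocabulary size = highest token id seen in the dataset + 1
vocab_size = max(max(seq) for seq in token_seqs) + 1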
👋
I just released v1.1.11 which should solve the crash. By default it checks if the given token type is correct before decoding its value : d930de5f34782d2afd04934035448a9e758e774b
Concerning the vocabulary size, you can get it simply with len(tokenizer.vocab).
But for Octuple it is a bit tricky as there is currently only one vocabulary object for every token type.
I'll comment just below some code that implements Octuple with several Vocabulary objects, which allows you to create several torch.nn.Embedding layers (or the TF / JAX equivalent) for the model input and torch.nn.Linear layers for the output, all of different sizes.
It's however not multitrack and doesn't handle TimeSignature or Tempo, but these could easily be added with the code from Octuple.
Multi-vocabulary (and light) version of Octuple:
from typing import List, Tuple, Dict, Union, Optional, Any
from pathlib import Path, PurePath
import json
from math import ceil
from miditok import MIDITokenizer, Vocabulary, Event
from miditok.constants import MIDI_INSTRUMENTS
from miditoolkit import Instrument, Note, TempoChange
import numpy as np
from constants import PITCH_RANGE, NB_VELOCITIES, ADDITIONAL_TOKENS, BEAT_RES, TIME_DIVISION, TEMPO
class BarPosDurationAllMerged(MIDITokenizer):
""" Modified version of Octuple with no Program (Track) tokens
To use mainly for tasks handling a single track.
:param pitch_range: range of used MIDI pitches
:param beat_res: beat resolutions, with the form:
{(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, ...}
The keys of the dict are tuples indicating a range of beats, ex 0 to 3 for the first bar
The values are the resolution, in samples per beat, of the given range, ex 8
:param nb_velocities: number of velocity bins
:param additional_tokens: specifies additional tokens (time signature, tempo)
:param sos_eos_tokens: adds Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary
:param mask: will add a MASK token to the vocabulary (default: True)
:param params: can be a path to the parameter (json encoded) file or a dictionary
"""
def __init__(self, pitch_range: range = PITCH_RANGE, beat_res: Dict[Tuple[int, int], int] = BEAT_RES,
nb_velocities: int = NB_VELOCITIES, additional_tokens: Dict[str, bool] = ADDITIONAL_TOKENS,
sos_eos_tokens: bool = False, mask: bool = True, params=None):
additional_tokens['Chord'] = False # Incompatible additional token
additional_tokens['Rest'] = False
additional_tokens['Tempo'] = False
additional_tokens['Program'] = False
# used in place of positional encoding
self.max_bar_embedding = 60 # this attribute might increase during encoding
super().__init__(pitch_range, beat_res, nb_velocities, additional_tokens, sos_eos_tokens, mask, params)
def save_params(self, out_dir: Union[str, Path, PurePath]):
""" Override the parent class method to include additional parameter drum pitch range
Saves the base parameters of this encoding in a txt file
Useful to keep track of how a dataset has been tokenized / encoded
It will also save the name of the class used, i.e. the encoding strategy
:param out_dir: output directory to save the file
"""
Path(out_dir).mkdir(parents=True, exist_ok=True)
with open(PurePath(out_dir, 'config').with_suffix(".txt"), 'w') as outfile:
json.dump({'pitch_range': (self.pitch_range.start, self.pitch_range.stop),
'beat_res': {f'{k1}_{k2}': v for (k1, k2), v in self.beat_res.items()},
'nb_velocities': len(self.velocities),
'additional_tokens': self.additional_tokens,
'encoding': self.__class__.__name__,
'max_bar_embedding': self.max_bar_embedding},
outfile)
def add_embedded_pos_enc(self, sample: List[List[int]]) -> List[List[int]]:
"""Adapt the Bar and Position values of a sample split from a bigger sample.
Bars will begin at 0 and be incremented.
:param sample: sample to adapt time
:return: this same sample with bars beginning from 0
"""
first_bar = int(self.vocab[-1].token_to_event[sample[0][-1]].split('_')[1])
for i in range(len(sample)):
new_bar = int(self.vocab[-1].token_to_event[sample[i][-1]].split('_')[1]) - first_bar
sample[i][-1] = self.vocab[-1].event_to_token[f'Bar_{new_bar}']
return sample
def track_to_tokens(self, track: Instrument) -> List[List[int]]:
""" Converts a track (miditoolkit.Instrument object) into a sequence of tokens
A time step is a list of tokens where:
(list index: token type)
0: Pitch
1: Velocity
2: Duration
(3: Position) to be recomputed with self.add_embedded_pos_enc
(4: Bar) to be recomputed with self.add_embedded_pos_enc
:param track: MIDI track to convert
:return: sequence of corresponding tokens
"""
# Make sure the notes are sorted first by their onset (start) times, second by pitch
# notes.sort(key=lambda x: (x.start, x.pitch)) # done in midi_to_tokens
ticks_per_sample = self.current_midi_metadata['time_division'] / max(self.beat_res.values())
ticks_per_bar = self.current_midi_metadata['time_division'] * 4
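# NOTE: ticks_per_bar above assumes a 4/4 time signature (this light version does not handle time signature changes)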
dur_bins = self.durations_ticks[self.current_midi_metadata['time_division']]
# Check bar embedding limit, update if needed
nb_bars = ceil(max(note.end for note in track.notes) / (self.current_midi_metadata['time_division'] * 4))
if self.max_bar_embedding < nb_bars:
self.vocab[4].add_event(f'Bar_{i}' for i in range(self.max_bar_embedding, nb_bars))
self.max_bar_embedding = nb_bars
tokens = []
current_tick = -1
current_bar = -1
current_pos = -1
for note in track.notes:
# Positions and bars
if note.start != current_tick:
pos_index = int((note.start % ticks_per_bar) / ticks_per_sample)
current_tick = note.start
current_bar = current_tick // ticks_per_bar
current_pos = pos_index
# Note attributes
duration = note.end - note.start
dur_index = np.argmin(np.abs(dur_bins - duration))
token_ts = [self.vocab[0].event_to_token[f'Pitch_{note.pitch}'],
self.vocab[1].event_to_token[f'Velocity_{note.velocity}'],
self.vocab[2].event_to_token[f'Duration_{".".join(map(str, self.durations[dur_index]))}'],
self.vocab[3].event_to_token[f'Position_{current_pos}'],
self.vocab[4].event_to_token[f'Bar_{current_bar}']]
tokens.append(token_ts)
return tokens
def tokens_to_events(self, tokens: List[int]) -> List[Event]:
""" Convert a sequence of tokens in their respective event objects
You can override this method if necessary
:param tokens: sequence of tokens to convert
:return: the sequence of corresponding events
"""
events = []
for i, token in enumerate(tokens):
name, val = self.vocab[i].token_to_event[token].split('_')
events.append(Event(name, None, val, None))
return events
def tokens_to_track(self, tokens: List[List[int]], time_division: Optional[int] = TIME_DIVISION,
program: Optional[Tuple[int, bool]] = (0, False)) -> Tuple[Instrument, List[TempoChange]]:
""" Converts a sequence of tokens into a track object
A time step is a list of tokens where:
(list index: token type)
0: Pitch
1: Velocity
2: Duration
3: Position
4: Bar
:param tokens: sequence of tokens to convert
:param time_division: MIDI time division / resolution, in ticks/beat (of the MIDI to create)
:param program: the MIDI program of the produced track and whether it is drums (default (0, False): piano, not drums)
:return: the miditoolkit instrument object and tempo changes
"""
assert time_division % max(self.beat_res.values()) == 0, \
f'Invalid time division, please give one divisible by {max(self.beat_res.values())}'
events = [self.tokens_to_events(time_step) for time_step in tokens]
ticks_per_sample = time_division // max(self.beat_res.values())
name = 'Drums' if program[1] else MIDI_INSTRUMENTS[program[0]]['name']
instrument = Instrument(program[0], is_drum=program[1], name=name)
for time_step in events:
if any(tok.value == 'None' for tok in time_step):
continue
# Note attributes
pitch = int(time_step[0].value)
vel = int(time_step[1].value)
duration = self._token_duration_to_ticks(time_step[2].value, time_division)
# Time and track values
current_pos = int(time_step[3].value)
current_bar = int(time_step[4].value)
current_tick = current_bar * time_division * 4 + current_pos * ticks_per_sample
# Append the created note
instrument.notes.append(Note(vel, pitch, current_tick, current_tick + duration))
return instrument, [TempoChange(TEMPO, 0)]
def _create_vocabulary(self, sos_eos_tokens: bool = False) -> List[Vocabulary]:
""" Creates the Vocabulary object of the tokenizer.
See the docstring of the Vocabulary class for more details about how to use it.
NOTE: token index 0 is often used as a padding index during training
:param sos_eos_tokens: will include Start Of Sequence (SOS) and End Of Sequence (tokens)
:return: the vocabulary object
"""
vocab = [Vocabulary({'PAD_None': 0}, mask=True) for _ in range(5)]
# PITCH
vocab[0].add_event(f'Pitch_{i}' for i in self.pitch_range)
# VELOCITY
vocab[1].add_event(f'Velocity_{i}' for i in self.velocities)
# DURATION
vocab[2].add_event(f'Duration_{".".join(map(str, duration))}' for duration in self.durations)
# POSITION
nb_positions = max(self.beat_res.values()) * 4 # 4/4 time signature
vocab[3].add_event(f'Position_{i}' for i in range(nb_positions))
# BAR
# vocab.add_event('Bar_None') # new bar token
vocab[4].add_event(f'Bar_{i}' for i in range(self.max_bar_embedding)) # bar embeddings (positional encoding)
return vocab
def _create_token_types_graph(self) -> Dict[str, List[str]]:
""" Returns a graph (as a dictionary) of the possible token
types successions.
Not relevant for Octuple.
:return: the token types transitions dictionary
"""
return {} # not relevant for this encoding
def token_types_errors(self, tokens: List[List[int]], **kwargs) -> Tuple[Union[float, Any]]:
""" Checks if a sequence of tokens is constituted of good token values and
returns the error ratio (lower is better).
The token types are always the same in Octuple so this methods only checks
if their values are correct:
- a bar token value cannot be < to the current bar (it would go back in time)
- same for positions
- a pitch token should not be present if the same pitch is already played at the current position
:param tokens: sequence of tokens to check
:return: the error ratio (lower is better)
"""
err_time = 0
err_note = 0
err_type = 0
current_bar = current_pos = -1
current_pitches = []
for token in tokens:
if all(token[i] == self.vocab[i]['PAD_None'] for i in range(len(token))):
break
if any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(token)):
err_type += 1
continue
bar_value = int(self.vocab[4].token_to_event[token[4]].split('_')[1])
pos_value = int(self.vocab[3].token_to_event[token[3]].split('_')[1])
pitch_value = int(self.vocab[0].token_to_event[token[0]].split('_')[1])
# Bar
if bar_value < current_bar:
err_time += 1
elif bar_value > current_bar:
current_bar = bar_value
current_pos = pos_value
current_pitches = []
# Position
elif pos_value < current_pos:
err_time += 1
elif pos_value > current_pos:
current_pos = pos_value
current_pitches = []
# Pitch
if pitch_value in current_pitches:
err_note += 1
else:
current_pitches.append(pitch_value)
return tuple(map(lambda x: x / len(tokens), (err_type, err_time, err_note, 0., 0.)))
def token_types_errors_training(self, x_tokens: List[List[int]], y_tokens: List[List[int]]) \
-> Tuple[Union[float, Any]]:
""" Checks if a sequence of tokens is constituted of good token types
successions and returns the error ratio (lower is better).
The Pitch and Position values are also analyzed:
- a position token cannot have a value <= to the current position (it would go back in time)
- a pitch token should not be present if the same pitch is already played at the current position
:param x_tokens: input tokens
:param y_tokens: produced token to check according to input
:return: the error ratio (lower is better)
"""
err_time = 0
err_note = 0
err_type = 0
current_bar = current_pos = -1
current_pitches = []
for x_tok, y_tok in zip(x_tokens, y_tokens):
if all(x_tok[i] == self.vocab[i]['PAD_None'] for i in range(len(x_tok))) or \
all(y_tok[i] == self.vocab[i]['PAD_None'] for i in range(len(y_tok))):
break
if any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(x_tok)) or \
any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(y_tok)):
err_type += 1
continue
x_bar = int(self.vocab[4].token_to_event[x_tok[4]].split('_')[1])
x_pos = int(self.vocab[3].token_to_event[x_tok[3]].split('_')[1])
x_pitch = int(self.vocab[0].token_to_event[x_tok[0]].split('_')[1])
y_bar = int(self.vocab[4].token_to_event[y_tok[4]].split('_')[1])
y_pos = int(self.vocab[3].token_to_event[y_tok[3]].split('_')[1])
y_pitch = int(self.vocab[0].token_to_event[y_tok[0]].split('_')[1])
# Reset current pitches if time has moved
if x_bar > current_bar or x_pos != current_pos:
current_pitches = []
current_bar, current_pos = x_bar, x_pos
current_pitches.append(x_pitch)
if y_bar < current_bar:
err_time += 1
# Position
elif y_pos < current_pos:
err_time += 1
elif y_bar == current_bar and y_pos == current_pos and y_pitch in current_pitches:
err_note += 1
return tuple(map(lambda err: err / len(x_tokens), (err_type, err_time, err_note)))
And multi-input / multi-output modules for PyTorch would look like this (it's an example):
class MultiEmbeddings(Module):
"""Multi-input module, taking several tokens as input, converting them to embeddings and
concatenate them to make a single 'merged' embedding
:param num_classes: number of classes for each token type
:param embedding_sizes: sizes of each embedding type
:param d_model: size of the final embedding, i.e. dimension of the transformer
:param padding_idx: padding index, must be the same for each token type
"""
def __init__(self, num_classes: List[int], embedding_sizes: List[int], d_model: int, padding_idx: int = 0):
assert len(num_classes) == len(embedding_sizes), \
f'The number of classes and embedding sizes must be the same ({len(num_classes)} and ' \
f'{len(embedding_sizes)} were given)'
super().__init__()
self.embedding_layers = ModuleList([Embedding(num_classes[i], embedding_sizes[i], padding_idx)
for i in range(len(num_classes))])
self.proj = Linear(sum(embedding_sizes), d_model)
def forward(self, x) -> Tensor:
"""
:param x: Tokens sequences, shape: (L, N, Z)
:return: Embeddings, as a tensor with a shape (L, N, E)
"""
embeds = []
for i, mod in enumerate(self.embedding_layers):
embeds.append(mod(x[:, :, i]))
x = cat(embeds, dim=-1) # (L, N, sum(embedding_sizes))
return self.proj(x) # (L, N, E)
class MultiOutput(Module):
"""Multi-output module.
:param num_classes: number of classes for each token type
:param d_model: size of the final embedding, i.e. dimension of the transformer
"""
def __init__(self, num_classes: List[int], d_model: int):
super().__init__()
self.output_layers = ModuleList([Linear(d_model, num) for num in num_classes])
def forward(self, x) -> List[Tensor]:
"""
:param x: Tokens sequences, shape: (L, N, E)
:return: List of tensors of shape (L, N, *)
"""
return [out(x) for out in self.output_layers] # (L, N, *)
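For example, with the multi-vocabulary tokenizer above, these modules could be created like this (the embedding sizes are arbitrary, just an illustration):
num_classes = [len(voc) for voc in tokenizer.vocab]  # one vocabulary size per token type
embedding_sizes = [128] * len(num_classes)  # arbitrary, could differ per token type
embedder = MultiEmbeddings(num_classes, embedding_sizes, d_model=256)
to_logits = MultiOutput(num_classes, d_model=256)
# x: (L, N, Z) LongTensor of token ids -> embedder(x): (L, N, 256) -> to_logits(embedder(x)): list of (L, N, C_i)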
Still getting errors with the new version of octuple.py (I'll try to include most of them).
1:
MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
14 out = rand_seq[0].cpu().numpy().tolist()
15
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
17 converted_back_midi.dump('output.mid')
18
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
254 pitch = int(events[0].value)
255 vel = int(events[1].value)
--> 256 duration = self._token_duration_to_ticks(events[2].value, time_division)
257
258 # Time and track values
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _token_duration_to_ticks(token_duration, time_division)
364 :return: the duration / time-shift in ticks
365 """
--> 366 beat, pos, res = map(int, token_duration.split('.'))
367 return (beat * res + pos) * time_division // res
368
ValueError: not enough values to unpack (expected 3, got 1)
2:
MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
14 out = rand_seq[0].cpu().numpy().tolist()
15
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
17 converted_back_midi.dump('output.mid')
18
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
254 pitch = int(events[0].value)
255 vel = int(events[1].value)
--> 256 duration = self._token_duration_to_ticks(events[2].value, time_division)
257
258 # Time and track values
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _token_duration_to_ticks(token_duration, time_division)
364 :return: the duration / time-shift in ticks
365 """
--> 366 beat, pos, res = map(int, token_duration.split('.'))
367 return (beat * res + pos) * time_division // res
368
ValueError: invalid literal for int() with base 10: '4/4'
3:
MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
14 out = rand_seq[0].cpu().numpy().tolist()
15
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
17 converted_back_midi.dump('output.mid')
18
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
259 program = int(events[3].value)
260 current_pos = int(events[4].value)
--> 261 current_bar = int(events[5].value)
262 current_tick = current_time_sig_tick + (current_bar - current_time_sig_bar) * ticks_per_bar \
263 + current_pos * ticks_per_sample
ValueError: invalid literal for int() with base 10: '0.5.8'
4:
MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
14 out = rand_seq[0].cpu().numpy().tolist()
15
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
17 converted_back_midi.dump('output.mid')
18
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
253 # Note attributes
254 pitch = int(events[0].value)
--> 255 vel = int(events[1].value)
256 duration = self._token_duration_to_ticks(events[2].value, time_division)
257
ValueError: invalid literal for int() with base 10: '1.0.8'
When executed several times it worked (the generation was only 3 dissonant notes played at the same time), but afterwards it didn't work again.
PS: I'll try the vocabulary stuff later on :)
👋
These errors happen when tokens_to_midi() tries to decode tokens of an unexpected type, e.g. a Duration where a Pitch is expected, hence int() crashes.
I guess checking each token type at decoding could solve the issue, but I think the real and sustainable solution is to switch to multiple vocabularies. This way, the decoding module of a model would not "mess" with unexpected token types.
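For illustration, such a check could look roughly like this (a sketch, not the actual MidiTok implementation; decode_if_type is a hypothetical helper):
def decode_if_type(vocab, token, expected_type):
    # Return the value of a token only if its event type matches the expected one, else None
    name, val = vocab.token_to_event[token].split('_')
    return val if name == expected_type else None

# inside the decoding loop, e.g.:
# pitch_val = decode_if_type(tokenizer.vocab, token, 'Pitch')
# if pitch_val is None:
#     continue  # unexpected token type, skip this time step instead of crashing on int()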
I finally implemented it, in the multi-embed-vocabs branch.
Encoding / decoding works just as before, but I did not try it in real conditions, on generated token sequences that might contain errors / produce crashes. Do you think you can test it before I release the update ?
You can just copy/paste this file, but in that case you would also need to rewrite the tokens_to_events() method as here.
Or else I'll test it in the following days.
Oh and vocabulary size can be retrieved with [len(vocab) for vocab in tokenizer.vocab] to build input / output modules of appropriate sizes.
Let me try this week; hopefully within a few days I can give you the results.
When changing to the multi-vocabulary branch and trying to generate again in the same circumstances, only this error happens:
MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_340299/761086941.py in <module>
14 out = rand_seq[0].cpu().numpy().tolist()
15
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
17 converted_back_midi.dump('output.mid')
18
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
224 midi = MidiFile(ticks_per_beat=time_division)
225 ticks_per_sample = time_division // max(self.beat_res.values())
--> 226 events = self.tokens_to_events(tokens, multi_voc=True)
227
228 tempo_changes = [TempoChange(TEMPO, 0)]
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in tokens_to_events(self, tokens, multi_voc)
175 multi_event = []
176 for i, token in enumerate(multi_token):
--> 177 name, val = self.vocab[i].token_to_event[token].split('_')
178 multi_event.append(Event(name, None, val, None))
179 events.append(multi_event)
KeyError: 67
The KeyError can show different numbers.
Even so, it seems that I won't have to compute my vocabulary size by hand anymore, and will be able to correctly build models with the new multi-vocabulary approach :smile:
That's great to hear ! So let's chase that bug, can you share a json / csv of the token sequence that causes the error ? Edit: and also the tokenizer config/params ? Here it seems that the token 67 is not in the vocab, but I would need to debug this more deeply. (Do you think it can be an error coming from your model / output module ?)
The config/params of the tokenizer are:
pitch_range = range(21, 109)
beat_res = {(0, 4): 8}
nb_velocities = 32
additional_tokens = {'Chord': False, 'Rest': False, 'Tempo': True, 'Program': True, 'TimeSignature': True,
'nb_tempos': 32, # nb of tempo bins
'tempo_range': (40, 250), # (min, max)
'time_signature_range': (8, 2)} # (max_beat_res, max_bar_length_in_NOTE)
# Creates the tokenizer and list the file paths
tokenizer = Octuple(pitch_range, beat_res, nb_velocities, additional_tokens) # Octuple encoding
When you refer to the token sequence, do you mean the dataset? Or do you mean the 'events' list in the tokens_to_midi function in octuple.py?
I don't really know whether my model or output code is wrong; just in case, here are some snippets from the Jupyter Notebook:
print('MidiTok Model Trainer')
config = GPTConfig(VOCAB_SIZE,
max_seq,
dim_feedforward=dim_feedforward,
n_layer=4,
n_head=8,
n_embd=256,
enable_rpr=True,
er_len=max_seq)
model = GPT(config).to(get_device())
model.eval()
rand_seq = model.generate(torch.Tensor([1]), target_seq_length=512)
out = rand_seq[0].cpu().numpy().tolist()
converted_back_midi = tokenizer.tokens_to_midi([out], None)
converted_back_midi.dump('output.mid')
print('Done!')
VOCAB_SIZE is 443 in this execution, max_seq is 1024, and dim_feedforward is 512.
PS: Should the vocab size change between different executions on the same dataset? Because I tokenized the dataset another time and it gave me VOCAB_SIZE = 580.
Thank you !
By tokens I am referring to a token sequence produced by the model (a list of lists of integers in the case of Octuple).
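For instance, you could simply dump the generated sequence to a json file and attach it (plain json here, nothing MidiTok-specific; out is the generated list of lists of integers):
import json

with open('generated_tokens.json', 'w') as f:
    json.dump(out, f)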
I looked at the GPT2Model from Hugging Face, and the problem (for us here) is that it automatically comes with an Embedding layer, so it can't be used with multi embeddings.
But if you are using PyTorch, the Transformer module is almost exactly the same. Here is how to create the model, with multi input / output modules for Octuple (I did not test it as-is, I just wrote this from code blocks I had):
from typing import Optional, List
from math import log
import torch
from torch.nn import Module, Linear, Embedding, ModuleList, Transformer, Dropout
from torch.nn.init import xavier_uniform_
from torch import Tensor, cat, no_grad, triu, ones, stack
class MyTransformer(Module):
def __init__(self, num_layers: int, num_classes: List[int], d_model: int, nhead: int,
dim_feedforward: int, max_seq_len: int, embedding_sizes: List[int] = None,
dropout: float = 0.1, layer_norm_eps: float = 1e-5, device: torch.device = torch.device('cpu'),
padding_token: int = 0):
super().__init__()
head_dim, rest = divmod(d_model, nhead)
assert rest == 0, f'Non valid combination of model dimension ({d_model}) and number of heads ({nhead})'
self.device = device
# POSITIONAL ENCODING
self.pos_enc = AbsolutePositionalEncoding(d_model, max_seq_len)
# Input module
self.embedder = MultiEmbeddings(num_classes, embedding_sizes, d_model, padding_token)
# Transformer
self.transformer = Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_layers,
num_decoder_layers=0, dim_feedforward=dim_feedforward, dropout=dropout,
layer_norm_eps=layer_norm_eps, device=self.device)
# Output module
self.to_logits = MultiOutput(num_classes, d_model)
# INITIALIZATION
for p in self.parameters():
if p.dim() > 1:
xavier_uniform_(p)
self.to(self.device)
def forward(self, tgt: Tensor, attn_mask: Optional[Tensor] = None, key_pad_mask: Optional[Tensor] = None,
causal: bool = False):
"""
:param tgt:
:param attn_mask:
:param key_pad_mask:
:param causal: causal attention, will quickly compute attention with causality
:return:
"""
if attn_mask is None and causal:
tgt_len = tgt.shape[0]
attn_mask = triu(ones(tgt_len, tgt_len) * float('-inf'), diagonal=1).to(self.device) # CAUSAL MASK
tgt = self.embedder(tgt) # (T,N) -> (T,N,E)
tgt = self.pos_enc(tgt)
tgt = self.transformer.encoder(tgt, mask=attn_mask, src_key_padding_mask=key_pad_mask)  # encoder-only (num_decoder_layers=0)
tgt = self.to_logits(tgt) # (T,N,E) -> list of (T,N,C), C is variable and depends on vocab sizes
return tgt
@no_grad()
def predict(self, x: Tensor, inference_lim: int, max_seq_len: int, top_k: int) -> Tensor:
""" Prediction function for inference
:param x: input tensor (N,T,Z) if multi-input embedding
:param inference_lim: number of inferences
:param max_seq_len: maximum sequence length being process (attention context size)
:param top_k: top k sampling value
:return: the predicted sequence
"""
x = x.transpose(1, 0).to(self.device) # (N,T,) --> (T,N,)
try:
for _ in range(inference_lim):
# Adds the prediction to the target sequence, updates the time values
y = self.forward(x[-max_seq_len:], causal=True) # list of Z (T,N,C*)
y = stack([top_k_sampling(type_[-1], top_k) for type_ in y]).t() # (N,Z)
x = cat([x, y.unsqueeze(0)]) # (T+1,N,Z)
except KeyError: # bar embedding to high
pass
return x.transpose(1, 0) # (N,T,)
class MultiEmbeddings(Module):
"""Multi-input module, taking several tokens as input, converting them to embeddings and
concatenate them to make a single 'merged' embedding
:param num_classes: number of classes for each token type
:param embedding_sizes: sizes of each embedding type
:param d_model: size of the final embedding, i.e. dimension of the transformer
:param padding_idx: padding index, must be the same for each token type
"""
def __init__(self, num_classes: List[int], embedding_sizes: List[int], d_model: int, padding_idx: int = 0):
assert len(num_classes) == len(embedding_sizes), \
f'The number of classes and embedding sizes must be the same ({len(num_classes)} and ' \
f'{len(embedding_sizes)} were given)'
super().__init__()
self.embedding_layers = ModuleList([Embedding(num_classes[i], embedding_sizes[i], padding_idx)
for i in range(len(num_classes))])
self.proj = Linear(sum(embedding_sizes), d_model)
def forward(self, x) -> Tensor:
"""
:param x: Tokens sequences, shape: (L, N, Z)
:return: Embeddings, as a tensor with a shape (L, N, E)
"""
embeds = []
for i, mod in enumerate(self.embedding_layers):
embeds.append(mod(x[:, :, i]))
x = cat(embeds, dim=-1) # (L, N, sum(embedding_sizes))
return self.proj(x) # (L, N, E)
class MultiOutput(Module):
"""Multi-output module.
:param num_classes: number of classes for each token type
:param d_model: size of the final embedding, i.e. dimension of the transformer
"""
def __init__(self, num_classes: List[int], d_model: int):
super().__init__()
self.output_layers = ModuleList([Linear(d_model, num) for num in num_classes])
def forward(self, x) -> List[Tensor]:
"""
:param x: Tokens sequences, shape: (L, N, E)
:return: List of tensors of shape (L, N, *)
"""
return [out(x) for out in self.output_layers] # (L, N, *)
class AbsolutePositionalEncoding(Module):
""" Module injecting positional information in the embeddings of a sequence.
To be used at the beginning of a transformer network, before the first layers.
:param d_model: embedding size
:param max_len: max length of the sequences that will be treated
:param dropout: dropout value
"""
def __init__(self, d_model: int, max_len: int, dropout: float = 0.1):
super().__init__()
self.dropout = Dropout(p=dropout)
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0).transpose(0, 1)
self.register_buffer('pe', pe)
def forward(self, x: Tensor) -> Tensor:
""" Adds positional encoding to a sequence
:param x: input tensor, shape (sequence length, batch size, embedding size)
:return the tensor with positional encoding
"""
x = x + self.pe[:x.size()[0], :].to(x.device, dtype=x.dtype)
return self.dropout(x)
def top_k_sampling(x: Tensor, k: int, temperature: int = None) -> Tensor:
"""Top K sampling
:param x: input tensor of shape (N,C) or (T,N,C)
:param k: k factor
:param temperature: temperature for softmax
:return: sampling results as (N)
"""
x_copy = x.clone() / temperature if temperature is not None else x.clone()
indices_to_inf = x < torch.topk(x, k)[0][..., -1, None]
x_copy[indices_to_inf] = float('-inf')
if x.dim() == 2: # (N,C)
return torch.multinomial(torch.softmax(x_copy, -1), 1).squeeze(-1)
elif x.dim() == 3: # (T,N,C)
return stack([torch.multinomial(torch.softmax(xi, -1), 1).squeeze(-1) for xi in x_copy])
And to create the model:
embedding_sizes = [256, 128, 128, 192, 64, 128, 128, 64]  # I just put random numbers, the choice is up to you
model = MyTransformer(num_layers=4, num_classes=[len(voc) for voc in tokenizer.vocab], d_model=256, nhead=8, dim_feedforward=1024, max_seq_len=1024, embedding_sizes=embedding_sizes)
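For instance, a quick shape check could look like this (dummy data, just to illustrate the expected input / output; here Z = 8 token types, matching the 8 embedding sizes above):
import torch

x = torch.zeros(1024, 2, 8, dtype=torch.long)  # (T, N, Z), dummy sequence filled with PAD tokens
logits = model(x, causal=True)  # list of 8 tensors, each of shape (T, N, C_i)
print([l.shape for l in logits])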
For your last question, by 443 and 580 do you mean the sum of the vocabulary sizes ? And yes the size can change between different datasets: as the durations of the files would be different, the length of the Bar vocab would also be different.
This is the list of output tokens generated before converting to MIDI:
[1, 25, 21, 5, 10, 54, 29, 8, 2, 29, 29, 9, 10, 22, 24, 1, 2, 30, 1, 9, 10, 30, 22, 1, 2, 30, 1, 9, 10, 29, 22, 1, 2, 31, 1, 9, 10, 38, 26, 1, 2, 29, 1, 9, 10, 50, 26, 1, 2, 32, 1, 9, 10, 18, 24, 3, 2, 32, 1, 9, 10, 34, 22, 1, 2, 32, 1, 9, 10, 58, 26, 2, 2, 32, 1, 9, 10, 39, 26, 2, 2, 4, 1, 9, 10, 46, 23, 1, 2, 8, 1, 9, 10, 34, 25, 9, 2, 8, 1, 9, 10, 41, 22, 6, 2, 8, 13, 9, 10, 51, 23, 4, 2, 12, 13, 9, 10, 42, 23, 2, 2, 14, 13, 9, 10, 26, 20, 4, 2, 14, 13, 9, 10, 41, 22, 3, 2, 14, 13, 9, 10, 43, 24, 2, 2, 18, 13, 9, 10, 19, 24, 3, 2, 18, 14, 9, 10, 20, 20, 2, 2, 18, 14, 9, 10, 41, 22, 4, 2, 18, 14, 9, 10, 41, 21, 2, 2, 26, 14, 9, 10, 46, 21, 3, 2, 26, 14, 9, 10, 65, 27, 4, 2, 28, 14, 9, 10, 29, 25, 4, 2, 28, 14, 9, 10, 39, 21, 32, 2, 30, 14, 9, 10, 46, 27, 2, 2, 30, 14, 9, 10, 29, 21, 9, 2, 2, 14, 9, 10, 41, 25, 10, 2, 2, 14, 9, 10, 41, 19, 8, 2, 6, 14, 1, 10, 43, 23, 5, 2, 6, 14, 9, 10, 44, 19, 1, 2, 6, 14, 9, 10, 50, 29, 5, 2, 2, 14, 9, 10, 41, 25, 5, 2, 6, 14, 9, 10, 46, 25, 3, 2, 6, 14, 9, 10, 22, 27, 4, 2, 12, 14, 9, 10, 48, 27, 4, 2, 14, 14, 9, 10, 26, 20, 4, 2, 14, 14, 9, 10, 38, 22, 7, 2, 14, 14, 9, 10, 37, 19, 7, 2, 18, 14, 3, 10, 38, 21, 25, 2, 18, 14, 9, 10, 38, 26, 6, 2, 20, 14, 9, 10, 42, 24, 6, 2, 22, 14, 9, 10, 49, 21, 6, 2, 26, 14, 9, 10, 51, 22, 6, 2, 28, 14, 9, 10, 29, 22, 6, 2, 30, 14, 9, 10, 31, 22, 7, 2, 2, 14, 9, 10, 41, 24, 4, 2, 6, 14, 9, 10, 47, 25, 4, 2, 6, 14, 9, 10, 40, 26, 2, 2, 10, 14, 9, 10, 48, 26, 6, 2, 14, 14, 9, 10, 62, 26, 3, 2, 14, 14, 9, 10, 41, 26, 5, 2, 14, 14, 9, 10, 41, 26, 4, 2, 14, 14, 9, 10, 48, 28, 4, 2, 14, 14, 9, 10, 43, 22, 9, 2, 14, 14, 9, 10, 48, 30, 9, 2, 18, 14, 9, 10, 57, 28, 9, 2, 18, 14, 9, 10, 61, 26, 1, 2, 22, 14, 9, 10, 57, 28, 3, 2, 26, 49, 9, 10, 38, 27, 3, 2, 30, 14, 9, 10, 43, 29, 4, 2, 2, 14, 9, 10, 44, 26, 3, 2, 2, 14, 13, 10, 46, 28, 9]
The GPT2 architecture I am using is GPT2RGA (the one used in your colab-notebooks directory), not the one in Hugging Face. Anyway, it has only one Embedding layer as well. I guess that multi-embedding is necessary in order not to flatten all the data, and for Octuple to make sense as an eight-token pack, right? (my input tensors are shaped (4, 1024), and without flattening they are shaped (4, 1024, 8))
Yeah, for my last question I meant the sum. What happens is that for the same dataset I can get two different overall sizes in different executions. However, I don't think this is leading to any of the problems described above.
Ok, my bad. Indeed these models are not appropriate for Octuple, or even CP Word, MuMIDI or any "multi-embedding" representation. These representations work by creating embeddings for every token individually (8 here, and of different sizes if you want), then aggregating them (Pitch, Velocity ...) with a "pooling" operation. This operation can be a concatenation, mean, sum or anything else that aggregates the embeddings. In MusicBERT and in the MultiEmbeddings module above, the embeddings are concatenated into one big embedding which is then passed through a linear layer to obtain the final embedding (of the size of the model, d_model).
And the output must also decode tokens of every types individually, with different linear layers.
Then during training, one would compute several losses (one for every token type), which you can sum or mean before computing the gradients.
The MusicBERT paper has a good figure which shows how it works; the CP Word paper too. Don't hesitate if you have any questions or troubles !
If I have time I'll test the new Octuple tokenizer in the next few days
I get what you say... I might try this kind of architecture, but I don't know how much time it will take me :sweat_smile:
Anyway, do you think that the last KeyError was caused by this?
I'll be looking forward to the test!
Yes I am 99% confident this error was caused by the "flattening".
If this takes you too much time, maybe you could just switch to a 1D representation (REMI, Structured etc). I currently have things running; when it's done I'll try this version of Octuple.
If it can help you, here is how to compute the different losses:
# x is the input sequence, of shape (N,T,Z), T is sequence length, N batch size, Z the different token types
# y are the output logits, is a list of Z tensors of shape (T,N,C*) where C is the vocabulary size, and will vary depending on the token type (pitch, velocity etc...)
losses = []
for j in range(len(tokenizer.vocab)):
losses.append(criterion(y[j].permute(1, 2, 0), x[..., j])) # shapes (N,C,T) and (N,T), see Pytorch cross-entropy for details
loss = sum(losses) # here we sum, but we could also have mean for instance
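Then the usual optimization step on the aggregated loss, for instance (standard PyTorch, optimizer being whatever optimizer you use):
optimizer.zero_grad()
loss.backward()  # gradients flow through every output head and the shared model
optimizer.step()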
You were right. I used your transformer class and Octuple with some changes, and it seems to work now. The only problem is that I am struggling with the predict function: how would you use it to generate a sequence without a primer melody?
PS: Should I open another issue for this? Has this one drifted too far?
That's great to hear ! 😃
I guess without a primer melody, you could either 1) use a SOS (Start Of Sequence) token, or 2) give only the first note.
The model will then pick the first note based only on the SOS token, meaning that after being trained it will probably always produce the same probability distribution $p_i \left( \mathbf{x}_1 \mid x_0 \right) \in \mathbb{R}^{C_i}$, where $x_0$ is the SOS token, $C_i$ the number of classes / vocab size, for each $i$ representing the token types (pitch etc.) / vocabularies.
I guess this strategy could work if this likelihood has a high entropy so that the first note is not always the same. Otherwise adding some dropout or a higher softmax temperature for this first step might help.
You could maybe select the first note with some random (weighted) procedure.
Autoregressive generation from scratch can be tricky because of this.
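For instance, a rough sketch with the predict() method from the code above, assuming the tokenizer was created with sos_eos_tokens=True so that each vocabulary has a SOS_None token (untested, names as in the code above):
import torch

# a single time step made of the SOS token of each vocabulary, shape (N=1, T=1, Z)
sos_step = torch.tensor([[[voc['SOS_None'] for voc in tokenizer.vocab]]])
generated = model.predict(sos_step, inference_lim=512, max_seq_len=1024, top_k=15)  # (1, T, Z)
out = generated[0].tolist()  # list of time steps, ready to be decoded back to MIDI
A higher softmax temperature for that first step could be obtained by passing the temperature argument of top_k_sampling inside predict().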
Please tell me if you encounter bugs, if (I hope) it's not the case I will probably release this new version in the next few days.
I get what you say. Maybe that's deeper than I can build, but I will try my best these days :smile:
I'll report everything I find :+1:
Hi again!
Trying some stuff, I got this error when converting tokens to MIDI. It looks like "midi_tokenizer_base.py" is not able to process the "None" value of the Time Signature token that is passed through "tokens_to_midi" in "octuple.py":
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_5271/2731609472.py in <module>
----> 1 converted_back_midi = tokenizer.tokens_to_midi(out[:, 1:].tolist()[0])
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
274 # Time Signature, adds a TimeSignatureChange if necessary
275 if self.additional_tokens['TimeSignature'] and time_step[-1].value == 'None':
--> 276 time_sig = self._parse_token_time_signature(time_step[-1].value)
277 if time_sig != (time_sig_changes[-1].numerator, time_sig_changes[-1].denominator):
278 current_time_sig_tick += (current_bar - current_time_sig_bar) * ticks_per_bar
~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
461 :return: the numerator and denominator of a time signature
462 """
--> 463 numerator, denominator = map(int, token_time_sig.split('/'))
464 return numerator, denominator
465
ValueError: invalid literal for int() with base 10: 'None'
Hi :)
I just corrected it in 960cbfa8eac1750aec1fb95d623e3ab2a51370f1 (a really stupid bug ahah). Hoping there are no other bugs; if you find any, please tell me. And if after testing with generated tokens you don't encounter any, please also report back so that I can release this in the next version. :)
Hi Env,
After a few tests I did not run into bugs, so I released the update in v1.2.0 ! If you get new bugs / crashes please re-open this issue or create a new one ! :)
BTW Octuple is pretty "demanding" in computing resources, meaning the multi input / output requires a relatively high number of model parameters (and therefore GPU memory). The original authors used 8 V100s (32GB VRAM), which is quite a lot. My results with one V100 weren't very good either, the model often producing errors like predicting Bars / Positions already passed (going backward in time). For smaller hardware / model sizes, representations like REMI / Structured are more suitable.
Amazing!
I will try to train for many epochs on an Amazon GPU, so when I have results I can tell you. The new version seems to work well!
Hi there @Natooz, hope you are doing well!
I've been doing several tests and my results weren't too good either. What you said is true, it uses a lot of GPU to train. It took ~5 days to train for 10 epochs (batch_size = 24) with one RTX 3060 (12GB VRAM). The results were 3-minute generations with some structure and harmony, but not much. Also, the generations have a lot of silent parts, meaning that within 3 minutes there can be 3 or 4 kinds of structures with different melodies, which can be quite disconnected from one another.
I also realized that the Transformer code you sent me (which I still appreciate a lot!) has only encoder layers (like BERT). That made me think about two things; one was trying encoder_layers = 0, but it didn't work, because in PyTorch encoder and decoder layers work differently: inside the decoder's forward function there is a loop iterating over the layers, which is easy to skip if none are passed, unlike the encoder part that directly asks for the first layer, which throws an IndexError. Maybe by doing what I mentioned above, tuning the hyperparameters a little bit (embedding layers, etc.), and training it on AWS (which is my next target), the performance could improve. Let's see what happens :smile:
Thank you in advance!
Hi @envilk, thanks for reaching back with your results !
IMHO this kind of compound representation is not ideal, as in symbolic music embeddings do not carry much information. Whereas in natural language the embedding of a word or subword can represent a large amount of information (e.g. the embedding of "dog" tells what it looks like, what it does, etc.), the embedding of a single pitch value doesn't contain much meaning by itself. Hence I don't think it makes a lot of sense to mix embeddings of symbolic music tokens.
Now to address the sequence length and memory complexity issues, one can simply use linear transformers, which have been proved to work well for music, and also use byte pair encoding (BPE). For the latter, we should remember that Transformers are usually used with large vocabularies for text (50k+), so I think that a well-built BPE vocabulary for symbolic music should work, and at the same time build embeddings carrying more information by themselves. I ran some experiments on that which I have yet to analyze; the tokenizer is working, and depending on the results I will probably release it.
About the Transformer code, using only encoders to generate music is the right thing, but I understand the confusion. If we stick to the definition from the original Transformer paper, an encoder layer only applies self-attention while a decoder layer also applies encoder-decoder attention (also called cross-attention). In fact GPT is made of encoders (according to the definition above), but the authors called it a "decoder" model as it is used as a decoder. It brings confusion, whereas GPT2, BERT etc. are actually almost all the same, just piles of encoder layers. See for example the code of GPT2 from Hugging Face (add_cross_attention is actually not used, there is only self-attention). The decoder layer is intended to receive an input sequence and hidden states from the encoder. In practice it is mostly used in seq2seq architectures, typically for question answering (QA), neural machine translation (NMT) or text summarization tasks. Cross-attention can also be used for multimodal applications, as in Perceiver.
The PyTorch Transformer implementation follows the original definition of encoder and decoder, so using only encoder layers will give you something like GPT or BERT.
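For instance, a GPT-like (causal, self-attention only) model in plain PyTorch is just a TransformerEncoder with a causal mask (minimal sketch):
import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer

layer = TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024)
encoder = TransformerEncoder(layer, num_layers=4)

seq_len, batch = 16, 2
causal_mask = torch.triu(torch.ones(seq_len, seq_len) * float('-inf'), diagonal=1)
x = torch.rand(seq_len, batch, 256)  # (T, N, E) embeddings
out = encoder(x, mask=causal_mask)   # each position only attends to the past, GPT-style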
Hoping this helps ! 😃
First of all, using the framework has been very useful already!
I am getting two kinds of errors and don't know why. I use the GPT2 architecture (from the repository's example notebook), successfully trained, and MidiTok 1.1.9.
Code structure
Encoding, preprocessing, output tensor shapes from the DataLoader, and generating from scratch (code attachments not reproduced here).
Errors
When the generating part is executed, two kinds of errors can show up (the corresponding tracebacks are not reproduced here).
The ValueError: invalid literal for int() with base 10: '3.6.8' one can show any 'x.x.x' literal; it can change in every execution. Thanks in advance!
PS: Sorry if I made it too long, just wanted to be clear on each point :).