Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶
https://miditok.readthedocs.io/
MIT License

ValueError: invalid literal for int() with base 10: '3.6.8' OR ValueError: not enough values to unpack (expected 2, got 1) #13

Closed envilk closed 2 years ago

envilk commented 2 years ago

First of all, using the framework has been very useful already!

I am getting two kinds of errors and don't know why. I am using the GPT2 architecture (from the repository's example notebook), successfully trained, with MidiTok 1.1.9.

Code structure

Encoding:

pitch_range = range(21, 109)
beat_res = {(0, 4): 8}
nb_velocities = 32
additional_tokens = {'Chord': False, 'Rest': False, 'Tempo': True, 'Program': True, 'TimeSignature': True,
                     'nb_tempos': 32,
                     'tempo_range': (40, 250),
                     'time_signature_range': (8, 2)}
tokenizer = Octuple(pitch_range, beat_res, nb_velocities, additional_tokens)

Preprocessing:

# Converts MIDI files to tokens saved as JSON files
tokenizer.tokenize_midi_dataset(paths, relative_path_to_json, midi_valid)

json_paths = list(path.Path(relative_path_to_json).glob('*.json'))
entire_pop909_json_with_bools = []

for json_file in json_paths:
    with open(json_file) as f:
        data = json.load(f)
        entire_pop909_json_with_bools.extend(data) # where elements are found in the list of lists

entire_pop909_json_list = []
# just take song tokens, not boolean track signs
for slot in entire_pop909_json_with_bools:
    if False not in slot[0]: # TAKE CARE: just for Pop909 dataset
        entire_pop909_json_list.append(slot)

flatten_different_songs = [item for sublist in entire_pop909_json_list for item in sublist]
# just trying to make the token units fit the [4, 1024] shape; otherwise it would be [4, 1024, 8]
flatten_time_steps = [item for sublist in flatten_different_songs for item in sublist]

train_data = []
train_data.extend(flatten_time_steps)

Output tensors shape from DataLoader:

Train loader
X shape: torch.Size([4, 1024])
Target shape: torch.Size([4, 1024])

Generating from scratch:

rand_seq = model.generate(torch.Tensor([1]), target_seq_length=512)
out = rand_seq[0].cpu().numpy().tolist()

converted_back_midi = tokenizer.tokens_to_midi([out], None)
converted_back_midi.dump('output.mid')

Errors

When the generation part is executed, one of two errors can appear. This one:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5234/3425966451.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('4_model_1_OUTPUT(256).mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    230 
    231         if self.additional_tokens['TimeSignature']:
--> 232             time_sig = self._parse_token_time_signature(self.tokens_to_events(tokens[0])[-1].value)
    233         else:  # default
    234             time_sig = TIME_SIGNATURE

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
    447         :return: the numerator and denominator of a time signature
    448         """
--> 449         numerator, denominator = map(int, token_time_sig.split('/'))
    450         return numerator, denominator
    451 

ValueError: invalid literal for int() with base 10: '3.6.8'

Or this one:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5234/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    230 
    231         if self.additional_tokens['TimeSignature']:
--> 232             time_sig = self._parse_token_time_signature(self.tokens_to_events(tokens[0])[-1].value)
    233         else:  # default
    234             time_sig = TIME_SIGNATURE

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
    447         :return: the numerator and denominator of a time signature
    448         """
--> 449         numerator, denominator = map(int, token_time_sig.split('/'))
    450         return numerator, denominator
    451 

ValueError: not enough values to unpack (expected 2, got 1)

In the ValueError: invalid literal for int() with base 10: '3.6.8' case, the literal can be any 'x.x.x' value; it changes on every execution.

Thanks in advance!

PS: Sorry if I made it too long, just wanted to be clear on each point :).

Natooz commented 2 years ago

Hi @envilk, thanks for your comment and for this bug report ! I'll look into it in the next few days to fix it.

My guess is that the decoded token is not of type TimeSignature (3.6.8 looks like a Duration token). A check might solve it.
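
For illustration, a minimal sketch of such a check inside tokens_to_midi(), assuming the time signature events in the vocabulary use the 'TimeSig' prefix (a sketch, not necessarily the exact fix that will be released):

# sketch: only parse the value if the decoded token really is a time signature
last_type, last_value = self.vocab.token_to_event[tokens[0][-1]].split('_')
if self.additional_tokens['TimeSignature'] and last_type == 'TimeSig':
    time_sig = self._parse_token_time_signature(last_value)
else:  # default
    time_sig = TIME_SIGNATURE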

Also, for the Octuple, CP Word and MuMIDI tokenizations, I will soon release an update so that each tokenizer has several vocabularies, one per token type. This makes it easier to create Embedding layers of appropriate sizes, and lets a model return several sequences of logits of the associated sizes.

Nathan

envilk commented 2 years ago

Thank you for your fast reply!

I'll be looking forward to the next update :)

I was also struggling with the vocabulary recently, because I didn't know how to calculate its overall size (maybe that's simply because I'm missing something). What I'm doing now is taking the max integer in the token list and adding one, and that's my vocabulary size.
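
For reference, that workaround is essentially the following (a rough user-side estimate, not MidiTok's API):

VOCAB_SIZE = max(flatten_time_steps) + 1  # largest token id seen in the data, plus one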

Natooz commented 2 years ago

👋

I just released v1.1.11, which should solve the crash. By default it now checks that a token is of the expected type before decoding its value: d930de5f34782d2afd04934035448a9e758e774b

Concerning the vocabulary size, you can get it simply with len(tokenizer.vocab). But for Octuple it is a bit tricky, as there is currently only one vocabulary object for every token type. I'll comment just below some code that implements Octuple with several Vocabulary objects, which allows you to create several torch.nn.Embedding (or TF / JAX equivalent) for the model input and torch.nn.Linear for the output, all of different sizes. It's however not multitrack and doesn't handle TimeSignature or Tempo, but these could easily be added with the code from Octuple.

Natooz commented 2 years ago

Multi-vocabulary (and lighter) version of Octuple:

from typing import List, Tuple, Dict, Union, Optional, Any
from pathlib import Path, PurePath
import json
from math import ceil

from miditok import MIDITokenizer, Vocabulary, Event
from miditok.constants import MIDI_INSTRUMENTS
from miditoolkit import Instrument, Note, TempoChange
import numpy as np

from constants import PITCH_RANGE, NB_VELOCITIES, ADDITIONAL_TOKENS, BEAT_RES, TIME_DIVISION, TEMPO

class BarPosDurationAllMerged(MIDITokenizer):
    """ Modified version of Octuple with no Program (Track) tokens
    To use mainly for tasks handling a single track.

    :param pitch_range: range of used MIDI pitches
    :param beat_res: beat resolutions, with the form:
            {(beat_x1, beat_x2): beat_res_1, (beat_x2, beat_x3): beat_res_2, ...}
            The keys of the dict are tuples indicating a range of beats, ex 0 to 3 for the first bar
            The values are the resolution, in samples per beat, of the given range, ex 8
    :param nb_velocities: number of velocity bins
    :param additional_tokens: specifies additional tokens (time signature, tempo)
    :param sos_eos_tokens: adds Start Of Sequence (SOS) and End Of Sequence (EOS) tokens to the vocabulary
    :param mask: will add a MASK token to the vocabulary (default: True)
    :param params: can be a path to the parameter (json encoded) file or a dictionary
    """

    def __init__(self, pitch_range: range = PITCH_RANGE, beat_res: Dict[Tuple[int, int], int] = BEAT_RES,
                 nb_velocities: int = NB_VELOCITIES, additional_tokens: Dict[str, bool] = ADDITIONAL_TOKENS,
                 sos_eos_tokens: bool = False, mask: bool = True, params=None):
        additional_tokens['Chord'] = False  # Incompatible additional token
        additional_tokens['Rest'] = False
        additional_tokens['Tempo'] = False
        additional_tokens['Program'] = False
        # used in place of positional encoding
        self.max_bar_embedding = 60  # this attribute might increase during encoding
        super().__init__(pitch_range, beat_res, nb_velocities, additional_tokens, sos_eos_tokens, mask, params)

    def save_params(self, out_dir: Union[str, Path, PurePath]):
        """ Override the parent class method to include additional parameter drum pitch range
        Saves the base parameters of this encoding in a txt file
        Useful to keep track of how a dataset has been tokenized / encoded
        It will also save the name of the class used, i.e. the encoding strategy

        :param out_dir: output directory to save the file
        """
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        with open(PurePath(out_dir, 'config').with_suffix(".txt"), 'w') as outfile:
            json.dump({'pitch_range': (self.pitch_range.start, self.pitch_range.stop),
                       'beat_res': {f'{k1}_{k2}': v for (k1, k2), v in self.beat_res.items()},
                       'nb_velocities': len(self.velocities),
                       'additional_tokens': self.additional_tokens,
                       'encoding': self.__class__.__name__,
                       'max_bar_embedding': self.max_bar_embedding},
                      outfile)

    def add_embedded_pos_enc(self, sample: List[List[int]]) -> List[List[int]]:
        """Adapt the Bar and Position values of a sample split from a bigger sample.
        Bars will begin at 0 and be incremented.

        :param sample: sample to adapt time
        :return: this same sample with bars beginning from 0
        """
        first_bar = int(self.vocab[-1].token_to_event[sample[0][-1]].split('_')[1])
        for i in range(len(sample)):
            new_bar = int(self.vocab[-1].token_to_event[sample[i][-1]].split('_')[1]) - first_bar
            sample[i][-1] = self.vocab[-1].event_to_token[f'Bar_{new_bar}']
        return sample

    def track_to_tokens(self, track: Instrument) -> List[List[int]]:
        """ Converts a track (miditoolkit.Instrument object) into a sequence of tokens
        A time step is a list of tokens where:
            (list index: token type)
            0: Pitch
            1: Velocity
            2: Duration
            (3: Position) to be recomputed with self.add_embedded_pos_enc
            (4: Bar) to be recomputed with self.add_embedded_pos_enc

        :param track: MIDI track to convert
        :return: sequence of corresponding tokens
        """
        # Make sure the notes are sorted first by their onset (start) times, second by pitch
        # notes.sort(key=lambda x: (x.start, x.pitch))  # done in midi_to_tokens
        ticks_per_sample = self.current_midi_metadata['time_division'] / max(self.beat_res.values())
        ticks_per_bar = self.current_midi_metadata['time_division'] * 4
        dur_bins = self.durations_ticks[self.current_midi_metadata['time_division']]

        # Check bar embedding limit, update if needed
        nb_bars = ceil(max(note.end for note in track.notes) / (self.current_midi_metadata['time_division'] * 4))
        if self.max_bar_embedding < nb_bars:
            self.vocab[4].add_event(f'Bar_{i}' for i in range(self.max_bar_embedding, nb_bars))
            self.max_bar_embedding = nb_bars

        tokens = []
        current_tick = -1
        current_bar = -1
        current_pos = -1
        for note in track.notes:
            # Positions and bars
            if note.start != current_tick:
                pos_index = int((note.start % ticks_per_bar) / ticks_per_sample)
                current_tick = note.start
                current_bar = current_tick // ticks_per_bar
                current_pos = pos_index

            # Note attributes
            duration = note.end - note.start
            dur_index = np.argmin(np.abs(dur_bins - duration))
            token_ts = [self.vocab[0].event_to_token[f'Pitch_{note.pitch}'],
                        self.vocab[1].event_to_token[f'Velocity_{note.velocity}'],
                        self.vocab[2].event_to_token[f'Duration_{".".join(map(str, self.durations[dur_index]))}'],
                        self.vocab[3].event_to_token[f'Position_{current_pos}'],
                        self.vocab[4].event_to_token[f'Bar_{current_bar}']]

            tokens.append(token_ts)

        return tokens

    def tokens_to_events(self, tokens: List[int]) -> List[Event]:
        """ Convert a sequence of tokens in their respective event objects
        You can override this method if necessary

        :param tokens: sequence of tokens to convert
        :return: the sequence of corresponding events
        """
        events = []
        for i, token in enumerate(tokens):
            name, val = self.vocab[i].token_to_event[token].split('_')
            events.append(Event(name, None, val, None))
        return events

    def tokens_to_track(self, tokens: List[List[int]], time_division: Optional[int] = TIME_DIVISION,
                        program: Optional[Tuple[int, bool]] = (0, False)) -> Tuple[Instrument, List[TempoChange]]:
        """ Converts a sequence of tokens into a track object
        A time step is a list of tokens where:
            (list index: token type)
            0: Pitch
            1: Velocity
            2: Duration
            3: Position
            4: Bar

        :param tokens: sequence of tokens to convert
        :param time_division: MIDI time division / resolution, in ticks/beat (of the MIDI to create)
        :param program: the MIDI program of the produced track and if it drum, (default (0, False), piano)
        :return: the miditoolkit instrument object and tempo changes
        """
        assert time_division % max(self.beat_res.values()) == 0, \
            f'Invalid time division, please give one divisible by {max(self.beat_res.values())}'
        events = [self.tokens_to_events(time_step) for time_step in tokens]

        ticks_per_sample = time_division // max(self.beat_res.values())
        name = 'Drums' if program[1] else MIDI_INSTRUMENTS[program[0]]['name']
        instrument = Instrument(program[0], is_drum=program[1], name=name)

        for time_step in events:
            if any(tok.value == 'None' for tok in time_step):
                continue
            # Note attributes
            pitch = int(time_step[0].value)
            vel = int(time_step[1].value)
            duration = self._token_duration_to_ticks(time_step[2].value, time_division)

            # Time and track values
            current_pos = int(time_step[3].value)
            current_bar = int(time_step[4].value)
            current_tick = current_bar * time_division * 4 + current_pos * ticks_per_sample

            # Append the created note
            instrument.notes.append(Note(vel, pitch, current_tick, current_tick + duration))

        return instrument, [TempoChange(TEMPO, 0)]

    def _create_vocabulary(self, sos_eos_tokens: bool = False) -> List[Vocabulary]:
        """ Creates the Vocabulary object of the tokenizer.
        See the docstring of the Vocabulary class for more details about how to use it.
        NOTE: token index 0 is often used as a padding index during training

        :param sos_eos_tokens: will include Start Of Sequence (SOS) and End Of Sequence (EOS) tokens
        :return: the list of Vocabulary objects
        """
        vocab = [Vocabulary({'PAD_None': 0}, mask=True) for _ in range(5)]

        # PITCH
        vocab[0].add_event(f'Pitch_{i}' for i in self.pitch_range)

        # VELOCITY
        vocab[1].add_event(f'Velocity_{i}' for i in self.velocities)

        # DURATION
        vocab[2].add_event(f'Duration_{".".join(map(str, duration))}' for duration in self.durations)

        # POSITION
        nb_positions = max(self.beat_res.values()) * 4  # 4/4 time signature
        vocab[3].add_event(f'Position_{i}' for i in range(nb_positions))

        # BAR
        # vocab.add_event('Bar_None')  # new bar token
        vocab[4].add_event(f'Bar_{i}' for i in range(self.max_bar_embedding))  # bar embeddings (positional encoding)

        return vocab

    def _create_token_types_graph(self) -> Dict[str, List[str]]:
        """ Returns a graph (as a dictionary) of the possible token
        types successions.
        Not relevant for Octuple.

        :return: the token types transitions dictionary
        """
        return {}  # not relevant for this encoding

    def token_types_errors(self, tokens: List[List[int]], **kwargs) -> Tuple[Union[float, Any]]:
        """ Checks if a sequence of tokens is constituted of good token values and
        returns the error ratio (lower is better).
        The token types are always the same in Octuple so this method only checks
        if their values are correct:
            - a bar token value cannot be < to the current bar (it would go back in time)
            - same for positions
            - a pitch token should not be present if the same pitch is already played at the current position

        :param tokens: sequence of tokens to check
        :return: the error ratio (lower is better)
        """
        err_time = 0
        err_note = 0
        err_type = 0
        current_bar = current_pos = -1
        current_pitches = []

        for token in tokens:
            if all(token[i] == self.vocab[i]['PAD_None'] for i in range(len(token))):
                break
            if any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(token)):
                err_type += 1
                continue
            bar_value = int(self.vocab[4].token_to_event[token[4]].split('_')[1])
            pos_value = int(self.vocab[3].token_to_event[token[3]].split('_')[1])
            pitch_value = int(self.vocab[0].token_to_event[token[0]].split('_')[1])

            # Bar
            if bar_value < current_bar:
                err_time += 1
            elif bar_value > current_bar:
                current_bar = bar_value
                current_pos = pos_value
                current_pitches = []
            # Position
            elif pos_value < current_pos:
                err_time += 1
            elif pos_value > current_pos:
                current_pos = pos_value
                current_pitches = []

            # Pitch
            if pitch_value in current_pitches:
                err_note += 1
            else:
                current_pitches.append(pitch_value)

        return tuple(map(lambda x: x / len(tokens), (err_type, err_time, err_note, 0., 0.)))

    def token_types_errors_training(self, x_tokens: List[List[int]], y_tokens: List[List[int]]) \
            -> Tuple[Union[float, Any]]:
        """ Checks if a sequence of tokens is constituted of good token types
        successions and returns the error ratio (lower is better).
        The Pitch and Position values are also analyzed:
            - a position token cannot have a value <= to the current position (it would go back in time)
            - a pitch token should not be present if the same pitch is already played at the current position
        :param x_tokens: input tokens
        :param y_tokens: produced token to check according to input
        :return: the error ratio (lower is better)
        """

        err_time = 0
        err_note = 0
        err_type = 0
        current_bar = current_pos = -1
        current_pitches = []

        for x_tok, y_tok in zip(x_tokens, y_tokens):
            if all(x_tok[i] == self.vocab[i]['PAD_None'] for i in range(len(x_tok))) or \
                    all(y_tok[i] == self.vocab[i]['PAD_None'] for i in range(len(y_tok))):
                break
            if any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(x_tok)) or \
                    any(self.vocab[i][token].split('_')[0] in ['PAD', 'MASK'] for i, token in enumerate(y_tok)):
                err_type += 1
                continue

            x_bar = int(self.vocab[4].token_to_event[x_tok[4]].split('_')[1])
            x_pos = int(self.vocab[3].token_to_event[x_tok[3]].split('_')[1])
            x_pitch = int(self.vocab[0].token_to_event[x_tok[0]].split('_')[1])

            y_bar = int(self.vocab[4].token_to_event[y_tok[4]].split('_')[1])
            y_pos = int(self.vocab[3].token_to_event[y_tok[3]].split('_')[1])
            y_pitch = int(self.vocab[0].token_to_event[y_tok[0]].split('_')[1])

            # Reset current pitches if time has moved
            if x_bar > current_bar or x_pos != current_pos:
                current_pitches = []
            current_bar, current_pos = x_bar, x_pos
            current_pitches.append(x_pitch)

            if y_bar < current_bar:
                err_time += 1
            # Position
            elif y_pos < current_pos:
                err_time += 1
            elif y_bar == current_bar and y_pos == current_pos and y_pitch in current_pitches:
                err_note += 1

        return tuple(map(lambda err: err / len(x_tokens), (err_type, err_time, err_note)))

And multi-input and multi-output modules for PyTorch would look like this (it's an example):

from typing import List

from torch import Tensor, cat
from torch.nn import Module, ModuleList, Embedding, Linear

class MultiEmbeddings(Module):
    """Multi-input module, taking several tokens as input, converting them to embeddings and
    concatenate them to make a single 'merged' embedding

    :param num_classes: number of classes for each token type
    :param embedding_sizes: sizes of each embedding type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    :param padding_idx: padding index, must be the same for each token type
    """
    def __init__(self, num_classes: List[int], embedding_sizes: List[int], d_model: int, padding_idx: int = 0):
        assert len(num_classes) == len(embedding_sizes), \
            f'The number of classes and embedding sizes must be the same ({len(num_classes)} and ' \
            f'{len(embedding_sizes)} were given)'
        super().__init__()
        self.embedding_layers = ModuleList([Embedding(num_classes[i], embedding_sizes[i], padding_idx)
                                            for i in range(len(num_classes))])
        self.proj = Linear(sum(embedding_sizes), d_model)

    def forward(self, x) -> Tensor:
        """

        :param x: Tokens sequences, shape: (L, N, Z)
        :return: Embeddings, as a tensor with a shape (L, N, E)
        """
        embeds = []
        for i, mod in enumerate(self.embedding_layers):
            embeds.append(mod(x[:, :, i]))
        x = cat(embeds, dim=-1)  # (L, N, sum(embedding_sizes))
        return self.proj(x)  # (L, N, E)

class MultiOutput(Module):
    """Multi-output module.

    :param num_classes: number of classes for each token type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    """
    def __init__(self, num_classes: List[int], d_model: int):
        super().__init__()
        self.output_layers = ModuleList([Linear(d_model, num) for num in num_classes])

    def forward(self, x) -> List[Tensor]:
        """

        :param x: Tokens sequences, shape: (L, N, E)
        :return: List of tensors of shape (L, N, *)
        """
        return [out(x) for out in self.output_layers]  # (L, N, *)

envilk commented 2 years ago

Still getting errors with the new version of octuple.py (I'll try to include most of them below).

1:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    254             pitch = int(events[0].value)
    255             vel = int(events[1].value)
--> 256             duration = self._token_duration_to_ticks(events[2].value, time_division)
    257 
    258             # Time and track values

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _token_duration_to_ticks(token_duration, time_division)
    364         :return: the duration / time-shift in ticks
    365         """
--> 366         beat, pos, res = map(int, token_duration.split('.'))
    367         return (beat * res + pos) * time_division // res
    368 

ValueError: not enough values to unpack (expected 3, got 1)

2:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    254             pitch = int(events[0].value)
    255             vel = int(events[1].value)
--> 256             duration = self._token_duration_to_ticks(events[2].value, time_division)
    257 
    258             # Time and track values

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _token_duration_to_ticks(token_duration, time_division)
    364         :return: the duration / time-shift in ticks
    365         """
--> 366         beat, pos, res = map(int, token_duration.split('.'))
    367         return (beat * res + pos) * time_division // res
    368 

ValueError: invalid literal for int() with base 10: '4/4'

3:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    259             program = int(events[3].value)
    260             current_pos = int(events[4].value)
--> 261             current_bar = int(events[5].value)
    262             current_tick = current_time_sig_tick + (current_bar - current_time_sig_bar) * ticks_per_bar \
    263                            + current_pos * ticks_per_sample

ValueError: invalid literal for int() with base 10: '0.5.8'

4:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_284739/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    253             # Note attributes
    254             pitch = int(events[0].value)
--> 255             vel = int(events[1].value)
    256             duration = self._token_duration_to_ticks(events[2].value, time_division)
    257 

ValueError: invalid literal for int() with base 10: '1.0.8'

When executed several times it sometimes worked (the generation was only 3 dissonant notes played at the same time), but then it failed again.

PS: I'll try the vocabulary stuff later on :)

Natooz commented 2 years ago

👋

These errors happen when tokens_to_midi() tries to decode a token of an unexpected type, e.g. a Duration token where a Pitch token is expected, so int() crashes.

I guess checking each token type at decoding could solve the issue, but I think the real and sustainable solution is to switch to multiple vocabularies. This way, the decoding module of a model would not "mess" with unexpected token types.

I finally implemented it, in the multi-embed-vocabs branch. Encoding / decoding works just as before, but I did not try it in real conditions, on generated token sequences that might contain errors / produce crashes. Do you think you could test it before I release the update? You can just copy/paste this file, but in that case you would also need to rewrite the tokens_to_events() method as here. Or else I'll test it in the following days.

Oh, and the vocabulary sizes can be retrieved with [len(vocab) for vocab in tokenizer.vocab] to build input / output modules of appropriate sizes.
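
For illustration, a minimal sketch of how those sizes could feed the multi input / output modules from the code above (the embedding sizes here are arbitrary placeholders):

num_classes = [len(voc) for voc in tokenizer.vocab]  # one vocabulary size per token type
embedder = MultiEmbeddings(num_classes, embedding_sizes=[128] * len(num_classes), d_model=256)
to_logits = MultiOutput(num_classes, d_model=256)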

envilk commented 2 years ago

Let me try this week; within a few days I can hopefully give you the results.

envilk commented 2 years ago

After switching to the multi-vocabulary branch and trying to generate again under the same circumstances, only this error occurs:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_340299/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    224         midi = MidiFile(ticks_per_beat=time_division)
    225         ticks_per_sample = time_division // max(self.beat_res.values())
--> 226         events = self.tokens_to_events(tokens, multi_voc=True)
    227 
    228         tempo_changes = [TempoChange(TEMPO, 0)]

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in tokens_to_events(self, tokens, multi_voc)
    175                 multi_event = []
    176                 for i, token in enumerate(multi_token):
--> 177                     name, val = self.vocab[i].token_to_event[token].split('_')
    178                     multi_event.append(Event(name, None, val, None))
    179                 events.append(multi_event)

KeyError: 67

The KeyError can show different numbers on different runs.

Even so, it seems that I won't have to compute my vocabulary size by hand anymore, and will be able to correctly build models with the new multi-vocabulary approach :smile:

Natooz commented 2 years ago

That's great to hear ! So let's chase that bug: can you share a JSON / CSV of the token sequence that causes the error? Edit: and also the tokenizer config / params? Here it seems that token 67 is not in the vocab, but I would need to debug this more deeply. (Do you think it could be an error coming from your model / output module?)

envilk commented 2 years ago

The config/params of the tokenizer are:

pitch_range = range(21, 109)
beat_res = {(0, 4): 8}
nb_velocities = 32
additional_tokens = {'Chord': False, 'Rest': False, 'Tempo': True, 'Program': True, 'TimeSignature': True,
                     'nb_tempos': 32,  # nb of tempo bins
                     'tempo_range': (40, 250), # (min, max)
                     'time_signature_range': (8, 2)} # (max_beat_res, max_bar_length_in_NOTE)

# Creates the tokenizer and list the file paths
tokenizer = Octuple(pitch_range, beat_res, nb_velocities, additional_tokens) # Octuple encoding

When you refer to the token sequence, do you mean the dataset? Or do you mean the 'events' list in the tokens_to_midi function in octuple.py?

I don't really know if my model or output code could be wrong; just in case, here are some snippets from the Jupyter notebook:

print('MidiTok Model Trainer')

config = GPTConfig(VOCAB_SIZE, 
                   max_seq,
                   dim_feedforward=dim_feedforward,
                   n_layer=4, 
                   n_head=8,
                   n_embd=256,
                   enable_rpr=True,
                   er_len=max_seq)
model = GPT(config).to(get_device())
model.eval()

rand_seq = model.generate(torch.Tensor([1]), target_seq_length=512)
out = rand_seq[0].cpu().numpy().tolist()

converted_back_midi = tokenizer.tokens_to_midi([out], None)
converted_back_midi.dump('output.mid')

print('Done!')

VOCAB_SIZE is 443 in this execution, max_seq is 1024, and dim_feedforward is 512.

PS: Should the vocab size change between different executions on the same dataset? Because I tokenized the dataset another time and it gave me VOCAB_SIZE = 580.

Natooz commented 2 years ago

Thank you !

By tokens I am referring to a token sequence produced by the model (a list of lists of integers in the case of Octuple).

I looked at the GPT2Model from Hugging Face, and the problem (for us here) is that it automatically comes with a single Embedding layer, so it can't be used with multi embeddings.

But if you are using PyTorch, the Transformer module is almost exactly the same. Here is how to create the model, with multi input / output modules for Octuple (I did not test it as-is, I just assembled it from code blocks I had):

from typing import Optional, List
from math import log

import torch
from torch.nn import Module, Linear, Embedding, ModuleList, TransformerEncoder, TransformerEncoderLayer, Dropout
from torch.nn.init import xavier_uniform_
from torch import Tensor, cat, no_grad, triu, ones, stack

class MyTransformer(Module):
    def __init__(self, num_layers: int, num_classes: List[int], d_model: int, nhead: int,
                 dim_feedforward: int, max_seq_len: int, embedding_sizes: List[int] = None,
                 dropout: float = 0.1, layer_norm_eps: float = 1e-5, device: torch.device = torch.device('cpu'),
                 padding_token: int = 0):
        super().__init__()
        head_dim, rest = divmod(d_model, nhead)
        assert rest == 0, f'Non valid combination of model dimension ({d_model}) and number of heads ({nhead})'
        self.device = device

        # POSITIONAL ENCODING
        self.pos_enc = AbsolutePositionalEncoding(d_model, max_seq_len)

        # Input module
        self.embedder = MultiEmbeddings(num_classes, embedding_sizes, d_model, padding_token)

        # Transformer (encoder layers only, used in a causal / GPT-like fashion)
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
                                                dropout=dropout, layer_norm_eps=layer_norm_eps)
        self.transformer = TransformerEncoder(encoder_layer, num_layers)

        # Output module
        self.to_logits = MultiOutput(num_classes, d_model)

        # INITIALIZATION
        for p in self.parameters():
            if p.dim() > 1:
                xavier_uniform_(p)
        self.to(self.device)

    def forward(self, tgt: Tensor, attn_mask: Optional[Tensor] = None, key_pad_mask: Optional[Tensor] = None,
                causal: bool = False):
        """

        :param tgt:
        :param attn_mask:
        :param key_pad_mask:
        :param causal: causal attention, will quickly compute attention with causality
        :return:
        """
        if attn_mask is None and causal:
            tgt_len = tgt.shape[0]
            attn_mask = triu(ones(tgt_len, tgt_len) * float('-inf'), diagonal=1).to(self.device)  # CAUSAL MASK

        tgt = self.embedder(tgt)  # (T,N) -> (T,N,E)
        tgt = self.pos_enc(tgt)
        tgt = self.transformer(tgt, mask=attn_mask, src_key_padding_mask=key_pad_mask)
        tgt = self.to_logits(tgt)  # (T,N,E) -> list of (T,N,C), C is variable and depends on vocab sizes
        return tgt

    @no_grad()
    def predict(self, x: Tensor, inference_lim: int, max_seq_len: int, top_k: int) -> Tensor:
        """ Prediction function for inference

        :param x: input tensor (N,T,Z) if multi-input embedding
        :param inference_lim: number of inferences
        :param max_seq_len: maximum sequence length being process (attention context size)
        :param top_k: top k sampling value
        :return: the predicted sequence
        """
        x = x.transpose(1, 0).to(self.device)  # (N,T,) --> (T,N,)

        try:
            for _ in range(inference_lim):
                # Adds the prediction to the target sequence, updates the time values
                y = self.forward(x[-max_seq_len:], causal=True)  # list of Z (T,N,C*)
                y = stack([top_k_sampling(type_[-1], top_k) for type_ in y]).t()  # (N,Z)
                x = cat([x, y.unsqueeze(0)])  # (T+1,N,Z)
        except KeyError:  # bar embedding too high
            pass
        return x.transpose(1, 0)  # (N,T,)

class MultiEmbeddings(Module):
    """Multi-input module, taking several tokens as input, converting them to embeddings and
    concatenate them to make a single 'merged' embedding

    :param num_classes: number of classes for each token type
    :param embedding_sizes: sizes of each embedding type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    :param padding_idx: padding index, must be the same for each token type
    """
    def __init__(self, num_classes: List[int], embedding_sizes: List[int], d_model: int, padding_idx: int = 0):
        assert len(num_classes) == len(embedding_sizes), \
            f'The number of classes and embedding sizes must be the same ({len(num_classes)} and ' \
            f'{len(embedding_sizes)} were given)'
        super().__init__()
        self.embedding_layers = ModuleList([Embedding(num_classes[i], embedding_sizes[i], padding_idx)
                                            for i in range(len(num_classes))])
        self.proj = Linear(sum(embedding_sizes), d_model)

    def forward(self, x) -> Tensor:
        """

        :param x: Tokens sequences, shape: (L, N, Z)
        :return: Embeddings, as a tensor with a shape (L, N, E)
        """
        embeds = []
        for i, mod in enumerate(self.embedding_layers):
            embeds.append(mod(x[:, :, i]))
        x = cat(embeds, dim=-1)  # (L, N, sum(embedding_sizes))
        return self.proj(x)  # (L, N, E)

class MultiOutput(Module):
    """Multi-output module.

    :param num_classes: number of classes for each token type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    """
    def __init__(self, num_classes: List[int], d_model: int):
        super().__init__()
        self.output_layers = ModuleList([Linear(d_model, num) for num in num_classes])

    def forward(self, x) -> List[Tensor]:
        """

        :param x: Tokens sequences, shape: (L, N, E)
        :return: List of tensors of shape (L, N, *)
        """
        return [out(x) for out in self.output_layers]  # (L, N, *)

class AbsolutePositionalEncoding(Module):
    """ Module injecting positional information in the embeddings of a sequence.
    To be used at the beginning of a transformer network, before the first layers.

    :param d_model: embedding size
    :param max_len: max length of the sequences that will be treated
    :param dropout: dropout value
    """

    def __init__(self, d_model: int, max_len: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """ Adds positional encoding to a sequence

        :param x: input tensor, shape (sequence length, batch size, embedding size)
        :return the tensor with positional encoding
        """

        x = x + self.pe[:x.size()[0], :].to(x.device, dtype=x.dtype)
        return self.dropout(x)

def top_k_sampling(x: Tensor, k: int, temperature: float = None) -> Tensor:
    """Top K sampling

    :param x: input tensor of shape (N,C) or (T,N,C)
    :param k: k factor
    :param temperature: temperature for softmax
    :return: sampling results as (N)
    """
    x_copy = x.clone() / temperature if temperature is not None else x.clone()
    indices_to_inf = x < torch.topk(x, k)[0][..., -1, None]
    x_copy[indices_to_inf] = float('-inf')
    if x.dim() == 2:  # (N,C)
        return torch.multinomial(torch.softmax(x_copy, -1), 1).squeeze(-1)
    elif x.dim() == 3:  # (T,N,C)
        return stack([torch.multinomial(torch.softmax(xi, -1), 1).squeeze(-1) for xi in x_copy])

And then create the model:

embedding_sizes = [256, 128, 128, 192, 64, 128, 128, 64]  # I just put random numbers, the choice is up to you
model = MyTransformer(num_layers=4, num_classes=[len(voc) for voc in tokenizer.vocab], d_model=256, dim_feedforward=1024, max_seq_len=1024, embedding_sizes=embedding_sizes)

For your last question, by 443 and 580 do you mean the sum of the vocabulary sizes? And yes, the size can change between datasets: as the durations of the files differ, the length of the Bar vocab will also differ.

envilk commented 2 years ago

This is the list of output tokens generated before converting to MIDI:

[1, 25, 21, 5, 10, 54, 29, 8, 2, 29, 29, 9, 10, 22, 24, 1, 2, 30, 1, 9, 10, 30, 22, 1, 2, 30, 1, 9, 10, 29, 22, 1, 2, 31, 1, 9, 10, 38, 26, 1, 2, 29, 1, 9, 10, 50, 26, 1, 2, 32, 1, 9, 10, 18, 24, 3, 2, 32, 1, 9, 10, 34, 22, 1, 2, 32, 1, 9, 10, 58, 26, 2, 2, 32, 1, 9, 10, 39, 26, 2, 2, 4, 1, 9, 10, 46, 23, 1, 2, 8, 1, 9, 10, 34, 25, 9, 2, 8, 1, 9, 10, 41, 22, 6, 2, 8, 13, 9, 10, 51, 23, 4, 2, 12, 13, 9, 10, 42, 23, 2, 2, 14, 13, 9, 10, 26, 20, 4, 2, 14, 13, 9, 10, 41, 22, 3, 2, 14, 13, 9, 10, 43, 24, 2, 2, 18, 13, 9, 10, 19, 24, 3, 2, 18, 14, 9, 10, 20, 20, 2, 2, 18, 14, 9, 10, 41, 22, 4, 2, 18, 14, 9, 10, 41, 21, 2, 2, 26, 14, 9, 10, 46, 21, 3, 2, 26, 14, 9, 10, 65, 27, 4, 2, 28, 14, 9, 10, 29, 25, 4, 2, 28, 14, 9, 10, 39, 21, 32, 2, 30, 14, 9, 10, 46, 27, 2, 2, 30, 14, 9, 10, 29, 21, 9, 2, 2, 14, 9, 10, 41, 25, 10, 2, 2, 14, 9, 10, 41, 19, 8, 2, 6, 14, 1, 10, 43, 23, 5, 2, 6, 14, 9, 10, 44, 19, 1, 2, 6, 14, 9, 10, 50, 29, 5, 2, 2, 14, 9, 10, 41, 25, 5, 2, 6, 14, 9, 10, 46, 25, 3, 2, 6, 14, 9, 10, 22, 27, 4, 2, 12, 14, 9, 10, 48, 27, 4, 2, 14, 14, 9, 10, 26, 20, 4, 2, 14, 14, 9, 10, 38, 22, 7, 2, 14, 14, 9, 10, 37, 19, 7, 2, 18, 14, 3, 10, 38, 21, 25, 2, 18, 14, 9, 10, 38, 26, 6, 2, 20, 14, 9, 10, 42, 24, 6, 2, 22, 14, 9, 10, 49, 21, 6, 2, 26, 14, 9, 10, 51, 22, 6, 2, 28, 14, 9, 10, 29, 22, 6, 2, 30, 14, 9, 10, 31, 22, 7, 2, 2, 14, 9, 10, 41, 24, 4, 2, 6, 14, 9, 10, 47, 25, 4, 2, 6, 14, 9, 10, 40, 26, 2, 2, 10, 14, 9, 10, 48, 26, 6, 2, 14, 14, 9, 10, 62, 26, 3, 2, 14, 14, 9, 10, 41, 26, 5, 2, 14, 14, 9, 10, 41, 26, 4, 2, 14, 14, 9, 10, 48, 28, 4, 2, 14, 14, 9, 10, 43, 22, 9, 2, 14, 14, 9, 10, 48, 30, 9, 2, 18, 14, 9, 10, 57, 28, 9, 2, 18, 14, 9, 10, 61, 26, 1, 2, 22, 14, 9, 10, 57, 28, 3, 2, 26, 49, 9, 10, 38, 27, 3, 2, 30, 14, 9, 10, 43, 29, 4, 2, 2, 14, 9, 10, 44, 26, 3, 2, 2, 14, 13, 10, 46, 28, 9]

The GPT2 architecture I am using is GPT2RGA (the one used in your colab-notebooks directory), not the one from Hugging Face. Anyway, it also has only one Embedding layer. I guess that multi-embedding is necessary in order not to flatten all the data and for Octuple to make sense as an eight-token pack, right? (My input tensors are shaped (4, 1024), and without flattening they are shaped (4, 1024, 8).)

Yeah, for my last question I meant the sum. What happens is that for the same dataset I can get two different overall sizes in different executions. However, I don't think this is leading to any of the problems described above.

Natooz commented 2 years ago

Ok, my bad. Indeed these models are not appropriate for Octuple, or even CP Word, MuMIDI or any "multi-embedding" representation. These representations work by creating embeddings for every token individually (8 here, and of different sizes if you want), then aggregating them (Pitch, Velocity, ...) with a "pooling" operation. This operation can be a concatenation, mean, sum or anything else that aggregates the embeddings. In MusicBERT and in the MultiEmbeddings module above, the embeddings are concatenated into one big embedding, which is then passed through a linear layer to obtain the final embedding (of the size of the model, d_model).
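
To make the pooling alternatives concrete, here is a minimal sketch of a summed variant (an illustration only, not MidiTok or MusicBERT code): each token type gets its own Embedding of size d_model and the embeddings are simply summed instead of concatenated.

from typing import List
from torch import Tensor
from torch.nn import Module, ModuleList, Embedding

class SummedEmbeddings(Module):
    """Aggregates the embeddings of the different token types by summing them."""
    def __init__(self, num_classes: List[int], d_model: int, padding_idx: int = 0):
        super().__init__()
        self.embedding_layers = ModuleList([Embedding(n, d_model, padding_idx) for n in num_classes])

    def forward(self, x: Tensor) -> Tensor:
        # x: (T, N, Z) token indices, one column per token type
        return sum(mod(x[:, :, i]) for i, mod in enumerate(self.embedding_layers))  # (T, N, d_model)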

And the output module must also decode tokens of every type individually, with different linear layers.

Then during training, one would compute several losses (one for every token type), which you can sum or mean before computing the gradients.

The MusicBERT paper has a good figure showing how this works; the CP Word paper too. Don't hesitate if you have any questions or troubles !

If I have time I'll test the new Octuple tokenizer in the next few days

envilk commented 2 years ago

I get what you say... I might try this kind of architecture, but I don't know how much time it will take me :sweat_smile:

Anyway, do you think that the last KeyError was caused by this usage?

I'll be looking forward to the test!

Natooz commented 2 years ago

Yes I am 99% confident this error was caused by the "flattening".

If this takes you too much time, maybe you could just switch to a 1D representation (REMI, Structured, etc.). I currently have things running; when that's done I'll try this version of Octuple.

If it can help, here is how to compute the several losses:

# x is the input sequence, of shape (N,T,Z): T is the sequence length, N the batch size, Z the number of token types
# y is the output logits: a list of Z tensors of shape (T,N,C*), where C* is the vocabulary size and varies with the token type (pitch, velocity, etc.)
criterion = torch.nn.CrossEntropyLoss()  # e.g. with ignore_index=0 if 0 is the padding token
losses = []
for j in range(len(tokenizer.vocab)):
    losses.append(criterion(y[j].permute(1, 2, 0), x[..., j]))  # shapes (N,C,T) and (N,T), see PyTorch cross-entropy for details
loss = sum(losses)  # here we sum, but we could also take the mean for instance

envilk commented 2 years ago

You were right. I used your transformer class and Octuple with some changes, and it seems to work now. The only problem is that I am struggling with the predict function: how would you use it to generate a sequence without a primer melody?

PS: Should I open another issue for this, or has this one gone too far off topic?

Natooz commented 2 years ago

That's great to hear ! 😃

I guess without a primer melody, you could either 1) use a SOS (Start Of Sequence) token, or 2) give only the first note.

The model will then pick the first note based only on one SOS, meaning that after being trained it will probably always produce the same probability distribution $\mathrm{p}_i \left( \mathbf{x_1} \lvert x_0 \right) \in \mathbb{R}^{C_i}$ where $x_0$ is the SOS token, $C_i$ the number of classes / vocab size, for each $i$ representing token types (pitch etc) / vocabularies. I guess this strategy could work if this likelihood has a high entropy so that the first note is not always the same. Otherwise adding some dropout or a higher softmax temperature for this first step might help.

You could maybe select the first note with some random (weighted) procedure.

Autoregressive generation from scratch can be tricky because of this.
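
As a minimal sketch of option 1), assuming the tokenizer was created with sos_eos_tokens=True so that each vocabulary contains an 'SOS_None' event (an assumption here), and using the MyTransformer / predict code from above:

import torch

# one SOS token per vocabulary / token type, shape (N=1, T=1, Z)
sos = torch.LongTensor([[[voc.event_to_token['SOS_None'] for voc in tokenizer.vocab]]])
generated = model.predict(sos, inference_lim=511, max_seq_len=1024, top_k=15)  # (1, 512, Z)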

Please tell me if you encounter bugs, if (I hope) it's not the case I will probably release this new version in the next few days.

envilk commented 2 years ago

I get what you say. Maybe that's deeper than I can build right now, but I will try my best these days :smile:

I'll report everything I find :+1:

envilk commented 2 years ago

Hi again!

Trying some stuff, I got this error when converting tokens to MIDI. It looks like midi_tokenizer_base.py is not able to process the "None" value of the time signature token that is passed from tokens_to_midi in octuple.py:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5271/2731609472.py in <module>
----> 1 converted_back_midi = tokenizer.tokens_to_midi(out[:, 1:].tolist()[0])

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    274             # Time Signature, adds a TimeSignatureChange if necessary
    275             if self.additional_tokens['TimeSignature'] and time_step[-1].value == 'None':
--> 276                 time_sig = self._parse_token_time_signature(time_step[-1].value)
    277                 if time_sig != (time_sig_changes[-1].numerator, time_sig_changes[-1].denominator):
    278                     current_time_sig_tick += (current_bar - current_time_sig_bar) * ticks_per_bar

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
    461         :return: the numerator and denominator of a time signature
    462         """
--> 463         numerator, denominator = map(int, token_time_sig.split('/'))
    464         return numerator, denominator
    465 

ValueError: invalid literal for int() with base 10: 'None'

Natooz commented 2 years ago

Hi :)

I just corrected it in 960cbfa8eac1750aec1fb95d623e3ab2a51370f1 (a really stupid bug, ahah). Hoping not to find any other bugs; if you do, please tell me. And if after testing with generated tokens you don't encounter any other bug, please also report that so I can include this in the next release. :)

Natooz commented 2 years ago

Hi Env,

After a few tests I did not run into bugs, so I released the update in v1.2.0 ! If you get new bugs / crashes please re-open this issue or create a new one ! :)

BTW, Octuple is pretty "demanding" in computing resources, meaning the multi input / output requires a relatively high number of model parameters (and therefore GPU memory). The original authors used 8 V100s (32GB VRAM), which is quite a lot. My results with one V100 weren't very good either, the model often producing errors like predicting Bars / Positions already passed (going backward in time). For smaller hardware / model sizes, representations like REMI / Structured are more suitable.

envilk commented 2 years ago

Amazing!

I will try to train for many epochs on an Amazon GPU, so when I have results I can tell you. The new version seems to work well!

envilk commented 2 years ago

Hi there @Natooz, hope you are doing good!

I've been doing several tests, and my results weren't too good either. What you said is true: it uses a lot of GPU to train. It took ~5 days to train for 10 epochs (batch_size = 24) on one RTX 3060 (12GB VRAM). The results were 3-minute generations with some structure and harmony, but not much. Also, the generations have a lot of silent parts, meaning that within 3 minutes there can be 3 or 4 kinds of structures with different melodies, which can be quite different from one another.

I also realized that the Transformer code you sent me (which I still appreciate a lot!) has only encoder layers (like BERT). That made me think about two things.

Maybe by doing what I mentioned above, tuning the hyperparameters a little bit (embedding layers, etc.), and training it on AWS (which is my next target), the performance could improve. Let's see what happens :smile:

Thank you in advance!

Natooz commented 2 years ago

Hi @envilk, thanks for reaching back with your results !

IMHO this kind of compound representation is not ideal, as in symbolic music the embeddings do not carry much information. Whereas in natural language the embedding of a word or subword can represent a large amount of information (e.g. the embedding of "dog" tells what it looks like, what it does, etc.), the embedding of a single pitch value doesn't contain much meaning by itself. Hence I don't think it makes a lot of sense to mix embeddings of symbolic music tokens.

Now, to address the sequence length and memory complexity issues, one can simply use linear transformers, which have been proven to work well for music, and also use byte pair encoding (BPE). For the latter, we should remember that Transformers are usually used with large vocabularies for text (50k+), so I think a well BPE-encoded vocabulary for symbolic music should work, and at the same time build embeddings that carry more information by themselves. I ran some experiments on this that I have yet to analyze; the tokenizer is working, and depending on the results I will probably release it.

About the Transformer code: using only encoder layers to generate music is the right thing to do, but I understand the confusion. If we stick to the definitions from the original Transformer paper, an encoder layer only applies self-attention, while a decoder layer also applies encoder-decoder attention (also called cross-attention). In fact GPT is made of encoder layers (according to the definition above), but the authors called it a "decoder" model as it is used as a decoder. It brings confusion, whereas GPT-2, BERT, etc. are actually almost all the same: just stacks of encoder layers. See for example the code of GPT2 from Hugging Face (add_cross_attention is actually not used, there is only self-attention). The decoder layer is intended to receive an input sequence and hidden states from the encoder. In practice it is mostly used in seq2seq architectures, typically for question answering (QA), neural machine translation (NMT) or text summarization tasks. Cross-attention can also be used for multimodal applications, as in Perceiver.

The PyTorch Transformer implementation follows the original definition of encoder and decoder, so using only encoder layers will give you something like GPT or BERT.
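
As a small illustration of that point (a sketch, not tied to MidiTok):

from torch import triu, ones
from torch.nn import TransformerEncoder, TransformerEncoderLayer

layer = TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024)
encoder_stack = TransformerEncoder(layer, num_layers=4)  # BERT-like by default
causal_mask = triu(ones(1024, 1024) * float('-inf'), diagonal=1)  # pass as mask= to make it GPT-like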

Hoping this helps ! 😃