Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶
https://miditok.readthedocs.io/
MIT License
626 stars 80 forks source link

Handling Incompatible Files During `split_files_for_training` #176

Closed Kinyugo closed 1 month ago

Kinyugo commented 1 month ago

I'm using miditok to split my files for training purposes. However, I've encountered some incompatible files that prevent split_files_for_training from functioning correctly.

To address this, I implemented a pre-processing step to filter out files that couldn't be tokenized by miditok. Despite this, some incompatible files are still causing the whole process to fail.

I'd like to request guidance on how to effectively skip these incompatible files during the split_files_for_training stage.

Natooz commented 1 month ago

Hi 👋 Thank for the insight! Could you provide files that cause the method to crash? Corrupted files should be skipped when loading them fail so I assume the issue comes from the code itself. That would allow me to detect and fix what's wrong.

Also I'll move this method (and the associated split_score_per_note_density get_average_num_tokens_per_note split_dataset_to_subsequences methods) from the "PyTorch_data" module to the "utils" module (or maybe a dedicated "split_utils" module) of the lib in the next update (released soon) as it doesn't have to rely on PyTorch and should be able to be used with any DL framework.

Kinyugo commented 1 month ago

Here are some examples. error_files.tar.gz

Here is the error that I am getting:

  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 122, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py", line 210, in split_score_per_note_density
    bar_ticks = get_bars_ticks(score)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/utils.py", line 669, in get_bars_ticks
    if time_sigs[-1].time != max_tick:
       ~~~~~~~~~^^^^
IndexError
Kinyugo commented 1 month ago

I think I may have found a temporary fix for the above error. Instead of copying the file. I dump the loaded file using Score.dump_midi method. This seems to ensure that the required metadata is present.

def copy_if_valid(src_path, dest_dir, tokenizer) :
    try:
        # attempt to load and tokenize the midi file
        score = Score(src_path)
        tokenizer(score)

        # copy the file maintaining the directory structure
        dest_path = dest_dir / src_path.relative_to(src_path.parts[0])
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        score.dump_midi(dest_path) # use this instead of something like shutil.copy2
    except Exception as e:
        print(f"Error processing {src_path}: {e}")

However, there are other errors that occur that should also be handled. Here is an example:

{
    "name": "ZeroDivisionError",
    "message": "division by zero",
    "stack": "---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
Cell In[16], line 7
      5 except Exception as e:
      6     print(f\"Failed to process {file}: {e}\")
----> 7     raise e 
      8     break

Cell In[16], line 4
      2 for file in filepaths:
      3     try:
----> 4         split_files_for_training([file], tokenizer, Path(tmpdir), max_seq_len=16384)
      5     except Exception as e:
      6         print(f\"Failed to process {file}: {e}\")

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:94, in split_files_for_training(files_paths, tokenizer, save_dir, max_seq_len, average_num_tokens_per_note, num_overlap_bars, min_seq_len)
     88     return [
     89         path
     90         for path in save_dir.glob(\"**/*\")
     91         if path.suffix in SUPPORTED_MUSIC_FILE_EXTENSIONS
     92     ]
     93 if not average_num_tokens_per_note:
---> 94     average_num_tokens_per_note = get_average_num_tokens_per_note(
     95         tokenizer, files_paths[:MAX_NUM_FILES_NUM_TOKENS_PER_NOTE]
     96     )
     98 # Determine the deepest common subdirectory to replicate file tree
     99 root_dir = get_deepest_common_subdir(files_paths)

File ~/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/split_utils.py:304, in get_average_num_tokens_per_note(tokenizer, files_paths)
    302 if tokenizer.one_token_stream:
    303     num_notes = score.note_num()
--> 304     num_tokens_per_note.append(len(tok_seq) / num_notes)
    305 else:
    306     for track, seq in zip(score.tracks, tok_seq):

ZeroDivisionError: division by zero"
}
Natooz commented 1 month ago

Thank you! The issue comes from the fact that these MIDIs do not have default time signatures. This is rare cases and I assumed symusic would automatically attribute the default 4/4 time signature. It is done in MidiTok only when tokenizing, here the method didn't handle this case which is the case now in #175. It will be merged soon and you'll be able to get the fix by installing MidiTok and symusic (needed for both until symusic v0.5.0 is released) from git.

I just red you last comment, I think it did work because when dumping a default time signature is written in the file too.

Thank you for the report!

Natooz commented 1 month ago

(working on your last error when no note is present)

Kinyugo commented 1 month ago

Thanks for your time. Will you be including an option to skip erroneous files during splitting?

Natooz commented 1 month ago

Thanks for your time. Will you be including an option to skip erroneous files during splitting?

I prefer not too, as these errors just shouldn't happen and should be fixed/handled by MidiTok. Skipping them would be playing blinds :)

Kinyugo commented 1 month ago

That makes sense. Perhaps a warning to the user would be good. Then maybe one can remove the files or inspect the data if too many files have warnings.

Natooz commented 1 month ago

Ok I fixed the second issue which were caused by some methods not handling empty MIDIs (no tracks and/or no notes) in #175.

I'm currently testing locally with a large number of files from the Lakh dataset, I caught some other issues that's I'll fix before merging the branch. Edit: that was a silly error with Octuple and the maximum number of bars to tokenize. Everything should pass now.

Natooz commented 1 month ago

The fixes are merged in to the main branch!

You can run install from git to get them locally:

pip uninstall miditok symusic
pip install git+https://github.com/Yikai-Liao/symusic
pip install git+https://github.com/Natooz/MidiTok
Kinyugo commented 1 month ago

Did the location of split_files_for_training change?

ImportError: cannot import name 'split_files_for_training' from 'miditok.pytorch_data' (/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/pytorch_data/__init__.py)
Natooz commented 1 month ago

Yes, I moved them in the utils module as commented above :) miditok.utils.split_files_for_training

Kinyugo commented 1 month ago

It seems that the empty tracks are still not handled.

  File "/home/kinyugo/learning/ml/audio_generation/melo_mamba_mmm/melo/scripts/data_preprocessing.py", line 58, in cli_main
    split_files_for_training(
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 123, in split_files_for_training
    score_chunks = split_score_per_note_density(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kinyugo/miniforge3/envs/torch/lib/python3.11/site-packages/miditok/utils/split_utils.py", line 221, in split_score_per_note_density
    tpb = num_tokens_per_bar[bi]
          ~~~~~~~~~~~~~~~~~~^^^^
IndexError: list index out of range
Natooz commented 1 month ago

Could you provide the file causing the issue? I tried with an empty file (no tracks and with tracks with no notes/controls) without being able to reproduce it.

Kinyugo commented 1 month ago

This one causes division by zero error. I haven't tracked the other one yet. error_files_2.tar.gz

Natooz commented 1 month ago

Thank you! I tried to reproduce the error without success. Could you also share the tokenizer configuration you are working with? In the meantime, I am testing with a larger amount of files from the Lakh dataset hoping to catch erroneous files.

Kinyugo commented 1 month ago

Here it is:

def make_tokenizer() -> MMM:
    tokenizer_config = TokenizerConfig(
        use_tempos=True,
        use_programs=True,
        use_time_signatures=True,
        use_chords=True,
        use_rests=True,
        base_tokenizer="REMI",
        special_tokens=["PAD", "BOS", "EOS"],
    )

    return MMM(tokenizer_config)
Natooz commented 1 month ago

Thank you. I managed to reproduce the error with several files from the Lakh dataset. I'll continue to work on it tomorrow and push the fixes. Apologies for the inconvenience.

Natooz commented 1 month ago

@Kinyugo this time it should work, I tested with multiple combinations and about 40k files without any error. 🙌 You can reinstall it from git

Kinyugo commented 1 month ago

Thanks. I have tried it and now it works.