Natooz / MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶
https://miditok.readthedocs.io/
MIT License
665 stars 81 forks

`tokenizer.train` supporting Unigram and WordPiece #161

Closed Natooz closed 5 months ago

Natooz commented 5 months ago

Following #154

This PR also includes a few minor updates: the `PAD_None` token is now mandatory, the training tests are stronger, a `pad_token_id` property is added, and `has_bpe` is replaced by the `is_trained` property.


📚 Documentation preview 📚: https://miditok--161.org.readthedocs.build/en/161/

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 80.69815%, with 94 lines in your changes missing coverage. Please review.

Project coverage is 90.50%. Comparing base (c2c8cb3) to head (df1a94e). Report is 1 commit behind head on main.

:exclamation: Current head df1a94e differs from the pull request's most recent head 068aa18. Consider uploading reports for commit 068aa18 to get more accurate results.

| Files | Patch % | Lines |
|---|---|---|
| miditok/midi_tokenizer.py | 81.95% | 37 Missing :warning: |
| miditok/tokenizer_training_iterator.py | 21.73% | 18 Missing :warning: |
| tests/test_train.py | 90.17% | 11 Missing :warning: |
| benchmarks/utils.py | 0.00% | 9 Missing :warning: |
| tests/test_tokenize.py | 70.37% | 8 Missing :warning: |
| miditok/classes.py | 88.57% | 4 Missing :warning: |
| benchmarks/__init__.py | 0.00% | 2 Missing :warning: |
| miditok/tokenizations/mumidi.py | 0.00% | 2 Missing :warning: |
| miditok/tokenizations/octuple.py | 0.00% | 2 Missing :warning: |
| miditok/data_augmentation/data_augmentation.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #161      +/-   ##
==========================================
- Coverage   91.06%   90.50%   -0.57%
==========================================
  Files          35       37       +2
  Lines        5214     5474     +260
==========================================
+ Hits         4748     4954     +206
- Misses        466      520      +54
```
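The percentages in this report can be reproduced from the raw line counts. The sketch below is a rough check, not Codecov's own computation: it assumes coverage is simply hits divided by total lines, and the patch size of 487 lines is a hypothetical value inferred from the reported 80.69815% and 94 missing lines, not a number stated in the report.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return coverage as the percentage of lines that were hit."""
    return 100.0 * hits / lines


# Project-level counts taken from the coverage diff (Hits / Lines).
base = coverage_pct(4748, 5214)  # main
head = coverage_pct(4954, 5474)  # PR #161

# Per-file "Missing" counts taken from the files table.
missing_per_file = [37, 18, 11, 9, 8, 4, 2, 2, 2, 1]
patch_missing = sum(missing_per_file)

# Assumption: total patch size inferred from the reported patch coverage;
# the report itself does not state this number.
patch_lines = 487
patch = coverage_pct(patch_lines - patch_missing, patch_lines)

print(f"base coverage:  {base:.2f}%")
print(f"head coverage:  {head:.2f}%")
print(f"patch missing:  {patch_missing} lines")
print(f"patch coverage: {patch:.5f}%")
```

The per-file missing counts sum to the 94 lines quoted in the summary, and the hit and line counts give the 91.06% and 90.50% project figures.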
