Make PrettyMIDI object serializable efficiently

bzamecnik commented 7 years ago

Parsing MIDI is rather slow (e.g. music21 is pretty slow; pretty-midi is better, but still not pretty fast), and we might want to perform queries on the parsed data or use it multiple times (e.g. as training data for an ML model). Besides trying to optimize the parsing stage, another option is to cache the parsed results, e.g. by pickling or a different form of serialization. In music21 it seems the objects are not serializable at all. In pretty-midi I was able to serialize the PrettyMIDI object and load it back, but the problem is that the serialized form is several orders of magnitude bigger than the original MIDI (just prohibitively big: several MB for a few kB of MIDI).

The subject of this issue is to serialize only vital information that can be used to restore the object while keeping any post-processing still faster than parsing the MIDI again.

Originally I thought pickle stores derived properties like get_piano_roll(), but after a very superficial inspection it seems some internal properties like __tick_to_time take much space. I can investigate and measure it in more detail.
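One way to do that measurement is to pickle each instance attribute separately and compare the sizes. The following is only a sketch: `pickled_attr_sizes` and the `Demo` class are hypothetical stand-ins, not part of pretty_midi. On a real PrettyMIDI object, Python's name mangling would presumably list the cache as `_PrettyMIDI__tick_to_time`.

```python
import pickle


def pickled_attr_sizes(obj):
    """Return {attribute name: pickled size in bytes}, largest first."""
    sizes = {
        name: len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        for name, value in vars(obj).items()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


class Demo:
    """Hypothetical stand-in for a parsed MIDI object."""

    def __init__(self):
        # Small, vital data: (pitch, start, end) tuples.
        self.notes = [(60, 0.0, 0.5), (62, 0.5, 1.0)]
        # Stand-in for a large cached lookup table.
        self._tick_to_time = [tick * 0.001 for tick in range(100_000)]


sizes = pickled_attr_sizes(Demo())
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Attributes are measured in isolation here, so objects shared between attributes are counted more than once; the numbers are a rough guide rather than an exact breakdown.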

A possible solution might be to explicitly provide the objects for serialization and possibly compress them (e.g. dense matrix to sparse) before serialization and decompress them after deserialization.
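A minimal sketch of the "serialize only the vital state" idea, using pickle's standard `__getstate__` hook to drop a derived cache and rebuild it lazily after loading. `CachedScore` and all of its attributes are hypothetical and only illustrate the pattern; this is not how pretty_midi is implemented.

```python
import pickle


class CachedScore:
    """Hypothetical stand-in for a parsed MIDI object; not pretty_midi's API."""

    def __init__(self, notes, resolution=220, tempo_bpm=120.0, max_tick=100_000):
        self.notes = notes            # vital: (pitch, start_tick, end_tick)
        self.resolution = resolution  # vital: ticks per quarter note
        self.tempo_bpm = tempo_bpm    # vital: a single fixed tempo, for simplicity
        self.max_tick = max_tick
        self._tick_to_time = None     # derived cache, built on first use

    def tick_to_time(self, tick):
        """Convert a tick to seconds, building the lookup table if needed."""
        if self._tick_to_time is None:
            scale = 60.0 / (self.tempo_bpm * self.resolution)
            self._tick_to_time = [t * scale for t in range(self.max_tick + 1)]
        return self._tick_to_time[tick]

    def __getstate__(self):
        # Pickle everything except the cache, which can be recomputed.
        state = self.__dict__.copy()
        state["_tick_to_time"] = None
        return state


score = CachedScore(notes=[(60, 0, 220), (62, 220, 440)])
score.tick_to_time(220)  # populate the cache before pickling

slim = pickle.dumps(score)
restored = pickle.loads(slim)
print(len(slim), "bytes;", restored.tick_to_time(220), "seconds")
```

The trade-off is that the first lookup after loading pays for rebuilding the table, so this only helps if rebuilding is cheaper than parsing the MIDI file again.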

The goal is to reduce the pickled size to something comparable to MIDI (or one or two orders of magnitude bigger) and also to keep the (de)serialization time low.

craffel commented 7 years ago

Maybe it would be best to first define what you need in terms of speed, memory usage, disk space, etc. I have done a few projects which involve parsing/analyzing/using ML models on O(100,000) MIDI files with pretty_midi on commodity hardware with no issues.

> it seems some internal properties like __tick_to_time take much space.

Yes, this is cached to make things more efficient. A few megabytes should be no big deal in memory, I think :)

cifkao commented 5 years ago

Here is a comparison of creating a PrettyMIDI object from MIDI vs. unpickling it. The first file is very small (8 bars of monophony), the second one is a full song.

In [1]: %timeit pretty_midi.PrettyMIDI('small.mid')
3.42 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [2]: %timeit with open('small.pickle', 'rb') as f: pickle.load(f)
236 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit pretty_midi.PrettyMIDI('smoke.mid')
367 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit with open('smoke.pickle', 'rb') as f: pickle.load(f)
25.4 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here are the file sizes:

 514 small.mid           56K smoke.mid
126K small.pickle       2.1M smoke.pickle
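
The ratios implied by these numbers can be worked out with a few lines of arithmetic. This sketch assumes the sizes are bytes, with K and M read as binary multiples (1024 and 1024²); the dictionary names are made up for illustration.

```python
# Timings from the %timeit runs above, in seconds.
timings_s = {
    "small": {"parse": 3.42e-3, "unpickle": 236e-6},
    "smoke": {"parse": 367e-3, "unpickle": 25.4e-3},
}
# File sizes from the listing above, in bytes (K = 1024, M = 1024**2 assumed).
sizes_bytes = {
    "small": {"mid": 514, "pickle": 126 * 1024},
    "smoke": {"mid": 56 * 1024, "pickle": 2.1 * 1024 ** 2},
}

speedup = {name: t["parse"] / t["unpickle"] for name, t in timings_s.items()}
blowup = {name: s["pickle"] / s["mid"] for name, s in sizes_bytes.items()}

for name in timings_s:
    print(f"{name}: unpickling {speedup[name]:.1f}x faster, "
          f"pickle {blowup[name]:.0f}x larger")
```

Under those assumptions, unpickling is about 14 times faster than parsing for both files, while the pickle is roughly 250 times larger than the MIDI for the small file and roughly 38 times larger for the full song.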

cifkao commented 5 years ago

@craffel Would it be viable to integrate the NoteSequence protobuffer from Magenta into pretty_midi (either by adding conversion methods or by directly including it as part of the internal representation)? It seems that the proto mimics the design of pretty_midi, and the code for converting back and forth already exists. However, it could be inconvenient to use Magenta directly, since it depends on a lot of other packages (e.g. a specific version of TensorFlow).

craffel commented 5 years ago

No, I don't think so. pretty_midi does not depend or rely on NoteSequence in any way; the dependency graph only points in one direction. If it's hard to use NoteSequence because of all of Magenta's dependencies, I'd suggest you advocate for Magenta to factor out NoteSequence into a separate library.

cifkao commented 4 years ago

note-seq is now a separate library with reduced dependencies, and NoteSequence has been fixed to support efficient pickling!

Now we can do this:

import pretty_midi, note_seq, pickle

pm = pretty_midi.PrettyMIDI('file.mid')

# PrettyMIDI -> NoteSequence -> pickle
ns = note_seq.midi_to_sequence_proto(pm)
with open('file.pickle', 'wb') as f:
    pickle.dump(ns, f)

# pickle -> NoteSequence -> PrettyMIDI
with open('file.pickle', 'rb') as f:
    ns = pickle.load(f)
pm = note_seq.sequence_proto_to_pretty_midi(ns)
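
The round trip above could be wrapped in a small on-disk cache so that a file is parsed only once. This is a generic sketch using only the standard library: `load_cached` and its `parse` and `cache_path` parameters are hypothetical names, and `parse` is any caller-supplied function that turns a path into a picklable object.

```python
import os
import pickle


def load_cached(source_path, parse, cache_path=None):
    """Return parse(source_path), reusing a pickle cache while it is fresh."""
    cache_path = cache_path or source_path + ".pickle"

    # Reuse the cache only if it is at least as new as the source file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Otherwise parse from scratch and write the cache for next time.
    result = parse(source_path)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result
```

Here `parse` could be a small function wrapping the PrettyMIDI -> NoteSequence conversion shown above, with the caller converting the returned NoteSequence back to a PrettyMIDI object. Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.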

craffel / pretty-midi

Make PrettyMIDI object serializable efficiently #133