Open bzamecnik opened 7 years ago
Maybe it would be best to first define what you need in terms of speed, memory usage, disk space, etc. I have done a few projects which involve parsing/analyzing/using ML models on O(100,000) MIDI files with pretty_midi on commodity hardware with no issues.
it seems some internal properties like __tick_to_time take much space.
Yes, this is cached to make things more efficient. A few megabytes should be no big deal in memory, I think :)
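To check which attributes actually dominate the pickle, one option is to pickle each instance attribute on its own and compare sizes. This is only a rough sketch: `pickled_attr_sizes` and `FakeSong` are hypothetical names made up for illustration, and `FakeSong` is a stand-in object, not the real PrettyMIDI class.

```python
import pickle


def pickled_attr_sizes(obj):
    """Return {attribute name: pickled size in bytes}, largest first."""
    sizes = {}
    for name, value in vars(obj).items():
        try:
            sizes[name] = len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))
        except (pickle.PicklingError, TypeError, AttributeError):
            # Some attributes may not be picklable on their own; skip them.
            continue
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


class FakeSong:
    """Hypothetical stand-in for a parsed MIDI object (not pretty_midi)."""

    def __init__(self):
        self.notes = [(60, 0.0, 0.5), (62, 0.5, 1.0)]
        # A large cached lookup table, similar in spirit to a tick-to-time map.
        self.tick_to_time = [tick * 0.001 for tick in range(100_000)]


sizes = pickled_attr_sizes(FakeSong())
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

On a real object, an attribute written with two leading underscores inside a class shows up in `vars()` under its name-mangled form (class name prefixed), so that is the key to look for in the result.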
Here is a comparison of creating a PrettyMIDI object from MIDI vs. unpickling it. The first file is very small (8 bars of monophony), the second one is a full song.
In [1]: %timeit pretty_midi.PrettyMIDI('small.mid')
3.42 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [2]: %timeit with open('small.pickle', 'rb') as f: pickle.load(f)
236 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %timeit pretty_midi.PrettyMIDI('smoke.mid')
367 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit with open('smoke.pickle', 'rb') as f: pickle.load(f)
25.4 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here are the file sizes:
514  small.mid
126K small.pickle
56K  smoke.mid
2.1M smoke.pickle
@craffel Would it be viable to integrate the NoteSequence protocol buffer from Magenta into pretty_midi (either by adding conversion methods or by directly including it as part of the internal representation)? It seems that the proto mimics the design of pretty_midi, and the code for converting back and forth already exists. However, it could be inconvenient to use Magenta directly, since it depends on a lot of other packages (e.g. a specific version of TensorFlow).
No, I don't think so. pretty_midi does not depend or rely on NoteSequence in any way; the dependency graph only points in one direction. If it's hard to use NoteSequence because of all of Magenta's dependencies, I'd suggest you advocate for Magenta to factor out NoteSequence into a separate library.
note-seq is now a separate library with reduced dependencies, and NoteSequence has been fixed to support efficient pickling!
Now we can do this:
import pretty_midi, note_seq, pickle
pm = pretty_midi.PrettyMIDI('file.mid')
# PrettyMIDI -> NoteSequence -> pickle
ns = note_seq.midi_to_sequence_proto(pm)
with open('file.pickle', 'wb') as f:
    pickle.dump(ns, f)
# pickle -> NoteSequence -> PrettyMIDI
with open('file.pickle', 'rb') as f:
    ns = pickle.load(f)
pm = note_seq.sequence_proto_to_pretty_midi(ns)
Parsing MIDI is rather slow (e.g. music21 is pretty slow, pretty-midi is better, but still not pretty fast) and we might want to perform queries on the parsed data or use it multiple times (e.g. as training data for an ML model). Besides trying to optimize the parsing stage, another option is to cache the parsed results, e.g. by pickling or a different form of serialization. In music21 it seems the objects are not serializable at all. In pretty-midi I was able to serialize the PrettyMIDI object and load it back, but the problem is that the serialized form is several orders of magnitude bigger than the original MIDI (just prohibitively big: several MB for a few kB of MIDI).
The subject of this issue is to serialize only the vital information that can be used to restore the object, while keeping any post-processing still faster than parsing the MIDI again.
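The caching idea could look roughly like the sketch below: parse once, store the pickle next to the source file, and reuse it while it is at least as new as the source. `load_cached` is a hypothetical helper, not part of pretty-midi or music21, and the parser is passed in as a plain callable so the sketch does not assume any particular library.

```python
import os
import pickle


def load_cached(source_path, parse, cache_path=None):
    """Parse source_path with parse(), reusing a pickle cache when it is fresh."""
    if cache_path is None:
        cache_path = source_path + ".pickle"
    # Reuse the cache only if it is at least as new as the source file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache is missing or stale: parse again and refresh it.
    parsed = parse(source_path)
    with open(cache_path, "wb") as f:
        pickle.dump(parsed, f, protocol=pickle.HIGHEST_PROTOCOL)
    return parsed
```

With pretty-midi installed, a call would presumably be `load_cached('file.mid', pretty_midi.PrettyMIDI)`. Pickles should only be loaded from a cache directory you trust, since unpickling can execute arbitrary code.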
Originally I thought pickle stores derived properties like get_piano_roll(), but after a very superficial inspection it seems some internal properties like __tick_to_time take much space. I can investigate and measure it in more detail.
A possible solution might be to explicitly provide the objects for serialization and possibly compress them (e.g. dense matrix to sparse) before serialization and decompress them after deserialization.
The goal is to reduce the pickled size to something comparable to MIDI (or one or two orders of magnitude bigger) and also to keep the (de)serialization time low.
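One way to explicitly choose what gets serialized is pickle's `__getstate__` / `__setstate__` hooks: drop the derived cache before pickling and rebuild it after loading. The sketch below is an assumption about how this might be done, not how pretty-midi does it. `ParsedSong`, `dump_compressed` and `load_compressed` are hypothetical names, a single constant tempo is assumed to keep the cache trivial, and gzip is swapped in as a generic compression step instead of the dense-to-sparse conversion.

```python
import gzip
import pickle


class ParsedSong:
    """Hypothetical parsed-MIDI container; not the pretty_midi class."""

    def __init__(self, seconds_per_tick, notes):
        self.seconds_per_tick = seconds_per_tick  # assumes one constant tempo
        self.notes = notes                        # (pitch, start_tick, end_tick)
        self._build_cache()

    def _build_cache(self):
        # Derived data: time in seconds for every tick up to the last note end.
        last_tick = max((end for _, _, end in self.notes), default=0)
        self.tick_to_time = [t * self.seconds_per_tick
                             for t in range(last_tick + 1)]

    def __getstate__(self):
        # Serialize only the vital fields; leave the large cache out.
        state = self.__dict__.copy()
        del state["tick_to_time"]
        return state

    def __setstate__(self, state):
        # Restore the vital fields, then recompute the cache.
        self.__dict__.update(state)
        self._build_cache()


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

The trade-off is that loading pays the cost of `_build_cache()` again, so this only helps if rebuilding the cache is cheaper than parsing the MIDI file from scratch.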