Improve performance of MIDI extraction

capital-G commented 3 years ago

Currently the extraction of the MIDI files takes a couple of hours, which is a bit much because we only have 100k examples to load. I tried to improve the speed by using https://github.com/jmcarpenter2/swifter, which promises to parallelize the code. It does indeed use 100% of the CPU, but it does not really seem to speed up the process and instead introduces a couple of problematic dependencies.

It is also worth to take a look at

telephon commented 3 years ago

Maybe pandas is slow for this size of data?

capital-G commented 3 years ago

I have worked with dataframes 100 times bigger and it was really fast, so it must have something to do with the extraction of the MIDI files, how the information from the MIDI files gets written to the dataframe and how the parallelization of this works - I need to profile this properly.
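A minimal sketch of what such a profiling run could look like with the standard-library cProfile and pstats modules. Note that parse_one and profile_calls are hypothetical names used for illustration; parse_one only stands in for the real per-file extraction.

```python
import cProfile
import io
import pstats


def parse_one(file_path):
    # Hypothetical stand-in for the real per-file MIDI extraction.
    return sum(ord(char) for char in file_path)


def profile_calls(func, arguments, top=10):
    """Run func on every argument under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for argument in arguments:
        func(argument)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_calls(parse_one, [f"file_{i}.mid" for i in range(100)])
print(report)
```

Sorting by cumulative time tends to show whether the time goes into the parsing itself or into writing the results to the dataframe.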

telephon commented 3 years ago

Maybe this is a hint? https://github.com/cuthbertLab/music21/blob/ebc47a1fbc9d65981bc52928593b4b564ae96cd5/music21/converter/__init__.py#L30

telephon commented 3 years ago

One could include the rendered pickle objects in the directory for download?

capital-G commented 3 years ago

Sharing pickle files is not such a good idea because they are bound to a Python version and could also be used to distribute malicious code. Besides, at this point of the meta analysis I avoid using music21 because it is really slow compared to pretty_midi.

telephon commented 3 years ago

How do I know in what state a python process is? When I call the code in parse_midi, it doesn't post anything, probably it is rendering but I don't know. The next section then calling midi_df.sample(15) throws an error (ValueError: a must be greater than 0 unless no samples are taken) which indicates that the parsing hasn't completed.

Perhaps we should select a subset of the library, maybe a specific genre?

capital-G commented 3 years ago

The problem in ML is that one has to work with large datasets and with Python, which is a notoriously slow language. I run those cells at night because even on a 2019 i9 it took 4 or 5 hours to process everything. We could write a parser in C but this is premature optimization IMO and is not how data scientists work b/c with caching we only need to run this once.

While a cell is running, a [*] should appear next to it. If you want to run another cell in the meantime, that is not possible because the notebook's Python process runs only one cell at a time, but the cell is scheduled to run after the currently running one.

Maybe we can get some funding to rent GCP machines, here is a pricelist for GPU/hour

The big advantage of GCP is that one can easily attach and detach GPUs on a machine: while you code you work with none or 1 GPU, but when you train you reboot the machine with 8 GPUs to get the results quicker for the same price. That's why most companies do not use their own GPUs anymore.

However, this would introduce yet another extensive topic, cloud computing, along with a vendor endorsement.

telephon commented 3 years ago

Testing with the limited subset, I noticed that sometimes it is very quick, sometimes it seems endless. There must be MIDI files that are much harder to parse, which is unsurprising, but the difference is surprisingly huge (2 sec vs. 4 minutes).

capital-G commented 3 years ago

Thanks for the hint - this is indeed an interesting problem and I will research this with a call graph https://www.jetbrains.com/help/pycharm/profiler.html#view-graph

But such large differences seem more likely to be due to caching: every time you call the method with the same number the cached results will be used (although I now remember that this will destroy the later feather file in which we store the extracted MIDI files, so I must fix this in the PR as well). If you use an unseen number, a new cache file will be calculated.

Can you please verify that this high variance is not due to the caching system?

telephon commented 3 years ago

Yes, I've just tried to narrow it down and found that it is caching. When I set limit_samples = 1000 it loads them from the cache; when I raise the number to 1001, it becomes slow.
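The behaviour is what one would expect from a cache that stores one file per distinct limit_samples value. A minimal sketch of that pattern, assuming a JSON cache file; the names extract and load_or_compute and the file layout are hypothetical, not the course code:

```python
import json
import tempfile
from pathlib import Path


def extract(limit_samples):
    # Hypothetical stand-in for the slow MIDI extraction.
    return [{"file": f"{i}.mid", "notes": i % 7} for i in range(limit_samples)]


def load_or_compute(limit_samples, cache_dir, compute=extract):
    """Return (rows, from_cache); one cache file per distinct limit_samples."""
    cache_file = Path(cache_dir) / f"midi_{limit_samples}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    rows = compute(limit_samples)
    cache_file.write_text(json.dumps(rows))
    return rows, False


with tempfile.TemporaryDirectory() as cache_dir:
    _, first_hit = load_or_compute(1000, cache_dir)
    _, second_hit = load_or_compute(1000, cache_dir)
    _, third_hit = load_or_compute(1001, cache_dir)
    print(first_hit, second_hit, third_hit)  # False True False
```

The second call with 1000 is served from the cache, while 1001 is an unseen key and triggers a full recomputation even though 1000 of its files were already parsed.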

telephon commented 3 years ago

It might be good to build the cache incrementally? But even loading only 1001 files takes a long time ...
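A rough sketch of what an incremental cache could look like, assuming results are stored per file path so that only files not seen before get parsed. parse_file and extract_incrementally are hypothetical names, and the JSON file only stands in for whatever storage format is actually used.

```python
import json
import tempfile
from pathlib import Path


def parse_file(file_path):
    # Hypothetical stand-in for the slow per-file MIDI parsing.
    return {"file": file_path, "length": len(file_path)}


def extract_incrementally(file_paths, cache_file, parse=parse_file):
    """Parse only files missing from the cache; return (results, newly_parsed)."""
    cache_file = Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    newly_parsed = 0
    for file_path in file_paths:
        if file_path not in cache:
            cache[file_path] = parse(file_path)
            newly_parsed += 1
    # Write the merged cache back so the next run can reuse it.
    cache_file.write_text(json.dumps(cache))
    return [cache[file_path] for file_path in file_paths], newly_parsed


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "midi_cache.json"
    paths = [f"song_{i}.mid" for i in range(1001)]
    _, parsed_first = extract_incrementally(paths[:1000], cache_file)
    _, parsed_second = extract_incrementally(paths, cache_file)
    print(parsed_first, parsed_second)  # 1000 1
```

With this layout, going from 1000 to 1001 files parses only the one new file, at the cost of keeping the cache keyed by file path.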

capital-G commented 3 years ago

Actually I don't want to spend too much time on the subsets of stuff because this only introduces new nightmarish complexity (see https://github.com/capital-G/musikinformatik-sose2021/pull/30#issuecomment-822673082) only to avoid some computation time which in the end will be due anyway if one wants to train something more complex.

telephon commented 3 years ago

You are right. I'm rendering overnight.

Is it correct that there are several processes?

capital-G commented 3 years ago

Yeah, that's the idea of https://github.com/jmcarpenter2/swifter - it tries to parallelize tasks as well as possible. I haven't used it before and I am still unsure if it is really worth it because I have some doubts regarding the efficiency, but maybe some benchmarks can help here.
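Such a benchmark could be sketched with the standard library alone, here using concurrent.futures instead of swifter. cpu_bound_task is a hypothetical stand-in for parsing one file, and time_serial and time_parallel are illustrative helper names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_task(n):
    # Hypothetical stand-in for parsing one MIDI file (pure CPU work).
    return sum(i * i for i in range(n))


def time_serial(func, items):
    """Return (results, seconds) for a plain loop."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return results, time.perf_counter() - start


def time_parallel(func, items, executor_cls, workers=4):
    """Return (results, seconds) when func is mapped over a worker pool."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(func, items))
    return results, time.perf_counter() - start


# Sanity check with threads. For CPU-bound work threads are not expected to
# be faster because of the GIL; pass ProcessPoolExecutor for a real benchmark.
items = [20_000] * 16
serial_results, serial_seconds = time_serial(cpu_bound_task, items)
pool_results, pool_seconds = time_parallel(cpu_bound_task, items, ThreadPoolExecutor)
print(f"serial: {serial_seconds:.3f}s, pool: {pool_seconds:.3f}s")
```

To measure real parallelism, pass concurrent.futures.ProcessPoolExecutor as executor_cls from a script guarded by if __name__ == "__main__". If the per-file work is small, the process start-up and pickling overhead can outweigh the gain, which would match 100% CPU usage without a visible speedup.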

telephon commented 3 years ago

It seems that overnight the process somehow ended but didn't complete. I started it again, but I suspect it has to do all the rendering again?

In particular it would be useful to know how far it is in the process (like say for every 3000 midi files post a message)
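A minimal sketch of such progress reporting with the standard logging module; parse_with_progress is a hypothetical helper, and len only stands in for the real parser.

```python
import logging

logger = logging.getLogger("midi_extraction")


def parse_with_progress(file_paths, parse, report_every=3000):
    """Parse all files, logging progress after every report_every files."""
    results = []
    total = len(file_paths)
    for count, file_path in enumerate(file_paths, start=1):
        results.append(parse(file_path))
        # Report at every multiple of report_every and once at the end.
        if count % report_every == 0 or count == total:
            logger.info("Processed %d of %d MIDI files", count, total)
    return results


logging.basicConfig(level=logging.INFO)
# len is only a stand-in for the real parser.
lengths = parse_with_progress([f"{i}.mid" for i in range(7000)], parse=len)
```

With 7000 files and the default interval this logs three messages, at 3000, 6000 and 7000 files.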

capital-G commented 3 years ago

Some profiling

import glob

import music21
import pretty_midi as pm
import pandas as pd

df = pd.DataFrame({'file_path': glob.glob("/Users/scheiba/github/musikinformatik_sose2021/datasets/lmd/lmd_full/*/*.mid")})
print(f'Found {len(df)} files')

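def read_pm(file_path):
    # Parse with pretty_midi; files that fail to parse are skipped.
    try:
        pm.PrettyMIDI(file_path)
    except Exception:
        pass

def read_music21(file_path):
    # Parse with music21; files that fail to parse are skipped.
    try:
        music21.converter.parse(file_path)
    except Exception:
        pass

# Profile the first 100 files; swap in read_pm to profile pretty_midi instead.
for _, row in df.head(100).iterrows():
    read_music21(row['file_path'])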

using pm:

using music21 (this gets really slow - after 15mins I stopped it and it was only at 96/100):

music21

builtin.id has an incredible call count of 1370593078

But the real evil is

I think this makes music21 unusable for our experiments.

There is a discussion on how to use pretty_midi with note-seq of the magenta team: https://github.com/craffel/pretty-midi/issues/133 - maybe this is worth considering

edit: so note-seq is just using pretty_midi under the hood - nonetheless it has the nice function called midi_file_to_drum_track, see https://github.com/magenta/note-seq/blob/a71716327619d9d995cce3ca1e0d8442ab2e0d73/note_seq/drums_lib.py - a function I was working on for the course to transfer the MIDI file to a grid

telephon commented 3 years ago

This looks like a good idea, even though it is a dependency of a different kind. They have functions for other aspects of midi files as well, so it may not lock us in too much.

I've let the current system render, and it so far has taken about 19 hours on a relatively new computer, still going. So I agree that we need to do something about it.

capital-G commented 3 years ago

@telephon #33 should work much better now - if not please reopen this issue. It is important to install the new dependencies - see https://capital-g.github.io/musikinformatik-sose2021/docs/course-info/setup.html#installing-dependencies

telephon commented 3 years ago

This now works a lot better, it took about two hours as expected.

capital-G / musikinformatik-sose2021

Improve performance of MIDI extraction #24