drscotthawley / panotti

A multi-channel neural network audio classifier using Keras
MIT License

preprocess_data out of memory #25

Closed: OterLabb closed this issue 6 years ago

OterLabb commented 6 years ago

Hi, I'm trying to preprocess 12 classes with approximately 400 samples each, ~4500 samples in total. Each is a mono wav file, 2 seconds long. While doing this I run out of memory and Python crashes.

I tried increasing the page file from 5 GB to 10 GB, but it still crashes. Running on Windows 10, i5-8250U, 8 GB RAM. Any help appreciated :)

Logfile: https://pastebin.com/PAEvzUpP
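
For a sense of scale, here's a rough back-of-envelope estimate (an editorial sketch using the numbers above, not something taken from the logfile):

```python
# Rough estimate of memory for holding every clip at once as float32,
# before any spectrograms or per-process copies are made.
n_files = 4500          # ~12 classes x ~400 samples
sr = 44100              # resample rate reported by preprocess_data.py
dur = 2                 # seconds per mono clip
bytes_per_sample = 4    # float32

raw_gb = n_files * sr * dur * bytes_per_sample / 1e9
print(f"raw audio alone: ~{raw_gb:.1f} GB")   # ~1.6 GB
# Spectrogram layers, intermediate copies, and 8 parallel worker
# processes can multiply this several-fold, well past 8 GB of RAM.
```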

drscotthawley commented 6 years ago

Hi @OterLabb, I'm not sure why you're getting that error; the dataset you're using should be fine.
The IDMT guitar effects dataset that I used is very similar -- 12 classes of 2-second mono audio. I'm not a Windows user, so I don't know how that might affect things.

A couple of things you could try:

  1. Try running with "--clean". It's a flag for those of us with "clean" data; it disables a lot of the automatic resizing that other users with heterogeneous datasets had requested.
  2. Try running in serial instead of parallel mode (see the sketch after this list). I should add a command-line flag for this, but for now it's inside the preprocess_data.py code where "parallel = True" is set. Change that to False. It will run slower, but it should use about 1/8th the memory, and those multiprocessing errors will go away.
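
panotti's actual preprocess_data.py is more involved, but a minimal sketch of the serial/parallel pattern that suggestion 2 refers to might look like this (`convert_one_file` and the task list here are stand-ins, not the real signatures):

```python
from multiprocessing import Pool

def convert_one_file(task):
    """Hypothetical per-file worker: load, resample, pad, and save one clip."""
    ...  # placeholder for the real conversion logic

def preprocess_dataset(tasks, parallel=True):
    if parallel:
        # One worker per CPU; each process holds its own in-flight data,
        # so peak memory scales with the number of processes.
        with Pool() as pool:
            pool.map(convert_one_file, tasks)
    else:
        # Serial fallback: same work, one file in memory at a time.
        for task in tasks:
            convert_one_file(task)
```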
OterLabb commented 6 years ago

Thanks for answering! The --clean flag got a little further, but it eventually crashed with the same memory issues.

Setting the parallel flag to False gave this error:

```
C:\Users\user\Downloads\PANETTI\PANETTI>python preprocess_data.py
Will be resampling at 44100 Hz
Will be imposing 80-20 (Train-Test) split
Shuffling ordering
Finding max shape...
Padding all files with silence to fit shape:
    Channels : 1
    Samples : 88200
class_names = ['aakersanger', 'busksanger', 'flekksnipe', 'honsehauk', 'myrsanger', 'rorsanger', 'sivsanger', 'spurvehauk', 'stjertmeis', 'strandsnipe', 'trostesanger', 'vannsanger']
8 CPUs detected: Parallel execution across 8 CPUs
Traceback (most recent call last):
  File "preprocess_data.py", line 185, in <module>
    dur=args.dur, clean=args.clean, out_format=args.format)
  File "preprocess_data.py", line 164, in preprocess_dataset
    convert_one_file(task, file_index, args)
TypeError: convert_one_file() missing 14 required positional arguments: 'nb_classes', 'classname', 'n_load', 'dirname', 'resample', 'mono', 'already_split', 'n_train', 'outpath', 'subdir', 'max_shape', 'clean', 'out_format', and 'file_index'
```

I guess I could get it to work on another computer with more RAM, but what if I wanted to preprocess hundreds of classes and many GB of wav files? Is there a way to process them in batches, like in training?

Edit: Seems like splitting them up and preprocessing a few at a time works fine :)
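
One way to script the batching described in that edit, as a hedged sketch: it assumes panotti reads its input from a Samples/ folder of per-class subdirectories (an assumption here), and stages a few class folders at a time into it:

```python
import shutil
import subprocess
from pathlib import Path

ALL_CLASSES = Path("AllSamples")   # hypothetical: every class folder lives here
STAGE = Path("Samples")            # folder preprocess_data.py reads from (assumed)
CHUNK = 3                          # classes to preprocess per pass

class_dirs = sorted(d for d in ALL_CLASSES.iterdir() if d.is_dir())
for i in range(0, len(class_dirs), CHUNK):
    if STAGE.exists():
        shutil.rmtree(STAGE)       # clear the previous chunk
    STAGE.mkdir()
    for d in class_dirs[i:i + CHUNK]:
        shutil.copytree(d, STAGE / d.name)
    # Each invocation only ever sees CHUNK classes' worth of audio.
    subprocess.run(["python", "preprocess_data.py"], check=True)
    # Move or merge the preprocessed output between passes so a later
    # run doesn't clobber it.
```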

drscotthawley commented 6 years ago

Hi, I don't understand how those positional arguments could be missing. They're right there in the code. That's strange.
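
One plausible reading of that traceback, offered as an editorial guess rather than anything confirmed in the thread: the parallel path unpacks each task tuple into the worker's many positional parameters, while the serial branch passes the whole tuple as a single argument. A toy reproduction of that failure mode:

```python
# Hypothetical reconstruction of the mismatch; not panotti's actual code.
def convert_one_file(nb_classes, classname, n_load, dirname):
    ...  # the real worker takes many more positional parameters

task = (12, "aakersanger", 400, "Samples/")

# Serial-style call: the whole tuple lands in the first parameter, so
# Python reports the rest as missing required positional arguments.
try:
    convert_one_file(task)
except TypeError as e:
    print(e)  # convert_one_file() missing 3 required positional arguments: ...

convert_one_file(*task)  # unpacking the tuple is the usual fix
```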
Sounds like it was a memory issue after all. Glad you got it working!