Parallel processing of files?

haasn commented 4 years ago

I have a 16 core system. It's quite slow and wasteful to analyze all files single-threaded when trying to tag large folders.

How difficult would it be to add a bit of parallel processing to speed this process up?

Moonbase59 commented 4 years ago

Probably hefty, and I haven’t yet done any parallel stuff in C/C++. Also, loudgain doesn’t know anything about folder traversal but instead considers the list of filenames given to it as an "album".

Since mass-tagging usually only needs to be done once, I never really bothered (a 150,000 track/2 TB collection took 3 days 20 hours on one of my machines last time I checked). Usually, one only needs to loudgain the new albums, so it won’t take too long.

One could of course start more than one loudgain process and somehow do large collections in parts. This could be achieved by modifying the rgbpm script, or a little Python wrapper.

For now, I’ll set this to "wontfix" but leave it open, since it’s not top priority and would be difficult regarding album aggregation and console output to the user.

Thinking about it, I might come up with a rgbpm-style Python example that would parallelize processing whole folders (=albums). This might turn out to be the best compromise and surely speed up mass-tagging considerably on multi-core machines.

If you want to give it a try yourself, check out Python’s subprocess and multiprocessing modules. These might be a good starting point.
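As a rough illustration of that starting point, here is a minimal sketch that runs one loudgain process per album and keeps several of them going at once. It assumes loudgain is on the PATH and that each album is already available as a list of file paths. The helper names (`build_command`, `run_album`, `run_albums`) are made up for this example, and the flags are simply the ones from the one-liner posted further down in this thread. A thread pool is used instead of multiprocessing because the heavy work happens in the external processes anyway.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: flags copied from the one-liner in this thread; adjust to taste.
LOUDGAIN = ["loudgain", "-a", "-k", "-s", "e"]


def build_command(files, prefix=LOUDGAIN):
    """Return the argv for a single loudgain call covering one album."""
    return list(prefix) + [str(f) for f in files]


def run_album(files, prefix=LOUDGAIN):
    """Run one external process for one album; return (files, exit code, stderr)."""
    proc = subprocess.run(build_command(files, prefix),
                          capture_output=True, text=True)
    return files, proc.returncode, proc.stderr


def run_albums(albums, max_workers=4, prefix=LOUDGAIN):
    """Process several albums at once, one external process per album."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Results come back in the same order as the input albums.
        return list(pool.map(lambda files: run_album(files, prefix), albums))
```

Calling `run_albums([["a/01.flac", "a/02.flac"], ["b/01.flac"]])` would start two loudgain processes, each seeing only its own album, so album aggregation stays intact.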

Moonbase59 commented 4 years ago

You might want to try out the new bin/rgbpm2 Python script. It features up to 32 parallel processes on a per-folder-per-filetype basis and even handles folder exclusions and following links. :grin:

usage: rgbpm2 [-h] [-v] [-n 1..32] [-f] folder [folder ...]

ReplayGain album folders recursively, using loudgain.
Files of the same type in the same folder are considered an album.

Supported audio file types: .aif, .aiff, .ape, .asf, .flac, .m4a, .mp2,
.mp3, .oga, .ogg, .opus, .spx, .wav, .wma, .wv

positional arguments:
  folder                Path of a folder of audio files.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -n 1..32, --numproc 1..32
                        set max. number of parallel processes (default: 4)
  -f, --follow-links    follow links (default: False); use with care!

Please report any issues to https://github.com/Moonbase59/loudgain/issues.

rgbpm2 is not as verbose as rgbpm but has better error handling. The number of parallel processes defaults to the number of CPU cores and can be adjusted. This should greatly speed up loudgaining larger collections with many album folders.

_Note: Due to my personal folder structure, rgbpm2 will exclude folders that end with the text [compilations]. Adapt excludes = […] and extensions = {…} in the code to suit your personal tagging needs._
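To make the "per-folder-per-filetype" grouping and the exclusion rule more concrete, here is a small sketch of how such a scan could look. It is not rgbpm2's actual code: the function name `find_albums`, the shortened extension set and the decision to skip everything below an excluded folder are all assumptions made for illustration.

```python
import os
from collections import defaultdict

# Hypothetical values for illustration; rgbpm2's real lists may differ.
EXCLUDES = ["[compilations]"]
EXTENSIONS = {".flac", ".mp3", ".ogg"}


def find_albums(root, excludes=EXCLUDES, extensions=EXTENSIONS, follow_links=False):
    """Map (folder, extension) to a sorted list of files, skipping excluded folders."""
    albums = defaultdict(list)
    for folder, subdirs, files in os.walk(root, followlinks=follow_links):
        # Prune excluded folders in place so os.walk does not descend into them.
        subdirs[:] = [d for d in subdirs if not d.endswith(tuple(excludes))]
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in extensions:
                albums[(folder, ext)].append(os.path.join(folder, name))
    return {key: sorted(paths) for key, paths in albums.items()}
```

Each value in the returned dictionary is one "album" in the sense described above: all files of one type in one folder, ready to be handed to a single loudgain call.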

I do appreciate testing feedback and bug reports!

Moonbase59 commented 4 years ago

Test run of rgbpm2 on an old Ubuntu 14.04 4-core system using the (slow) loudgain.static:

Excluded 6840 folders.
Working on 12406 folders containing 132315 files, requiring 12415 tasks.
Found files of type: .flac, .m4a, .mp2, .mp3, .ogg, .wma

Time using rgbpm: 5 days, 2½ hours (122½ hours)
Time using rgbpm2: 1½ days (36 hours)

Parallel processing helps indeed. Thanks for the feature request, @haasn!
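For reference, the timings above work out as follows. This is only a back-of-the-envelope calculation, and it assumes rgbpm2 ran with 4 parallel processes on the 4-core machine, which the test report does not state explicitly.

```python
serial_hours = 122.5    # rgbpm, from the test run above
parallel_hours = 36.0   # rgbpm2, same machine
cores = 4               # assumption: one process per core

speedup = serial_hours / parallel_hours   # about 3.4
efficiency = speedup / cores              # about 0.85

print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
# prints: speedup 3.4x, efficiency 85%
```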

McOrvas commented 3 years ago

I also see no need to adjust loudgain here. Besides using rgbpm2, you can start parallel tasks with one simple command. I used the following to process my whole music library:

find . -type f -iname "*.flac" -printf '%h\0' | sort -zu | parallel -0 loudgain -a -k -s e "{}/*.flac"

Edit: To check the output of loudgain for errors, it makes sense to write it to a log file, for example:

find . -type f -iname "*.flac" -printf '%h\0' | sort -zu | parallel -0 loudgain -a -k -s e "{}/*.flac" &> loudgain-flac.log

dbedrenko commented 3 years ago

@McOrvas But I thought the way loudgain worked was that it must analyse all files first, determine how much to adjust each based on that analysis, and then write to the files. If you parallelise in the way that you did it, each loudgain thread doesn't know about the results of the other threads: it would only adjust within its batch.

McOrvas commented 3 years ago

@dbedrenko The command above starts one loudgain thread per folder for all flac files in it. So per folder (in my case CD/album) exactly one thread is started which processes all files in it. With this command the parallelization is for folders/albums, not for files!

Of course you are absolutely right that you should not start a loudgain thread per file (at least if you want album gain tags). You could simplify the command to find all folders instead of flac files and drop the whole "sort -zu" step, but in my case not all folders contain flac files, and this way loudgain is only called for folders with at least one flac file in them.

dbedrenko commented 3 years ago

@McOrvas Ah, that's for album gain, I see. I do track gain so unfortunately parallel is unsuitable for my purposes.

McOrvas commented 3 years ago

@dbedrenko To be honest, I don't understand the problem. The decision to use album or track replay gain takes place in the playback application, so you don't have a disadvantage when you add the three additional album tags to the files. But even if you, for whatever reason, only want to add track gain tags, you can use the parallelization above (with -r instead of -a).

If you really only want track gain tags, loudgain can process each file independently of the others, so it would be possible to start one loudgain instance for every file instead of folder. But I don't think that this is a good idea because of the enormous number of program calls.

dbedrenko commented 3 years ago

@McOrvas I may be wrong about the way loudgain works, let me demonstrate the way I think it works with a scenario:

quiet_album/quiet_song.mp3 
quiet_album/quiet_song2.mp3 

loud_album/loud_song.mp3
loud_album/loud_song2.mp3

Job A and B are tasked with processing quiet_album and loud_album respectively. Job A has no idea about the loud album in my collection: it won't normalise loud_album to be less loud. Job A will only normalise within its album, and likewise for Job B.

So I'm not worried about what tags are added: rather the ReplayGain values that are calculated to be in those tags.

McOrvas commented 3 years ago

@dbedrenko I think I'm beginning to understand where the misunderstanding comes from. Loudgain does not calculate the album or track gain values in relation to all other music files in your collection but only to a technically and globally defined value: -18 LUFS. So for track gain loudgain only needs to know a single track and for album gain it needs to know the other tracks of the album, but no more!

If it calculated the values in relation to other tracks/albums, you would have to run it again on all your files every time you added a single track, and your collection and mine would not end up at the same loudness. But exactly this is the goal of replay gain: the same loudness for all tracks/albums.
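A small worked example may help here. The sketch below assumes the gain is simply the reference level minus the measured loudness; the loudness figures for the two albums are invented for illustration, and `gain_db` is a hypothetical helper, not a loudgain function.

```python
REFERENCE_LUFS = -18.0  # the fixed reference level mentioned above


def gain_db(measured_lufs, reference=REFERENCE_LUFS):
    """Gain in dB needed to bring a measured loudness to the reference level."""
    return reference - measured_lufs


# Hypothetical measurements for the two albums from the scenario above.
quiet_album_lufs = -23.0
loud_album_lufs = -9.0

print(gain_db(quiet_album_lufs))  # 5.0  (turn the quiet album up by 5 dB)
print(gain_db(loud_album_lufs))   # -9.0 (turn the loud album down by 9 dB)
```

Neither result depends on the other album, which is why the two jobs can run independently and both albums still end up at the same playback loudness.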

dbedrenko commented 3 years ago

@McOrvas It would be helpful if someone would confirm that that is indeed the case. The software mentions that there is a common use case for recalculating your entire collection.

AustinSHend commented 3 years ago

I interpreted the OP's post as wanting to parallel scan multiple files from one album at the same time using album mode, not parallel scan multiple albums at the same time (trivial through several methods).

I'm not sure if this is actually even possible (to my knowledge, foobar2000 is the only program I've ever seen do parallel album mode replaygain), but I'd like to add that I would greatly appreciate that functionality as stated. Calculating replaygain is probably 20% of the time I spend when importing a new album into my library, and it could be nearly instant if I could use my 16 core system effectively.

Obviously there are a few hacky workarounds to accomplish this efficiency, e.g. import 16 albums, then run each one through a replaygain thread at the same time before moving into the library, or just spam the command on my entire collection and abort albums that already have replaygain, but it would be ideal to have this efficiency natively.

Moonbase59 / loudgain

Parallel processing of files? #13