Stazed / rakarrack-plus

Rakarrack plus LV2s
GNU General Public License v2.0

Slow startup on Raspberry Pi OS (Buster) #24

Open Rippert opened 3 years ago

Rippert commented 3 years ago

I've noticed that Rakarrack-Plus takes about one and a half minutes to start up on my Raspberry Pi 4. Not a bug, necessarily, but when I didn't know to expect the wait, it seemed like one.

It might be a good idea to put up a splash screen of some kind while Rakarrack-Plus is initializing.

Stazed commented 3 years ago

I know nothing about Raspberry Pi. A quick google indicates it "is a tiny and affordable computer". Can you even run real-time audio on it? Does Jack even run on it? Do other programs load slowly on it? I have a very fast 8 core with 16 GB RAM that gets bogged down with some of the effects. I do test on a 10 year old single core with 2 GB RAM and it only takes a few seconds to start up. If you can run real-time audio on it, this sounds like a bug or incompatibility of some sort. Since I do not have one of these, I cannot test and track down the issue. I am not inclined to add a splash screen for a single use case even if it is not a bug. Perhaps you or someone else with a Pi can track down the reason for the excessive startup time and let me know, so it can be fixed if possible.

Rippert commented 3 years ago

Yes, I have no problem running realtime audio with Jack on my RPi4. I was just testing it, and I was able to run Guitarix and Rakarrack-Plus simultaneously at 48000 Hz with 16 frames/period, 2 periods/buffer. That's only about 1.8 ms of latency, plus about another 1 ms from the USB interface. Had to do some CPU affinity stuff and minimize the GUIs to get that without xruns, but it works fine at 64 frames/period without all that effort.

The RPi4 has a quad core ARM processor running at 1.5 GHz, more if you overclock it.

Actually Rakarrack-Plus runs fine, much better than the old, non-plus, version. It just takes a long time to start up. I did build it from source, so maybe I could help track down what's causing it to start slowly.

Can you tell me what is going on in the code before the main graphics panel shows up? I watched it in top when it was starting up and it's running at about 95% on one processor the whole time before the window shows up. Actually drops down quite a bit after that.

I didn't try to do any optimization for my CPU in the make build process, so that is something I could try.

Stazed commented 3 years ago

Now that you mention building without optimizations, I think I know what is going on here. Try building the wip branch. There is a fix in there for a silly thing I did that you probably would not encounter if you optimized. Try building the branch without optimizations again and see if that fixes it. Building with opts would probably work on master as well. Let me know if that works. Thanks.

Rippert commented 3 years ago

Yeah, the wip branch took about 24s to start, the master branch took 1m 20s, so definitely better. I will try to optimize for an ARM cpu later and see if that gives me any more improvement.

Thanks, Ted

Rippert commented 3 years ago

I tried to append -mtune=cortex-a72 -march=native to the compilation flags, but it didn't seem to work. I don't really use cmake much so probably just doing it wrong.

Here's a typical gcc command while making your wip branch on my RPi4: /usr/lib/gcc/arm-linux-gnueabihf/8/cc1plus -quiet -I /usr/include/uuid -I /usr/include/freetype2 -I /usr/include/libpng16 -I /home/tedrippert/src/rakarrack-plus/src/UI -I /home/tedrippert/src/rakarrack-plus/build/src/UI -imultilib . -imultiarch arm-linux-gnueabihf -D_GNU_SOURCE -D VERSION="1.0.5" -D WEBSITE="github.com.Stazed.rakarrack.plus" -D PACKAGE="rakarrack-plus" -D DATADIR="/usr/local/share/rakarrack-plus" -D HELPDIR="/usr/local/share/doc/rakarrack-plus" /home/tedrippert/src/rakarrack-plus/src/Waveshaper.C -quiet -dumpbase Waveshaper.C -mfloat-abi=hard -mfpu=vfp -mtls-dialect=gnu -marm -march=armv6+fp -auxbase-strip CMakeFiles/rakarrack-plus.dir/Waveshaper.C.o -O3 -Wno-unused-parameter -std=c++11 -ffast-math -fsigned-char -ftree-vectorize -fvect-cost-model=dynamic -o -

That has reasonable optimization and is probably producing a portable binary within the Pi world.

Here is what I get from gcc -mcpu=native -march=native -Q --help=target

The following options are target specific:
  -mabi=                            aapcs-linux
  -mabort-on-noreturn               [disabled]
  -mandroid                         [disabled]
  -mapcs                            [disabled]
  -mapcs-frame                      [disabled]
  -mapcs-reentrant                  [disabled]
  -mapcs-stack-check                [disabled]
  -march=                           armv8-a+crc+simd
  -marm                             [enabled]
  -masm-syntax-unified              [disabled]
  -mbe32                            [enabled]
  -mbe8                             [disabled]
  -mbig-endian                      [disabled]
  -mbionic                          [disabled]
  -mbranch-cost=                    -1
  -mcallee-super-interworking       [disabled]
  -mcaller-super-interworking       [disabled]
  -mcmse                            [disabled]
  -mcpu=                            cortex-a72
  -mfix-cortex-m3-ldrd              [disabled]
  -mflip-thumb                      [disabled]
  -mfloat-abi=                      hard
  -mfp16-format=                    none
  -mfpu=                            vfp
  -mglibc                           [enabled]
  -mhard-float                      
  -mlittle-endian                   [enabled]
  -mlong-calls                      [disabled]
  -mmusl                            [disabled]
  -mneon-for-64bits                 [disabled]
  -mpic-data-is-text-relative       [enabled]
  -mpic-register=                   
  -mpoke-function-name              [disabled]
  -mprint-tune-info                 [disabled]
  -mpure-code                       [disabled]
  -mrestrict-it                     [disabled]
  -msched-prolog                    [enabled]
  -msingle-pic-base                 [disabled]
  -mslow-flash-data                 [disabled]
  -msoft-float                      
  -mstructure-size-boundary=        8
  -mthumb                           [disabled]
  -mthumb-interwork                 [disabled]
  -mtls-dialect=                    gnu
  -mtp=                             cp15
  -mtpcs-frame                      [disabled]
  -mtpcs-leaf-frame                 [disabled]
  -mtune=                           
  -muclibc                          [disabled]
  -munaligned-access                [enabled]
  -mvectorize-with-neon-double      [disabled]
  -mvectorize-with-neon-quad        [enabled]
  -mword-relocations                [disabled]

  Known ARM ABIs (for use with the -mabi= option):
    aapcs aapcs-linux apcs-gnu atpcs iwmmxt

  Known __fp16 formats (for use with the -mfp16-format= option):
    alternative ieee none

  Known ARM FPUs (for use with the -mfpu= option):
    auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon neon-fp-armv8 neon-fp16 neon-vfpv3
    neon-vfpv4 vfp vfp3 vfpv2 vfpv3 vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16

  Valid arguments to -mtp=:
    auto cp15 soft

  Known floating-point ABIs (for use with the -mfloat-abi= option):
    hard soft softfp

  TLS dialect to use:
    gnu gnu2

Mostly just an FYI, but I would like to push the optimization to the limits on my specific hardware and see if it improves anything, if possible, just as a test.

Rippert commented 3 years ago

Now I can't reproduce the faster startup. I even tried deleting the whole git folder and re-cloning it as well as trashing the preferences in my home folder. It seems like the origin/master branch and the origin/wip branch now both take 1m 14s. So I'm not sure if the wip branch really is an improvement as far as the slow startup is concerned.

Stazed commented 3 years ago

I just built without opts on the old single core and ran it. There was no difference in startup compared to opts. Even with the improvement you mentioned, to 24s, that is still way too long. With a quad core it should come up instantly. Certainly not slower than my single. Same with everything else, pretty much instant load. Much better running though.

Could you get it to build with optimizations? I really cannot help with that, but perhaps check out the zynaddsubfx or yoshimi CMakeLists.txt for possible ARM optimization flags. Seems like I recall something about that from them when I moved this to full cmake.

The only other thing that I would suggest is to set the git HEAD back to release 1.0.3 and build that. See if the problem still exists. If it still does, then keep going back by release. If the problem goes away at 1.0.3, then it is almost certainly an arm optimization issue and you gotta figure out how to build with opts.

Rippert commented 3 years ago

OK. I did manage to get cmake to add the optimizations I was trying for, and it didn't seem to make a difference. I'll try older releases and look at how Yoshimi and Zyn do it and let you know what I find. Like I said, overall it runs better than the old Rakarrack, so it's not some big fault, just something specific to the startup operations.

Thanks again, Ted

Rippert commented 3 years ago

Progress, of a sort.

I tried going back in the releases all the way to 0.7.0; they all had the long delay at startup.

Then I tried changing the frame/period, no effect.

Then sample rate, this had an effect. At 48000 Hz, it takes about 1m30s to start. At 44100 Hz it takes about 20s to start. So something to do with the sample rate.

I then added timers to all the initialization function calls in process.C:

rakarrack-plus 1.0.5 - Copyright (c) Josep Andreu - Ryan Billing - Douglas McClendon - Arnout Engelen
Try 'rakarrack-plus --help' for command-line options.
load_user_prefernces took 0.002521 seconds to complete 
Get_Bogomips took 0.105503 seconds to complete 
initialize_arrays took 0.000027 seconds to complete 
instantiate_effects took 18.776315 seconds to complete 
put_order_in_rack took 0.000342 seconds to complete 
MIDI_control took 0.001962 seconds to complete 
new_preset took 0.043970 seconds to complete 
new_bank took 0.000646 seconds to complete 
load_names took 0.015381 seconds to complete 
load_bank(BankFilename) took 0.004800 seconds to complete

So, instantiate_effects is the source of the delay. Makes sense, since I assume that is where the various down-sampling and up-sampling is being set up, and the time to do that might depend upon the sample rate.

I don't know if this is what you were thinking about before, but it narrows it down for me. I'll keep looking into it.
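
Per-call timers like the ones that produced the log above could be added with a small wrapper along these lines. This is only a sketch: time_call is a hypothetical helper name, not something from the rakarrack-plus sources, and the actual instrumentation in process.C may have looked quite different. It uses std::chrono::steady_clock so the measurement is not affected by wall-clock adjustments.

```cpp
#include <chrono>
#include <cstdio>
#include <utility>

// Hypothetical helper (not part of rakarrack-plus): runs `fn`, prints the
// elapsed time in the same style as the log above, and returns the elapsed
// time in seconds.
template <typename Fn>
double time_call(const char *label, Fn &&fn)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();  // run the code being measured
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::printf("%s took %f seconds to complete\n", label, elapsed.count());
    return elapsed.count();
}

// Example call site (assumed, for illustration only):
//   time_call("instantiate_effects", [&] { instantiate_effects(); });
```

Wrapping each initialization call this way keeps the original call order intact, so the numbers show where the startup time goes without changing behavior.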

Rippert commented 3 years ago

Some more info:

If I leave the sample rate at 44100 and turn on the Master Upsampling set to x4, the instantiate_effects process completes in about 8 seconds, instead of 19 sec. Other Master Upsampling settings seem to make things worse than 19 sec. I tried changing the various Quality settings, and they didn't seem to have any effect on startup time.

I did try the Raspberry Pi 4 specific optimizations from Yoshimi, -march=native -mfloat-abi=hard -mfpu=vfp -mcpu=cortex-a72 -mtune=cortex-a72 -pipe -mvectorize-with-neon-quad -funsafe-loop-optimizations, and they didn't have any effect on the startup time.

Stazed commented 3 years ago

That does narrow it down. Not what I was thinking about at all. That it occurs going back to 0.7.0 means that it is not anything I did :)! Changing the sample rate does change start up on my end as well, but still nothing like the delay you are having. 2s at most. I agree that optimizations are not the problem. There was an issue someone had a long time back that involved libsamplerate and the LV2s. It seemed that libsamplerate was loading slowly, which caused a problem for the host instantiating the plugin. Might be an area to look at, as libsamplerate does the resampling. Just guessing at this point...

Stazed commented 3 years ago

Another guess. You mentioned that you "had to do some CPU affinity stuff and minimize the GUIs to get that without xruns". It looks like your CPU rt optimizations may be starving non-rt activities, which makes sense when running rt. The reason that the slow startup likely does not affect guitarix is that gx is a plugin host and does not load all its effects on start. It just creates and loads/deletes from the rack when the user selects them. Rakarrack-plus loads all 47 effects on start, which is not rt safe. The benefit of doing this is that changing effects from bank selection or MIDI program change is rt safe, as the effects are simply turned on or off, not created and deleted. The benefit of the gx way is unlimited effects. Perhaps relaxing some of the "CPU affinity stuff" would get you a faster startup. But of course, worse for running.
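
The trade-off described above, between constructing every effect at startup and creating effects only when they are selected, can be sketched roughly as follows. All names here (Effect, PreloadedRack, OnDemandRack) are hypothetical and only illustrate the two strategies; they are not the actual rakarrack-plus or Guitarix classes, and the placeholder Effect does no real DSP setup.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for one effect; a real effect would allocate
// buffers and set up resamplers in its constructor.
struct Effect {
    bool active = false;
};

// Strategy 1: construct every effect up front. Startup pays the full cost,
// but switching an effect on or off later only flips a flag (no allocation).
class PreloadedRack {
public:
    explicit PreloadedRack(std::size_t count) : effects_(count) {}
    void set_active(std::size_t i, bool on) { effects_.at(i).active = on; }
    bool is_active(std::size_t i) const { return effects_.at(i).active; }
    std::size_t loaded() const { return effects_.size(); }
private:
    std::vector<Effect> effects_;
};

// Strategy 2: start with empty slots and construct an effect the first time
// it is selected. Startup is cheap, but selection may allocate, which is not
// real-time safe.
class OnDemandRack {
public:
    explicit OnDemandRack(std::size_t count) : slots_(count) {}
    void set_active(std::size_t i, bool on) {
        std::unique_ptr<Effect> &slot = slots_.at(i);
        if (!slot) slot.reset(new Effect());  // allocation happens here
        slot->active = on;
    }
    bool is_active(std::size_t i) const {
        return slots_.at(i) && slots_.at(i)->active;
    }
    std::size_t loaded() const {
        std::size_t n = 0;
        for (const auto &slot : slots_) if (slot) ++n;
        return n;
    }
private:
    std::vector<std::unique_ptr<Effect>> slots_;
};
```

With 47 slots, a PreloadedRack reports all 47 effects loaded immediately after construction, while an OnDemandRack reports none until set_active is first called on a slot.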

Rippert commented 3 years ago

Actually, I turned off all the cpu isolation and affinity stuff when I did the tests yesterday on the startup times, so that’s not it. Also, the xruns seem to be independent of the startup time issue. When I set the Master Upsampling to 4x, it lowered the startup time, but increased the xruns.

I think you’re not quite right about how Guitarix works either. I once asked Hermann if Gx could be changed to allow more than one copy of a plugin and he said no because it preloads them for seamless program changes.

You may be onto something about libsamplerate though. Guitarix uses zita-resampler, not libsamplerate, so it could be something in libsamplerate is just really poorly optimized for an ARM cpu, while zita-resampler is less so. Actually, it was the Guitarix splash screen that gave me the idea for the original suggestion in this issue, as Guitarix throws one up for a few seconds before the main window shows up.

Stazed commented 3 years ago

I assumed Guitarix would be doing the same thing for internal plugins as it does for ladspa and LV2. It does not seem possible to be loading all those externals on start. Very possible internals are handled differently. Zita is something that I plan on taking a close look at down the road. The biggest CPU hogs for R+ are those using libsamplerate. If zita is as good as the ad says, it would be worth the effort to move over. Have you heard of anyone else using a setup like yours having the same problem? Still seems that even if libsamplerate is the problem, it should not be that severe.

Rippert commented 3 years ago

No, I don't know anyone else using Rakarrack (or R+) on an RPi. I'm sure other people are, but I don't know them.

Rippert commented 3 years ago

Still more info, I broke out my old RPi3 and built R+ on it. It opens up with no delay!

Checked libsamplerate: it is 1.8 on the RPi3 and 1.9 on the RPi4. The RPi3 is running Debian 9 and the RPi4 is running Debian 10. They each only have one version of libsamplerate available, so I can't cross check by switching versions and recompiling. I'll see if I can upgrade my RPi3 to Debian 10; you can't downgrade the RPi4 to 9, it won't run.

I can't seem to find any git type repository for libsamplerate that covers both 1.8 and 1.9, (it was 5 years in between the two), so no obvious way to see what changed. Might not even be libsamplerate, of course, but it'd be nice to look at what's different in it.

Stazed commented 3 years ago

I am not even going to guess at it this time, especially after my lazy guess about Guitarix. I'd be surprised if it was libsamplerate though, 1m30s is way, way, way too long.

Rippert commented 3 years ago

I think you're right. I tried building libsamplerate 1.9 from source on the RPi 3 and rebuilding R+. Still starts right up with no delay.

Still, there has to be a reason the delay changes from 1m30s to 20s when I change the sampling rate.

Rippert commented 3 years ago

Tried R+ on my RPi 4 running Manjaro-KDE. Started right up without delay. I've renamed the issue to reflect that this is specific to Raspberry Pi OS (Buster). I'll look into it more when I get the chance, but this is probably a bug in the RPi OS, not R+, at this point.

Stazed commented 3 years ago

@Rippert any progress on this? I want to clear up as much as possible before the coming release. I suppose there is always the last resort to try, the dreaded re-install...

Rippert commented 3 years ago

No, haven't looked at it recently. I've got some new SD cards coming next week. I'll try a fresh Buster install then using the wip branch (or latest release if you get there first).

Rippert commented 3 years ago

Just did a fresh build of R+ on a fresh install of Raspbian OS 10. Still takes a long time to open the main window. So I'll start looking at what part of the instantiate_effects code is causing the slowdown.