Request: Test on a raspberry pi unit needed (with faster compilation)

zeehio commented 7 years ago

Hi,

One of the most common complaints about mimic is the long compile time it requires. The main cause is that the mycroft voice is compiled into the mimic binary instead of being loaded at runtime from a file.

We don't load the voice from a file at runtime because it is too slow. However, if we were able to improve the voice loading functions, we could stop embedding it at compilation time. So far, @forslund has made some improvements in #85, but there is still room for improvement.

I need someone to test a command that loads the mycroft voice from file. Then that person needs to compile mimic with a patch that may improve voice loading performance slightly and then check if there is a significant improvement or not.

Download and compile the development version

We will disable all the embedded voices (with --disable-voices-all) to make compilation much faster:

git clone https://github.com/MycroftAI/mimic.git
cd mimic
git checkout development
./configure --disable-voices-all
make

Test the timings (copy the output of this command)

time ./mimic -voice voices/mycroft_voice_4.0.flitevox  -t "" a.wav

Clean up:

make distclean

Try the patched version:

git remote add zeehio https://github.com/zeehio/mimic.git
git checkout zeehio/cg_maybe_faster_load
./configure --disable-voices-all
make

Test the patched version (copy the output of this command)

time ./mimic -voice voices/mycroft_voice_4.0.flitevox  -t "" a.wav

Thanks to anyone who can help with this.

forslund commented 7 years ago

Cool, I'll see if I can find my raspberry Pi.

I'm working on a memory pool allocator for all the small allocs, but I believe more in this approach.

What sort of time improvements would be expected? I assume you tested on your own PC?

zeehio commented 7 years ago

The number of calls to cst_safe_alloc decreased from 797956 calls to 294710. So after this commit we do 37% of the allocations in order to load the mycroft voice.
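The 37% figure can be checked from the two call counts. A minimal sketch, using only the numbers quoted above (the variable names are illustrative):

```python
# Call counts to cst_safe_alloc quoted in the comment above.
calls_before = 797956
calls_after = 294710

remaining = calls_after / calls_before   # fraction of allocations still made
avoided = 1 - remaining                  # fraction of allocations avoided

print(f"remaining: {remaining:.1%}, avoided: {avoided:.1%}")
```

So roughly 63% of the allocations are avoided. That says nothing by itself about wall-clock time, which is why the timings are still needed.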

I will check the timings. I believe it was something like 0.2s before and 0.15s after, but I am not sure.

forslund commented 7 years ago

That's a decent improvement. I seem to have my Pi packed away still so it'll take me a while to find it...

zeehio commented 7 years ago

I was afraid of disk cache so I tried again. The first run of each case shows the performance without any cache and you can see (at the end of this message) that the timing drops from 0.74s to 0.67s, merely a 10% improvement.

Thinking about it a bit more, my expectations for the Raspberry Pi are not that high: according to the rpi forums, the SD read speed on the Pi is about 40-50MB/s. The mycroft_voice_4.0.flitevox file is 69MB, so there are about 69/40 ≈ 1.7 seconds that we won't be able to avoid in any way.
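The lower bound on load time follows directly from the file size and the read speed. A small sketch using the figures quoted above; `min_read_seconds` is a hypothetical helper name, and the result ignores parsing, allocation and any caching:

```python
voice_size_mb = 69            # size of mycroft_voice_4.0.flitevox, from the comment
read_speeds_mb_s = (40, 50)   # SD read speeds reported on the forums

def min_read_seconds(size_mb, speed_mb_s):
    """Time needed just to stream the file from the card, ignoring all parsing."""
    return size_mb / speed_mb_s

for speed in read_speeds_mb_s:
    print(f"{speed} MB/s -> {min_read_seconds(voice_size_mb, speed):.2f} s")
```

At the faster end of the quoted range the floor drops to about 1.4 seconds, so the unavoidable part is somewhere between roughly 1.4 and 1.7 seconds on an uncached read.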

If disk reading is the limiting factor, then using ./mimic with all the voices embedded should be just as slow or even slower, because the mimic binary needs to be read from disk to RAM. The fairest comparison would be:

All voices embedded, no flitevox file loaded: (this is how things are now)

./configure && make && time ./mimic -t "" a.wav

No voices embedded, use flitevox file: (this is how things would be)

./configure --disable-voices-all && make && time ./mimic -voice voices/mycroft_voice_4.0.flitevox -t "" a.wav

I am afraid some previous benchmarks may have been done with both embedded voices and loading voices from file, and that is the most unfair situation as we have both a flitevox file and a large mimic binary to read.

Real numbers will be very welcome.
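For anyone who prefers to collect several runs in one go, here is a rough sketch of a timing loop. It is not part of mimic; `time_command` is a hypothetical helper, and it only measures wall-clock time, so `time` is still needed for the user/sys split. The first run only reflects an uncached read if the page cache was empty beforehand.

```python
import subprocess
import sys
import time

def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in command so the sketch runs anywhere; in the mimic build directory,
# replace it with e.g.
# ["./mimic", "-voice", "voices/mycroft_voice_4.0.flitevox", "-t", "", "a.wav"]
command = [sys.executable, "-c", "pass"]
for i, seconds in enumerate(time_command(command), start=1):
    print(f"run {i}: {seconds:.3f} s")
```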

Without optimization:

1st run:

real 0m0.742s
user 0m0.272s
sys 0m0.024s

2nd run:

real 0m0.275s
user 0m0.240s
sys 0m0.032s

3rd run:

real 0m0.274s
user 0m0.248s
sys 0m0.024s

With optimization:

1st run:

real 0m0.674s
user 0m0.108s
sys 0m0.048s

2nd run:

real 0m0.151s
user 0m0.112s
sys 0m0.036s

3rd run:

real 0m0.150s
user 0m0.120s
sys 0m0.028s
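To put the runs above side by side, the `real` values can be converted to seconds and compared. A minimal sketch; `to_seconds` is a hypothetical helper, and the cold/warm labels follow the reading given in the comment (first run uncached, later runs cached):

```python
import re

def to_seconds(stamp):
    """Convert a `time` value such as '0m0.742s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)

# 'real' times copied from the runs above: (without, with) optimization.
runs = {
    "1st run (cold cache)": ("0m0.742s", "0m0.674s"),
    "2nd run (warm cache)": ("0m0.275s", "0m0.151s"),
    "3rd run (warm cache)": ("0m0.274s", "0m0.150s"),
}

for label, (before, after) in runs.items():
    b, a = to_seconds(before), to_seconds(after)
    print(f"{label}: {b:.3f}s -> {a:.3f}s ({(b - a) / b:.0%} less)")
```

The cached runs improve far more than the uncached one, which fits the idea that the first run is dominated by reading the file from disk.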

forslund commented 7 years ago

I'm rebuilding my raspberry pi image at the moment. I think I reached the same conclusion when I looked into this: disk I/O was the main issue.

That said, this seems to be an improvement in any case.

If all else fails, pymimic runs OK keeping the voice file in memory.

zeehio commented 7 years ago

Oh yes, pymimic should be the way to go.

LongBoolean commented 7 years ago

Has anyone considered compiling mimic using a "Unity Build" (single compilation unit)? The compilation speed benefits are pretty large, because it reduces redundant compilation. Being IO bound is rarely the problem; mostly it is compiling all of those redundant includes, and if your files are as big as some of the mimic voice data files are, I can see why that could add up pretty quickly. Some say they have cut their times by 90%. My guess is that mileage may vary by project, but from what others have said and from my own experience with unity builds, 50% seems to be the lower bound.

I have noticed that there seem to be mixed feelings about unity builds online. Some people love them, some people think they are a hack. (That seems no more of a hack to me than using bash scripts or makefiles.) It looks like most of the naysaying comes from people who run into problems on C++ codebases that use a lot of C++ features (namespaces, templates, etc.), as C++ (or rather the conventional methods of building it) seems to assume the use of multiple compilation units. I haven't really heard anything bad about using it with C, other than that some crazy macros will be even crazier if you're not careful. I've heard some people say it makes your code unmaintainable (lol, I have also heard that functions with more than 10 lines of code are unmaintainable, that's the internet for you), but I have seen codebases that put everything into one compilation unit and are totally fine (although I think it helps if the project starts development that way, since switching to it later means touching quite a few places). I have heard of people running out of memory when compiling this way. I've never seen it happen, but as rare as it may be on a desktop, I thought I would mention it since compiling on a Raspberry Pi is a requirement in this case.

Your thoughts? Are there any other promising methods that you know of for reducing the total build time?

On Mon, Jun 12, 2017 at 10:16 AM, Sergio Oller notifications@github.com wrote:

Oh yes, pymimic should be the way to go.


zeehio commented 7 years ago

Yes, I considered unity builds after learning about them in the meson build system.

Mimic compilation issues on the raspberry pi are:

high ram usage due to very large compilation units
slow compilation of the Mycroft voice due to large compilation units.

The large compilation units are caused by huge structures that are compiled once.

Using a unity build on a raspberry pi for building mimic will likely make the pi system run out of memory because we would make one super huge compilation unit combining all the huge structures.

The best way to reduce the compilation time is to avoid compiling those large structures in the voice files and make mimic use .flitevox files. To do that, we must ensure that loading the voice from a file is competitive in speed with embedding the voice at compile time. That is what we are working on right now.

forslund commented 7 years ago

Sorry for the delay, my Pi was on the fritz and these tests have been run on a rPi 2 instead of a 3. The results look good, with a noticeable speed increase. It will have a bit less effect on the Raspberry Pi 3 since it has a faster CPU; I have no idea how much, though.

Results:

High load cached

Zeehio faster

real    0m8.496s
user    0m1.820s
sys     0m0.680s

Development

real    0m12.882s
user    0m2.890s
sys     0m0.730s

Low Load Cached

Zeehio faster

real    0m2.693s
user    0m1.850s
sys     0m0.650s

Development

real    0m3.881s
user    0m2.670s
sys     0m0.730s

Low Load Uncached

Zeehio faster

real    0m5.159s
user    0m1.780s
sys     0m0.980s

Development

real    0m6.025s
user    0m2.590s
sys     0m1.120s

I cleared the cache with the following snippet: free && sync && echo 3 > /proc/sys/vm/drop_caches && free
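One way to read these numbers is to separate CPU time (user + sys) from the rest of the wall-clock time. The sketch below does that for the results above. It assumes mimic runs single-threaded here, so that `real - (user + sys)` approximates time spent waiting on I/O or on the scheduler; "patched" is my label for the "Zeehio faster" runs, and `off_cpu_seconds` is a hypothetical helper.

```python
# (real, user, sys) in seconds, copied from the Pi 2 results above.
results = {
    ("high load, cached", "patched"):      (8.496, 1.820, 0.680),
    ("high load, cached", "development"):  (12.882, 2.890, 0.730),
    ("low load, cached", "patched"):       (2.693, 1.850, 0.650),
    ("low load, cached", "development"):   (3.881, 2.670, 0.730),
    ("low load, uncached", "patched"):     (5.159, 1.780, 0.980),
    ("low load, uncached", "development"): (6.025, 2.590, 1.120),
}

def off_cpu_seconds(real, user, sys_time):
    """Wall-clock time not spent executing the process (I/O waits, scheduling)."""
    return real - (user + sys_time)

for (scenario, build), (real, user, sys_time) in results.items():
    cpu = user + sys_time
    waiting = off_cpu_seconds(real, user, sys_time)
    print(f"{scenario:20s} {build:12s} cpu={cpu:.2f}s off-cpu={waiting:.2f}s")
```

Under that assumption, both uncached runs spend a little over 2 seconds off the CPU regardless of the branch, which would match the earlier point that SD card reads set a floor the patch cannot remove.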

zeehio commented 7 years ago

That's great!

I think I may be able to shave a few more seconds by changing how the CART trees are serialized. But it will take a while.

Just to know the details... did you compile mimic with --disable-voices-all in any of those cases?

forslund commented 7 years ago

./configure --disable-voices-all was the commandline I used.

forslund commented 7 years ago

Here are some uncached times for the Pi3. Not as big a boost but noticeable!

Pi3: Zeehio go faster

real    0m4.553s
user    0m0.990s
sys     0m0.360s

Pi3: development

real    0m5.578s
user    0m1.500s
sys     0m0.370s

zeehio commented 7 years ago

Good to know that the change is noticeable. There is still room for improvement in the flite serialization of CART trees. I hope to be able to improve it in the future.

zeehio commented 7 years ago

By the way, talking about improvements... has anyone explored different CFLAGS? On my workstation, the combination of -O3 with -ffast-math gave a 27% improvement in speech synthesis.

This test may take a while on a pi, because synthesizing doc/alice gives a 1h long wav file. You may want to use only 10% of the doc/alice document for testing on a pi (10% of doc/alice should be long enough to measure the differences).
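A shortened input file can be produced in a few lines. This is only a sketch: `write_head` and the output name `alice_10pct.txt` are made up for illustration, and it cuts by line count rather than by synthesized duration.

```python
from pathlib import Path

def write_head(src, dst, fraction=0.10):
    """Copy roughly the first `fraction` of the lines of `src` to `dst`."""
    text = Path(src).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines(keepends=True)
    keep = max(1, int(len(lines) * fraction))
    Path(dst).write_text("".join(lines[:keep]), encoding="utf-8")
    return keep

# Only does something when run from the mimic source tree.
if Path("doc/alice").is_file():
    kept = write_head("doc/alice", "alice_10pct.txt")
    print(f"wrote {kept} lines to alice_10pct.txt")
```

The resulting file can then be passed to mimic with -f in place of doc/alice.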

Before O3 fast-math

./configure --disable-voices-all
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_no_ffastmath_O3.wav

real    1m25.348s
user    1m25.132s
sys     0m0.212s

After O3 fast-math

./configure --disable-voices-all CFLAGS="-O3 -ffast-math"
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3.wav

real    1m2.789s
user    1m1.976s
sys     0m0.280s

Extra possible optimization (only on the pi3, not pi2, not pi1, not pi0), flags from here:

./configure --disable-voices-all CFLAGS="-O3 -ffast-math -mcpu=cortex-a53  -mfpu=neon-fp-armv8"
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3_cpu_fpu.wav

@forslund Can you still do some testing on your pi3? I don't have one.

forslund commented 7 years ago

Sure thing, now I have mine set up in a good way for testing. It might have to wait until tomorrow morning.

zeehio commented 7 years ago

I just edited the post to add --disable-voices-all, otherwise you will spend an awful lot of time compiling mimic... Oh, and with --disable-voices-all it is safe to use make -j4.

forslund commented 7 years ago

yeah, I never build the voices if I can help it =) -j4 was a good tip though

zeehio commented 7 years ago

I don't know whether -march=native includes the FPU/CPU-specific optimizations as of today. As you can see from the link I gave, it was not the case in April 2016.

Anyway, proper CFLAGS are something that usually is handled by the distribution packagers and not us, because there are thousands of combinations of compilers, architectures, cross-compilation scenarios and use cases.

Let's first see if that has any kind of impact and document it later. If it does have an impact, we can suggest that mycroft-core and other packagers change their build flags.

On 28 June 2017 at 5:12 a.m., "el-tocino" notifications@github.com wrote:

does -march=native include the f/cpu-specific optimizations (other than ffast-math)? Would be slightly more compatible for other systems than just pi ARM types that way. Elsewise, putting a cpu check in the build script might be possible if they're explicit.


el-tocino commented 7 years ago

pi3, picroft .8.16

time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3.wav

cg_maybe_faster_load:

real    12m33.637s
user    12m28.250s
sys     0m3.530s

Same branch as above with pi3 copts in place (./configure --disable-voices-all CFLAGS="-O3 -ffast-math -mcpu=cortex-a53 -mfpu=neon-fp-armv8"):

real    10m48.887s
user    10m42.190s
sys     0m4.180s

Default mycroftai mimic with pi3 copts:

real    11m54.623s
user    11m45.990s
sys     0m6.410s

forslund commented 7 years ago

My results (development branch)

normal
real    17m6.581s
user    17m3.220s
sys     0m3.100s

-O3 -ffast-math
real    16m18.952s
user    16m15.620s
sys     0m3.260s

-O3 -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard -funsafe-math-optimizations
real    16m23.169s
user    16m19.900s
sys     0m3.210s

-O3 -ffast-math -mcpu=cortex-a53  -mfpu=neon-fp-armv8
real    16m36.796s
user    16m33.260s
sys     0m3.420s

forslund commented 7 years ago

Tried -Ofast as well:

real    16m38.909s
user    16m35.770s
sys     0m3.070s

el-tocino commented 7 years ago

Ran again with SLT voice and writing to ramdisk:

pi@picroft:~/Build/mimic $ time ./mimic -voice voices/cmu_us_slt.flitevox -f doc/alice /ram/test.wav

cg_maybe_faster_load

real    4m55.212s
user    4m53.290s
sys     0m1.090s

cg_maybe w/copts

real    4m56.311s
user    4m54.660s
sys     0m1.040s

mycroft w/copts

real    4m42.239s
user    4m39.870s
sys     0m1.120s

zeehio commented 7 years ago

The difference between the voices is kind of expected. The sampling rate of the Mycroft voice is 44100Hz while the sampling rate of slt is 16000Hz, so we synthesize about three times more samples with the Mycroft voice. In any case, all those scenarios are several times faster than real-time synthesis.
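Both claims can be checked roughly against the Pi 3 timings reported above. The sketch assumes that doc/alice yields about one hour of audio with either voice (the thread only gives "1h long" as an approximate figure), and it ignores that the slt run wrote to a ramdisk. `real_time_factor` is a hypothetical helper.

```python
mycroft_rate_hz = 44100   # sampling rates quoted in the comment above
slt_rate_hz = 16000
rate_ratio = mycroft_rate_hz / slt_rate_hz

def real_time_factor(audio_seconds, synthesis_seconds):
    """Seconds of audio produced per second of computation."""
    return audio_seconds / synthesis_seconds

# Assumption: about one hour of audio for doc/alice, for both voices.
audio_seconds = 60 * 60

# 'real' times of el-tocino's Pi 3 runs on the cg_maybe_faster_load branch.
synthesis_seconds = {
    "mycroft_voice_4.0": 12 * 60 + 33.637,
    "cmu_us_slt": 4 * 60 + 55.212,
}

print(f"sample rate ratio: {rate_ratio:.2f}")
for voice, seconds in synthesis_seconds.items():
    print(f"{voice}: {real_time_factor(audio_seconds, seconds):.1f}x real time")
```

The measured time ratio between the two voices (about 2.6) is close to the sample rate ratio (about 2.8), which supports the explanation.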

el-tocino commented 7 years ago

Didn't want the sd card or io to be the limiting factor. Also I use slt normally. :)

forslund commented 7 years ago

I'm currently installing gcc-6 (.2 I think) to see if any improvements have been made since 4.9. I don't have very high hopes but it might be worth a try.

forslund commented 7 years ago

Did a quick profiling run on my PC. These are the top 4 CPU-hogging functions according to gprof:

 64.48     66.43    66.43 62197740     0.00     0.00  mlsadf
 11.79     78.58    12.15      251    48.41   360.97  synthesis_body
 11.67     90.60    12.02   565936     0.02     0.02  b2en
  4.77     95.51     4.91 14525906     0.00     0.00  internal_ff.constprop.1

I think mlsadf1 and mlsadf2 are inlined by the compiler and are hence not shown separately.
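To summarize the rows above, they can be parsed with a few lines of Python. The sketch assumes the standard gprof flat-profile column order (% time, cumulative seconds, self seconds, calls, self time per call, total time per call, name), since the header line was not included in the comment; `parse_row` is a hypothetical helper.

```python
# The four rows quoted above (gprof flat profile, header omitted).
rows = """\
 64.48     66.43    66.43 62197740     0.00     0.00  mlsadf
 11.79     78.58    12.15      251    48.41   360.97  synthesis_body
 11.67     90.60    12.02   565936     0.02     0.02  b2en
  4.77     95.51     4.91 14525906     0.00     0.00  internal_ff.constprop.1
"""

def parse_row(line):
    """Split one flat-profile row into (percent_time, self_seconds, calls, name)."""
    percent, _cumulative, self_seconds, calls, _self_call, _total_call, name = line.split()
    return float(percent), float(self_seconds), int(calls), name

profile = [parse_row(line) for line in rows.splitlines()]
top_share = sum(percent for percent, _, _, _ in profile)

print(f"top {len(profile)} functions: {top_share:.2f}% of run time")
for percent, self_seconds, calls, name in profile:
    print(f"{name}: {self_seconds / calls * 1e6:.2f} us self time per call")
```

mlsadf is cheap per call but is called tens of millions of times, so it is the call count rather than the cost of a single call that makes it dominate the profile.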
