daniestevez / gr-satellites

GNU Radio decoder for Amateur satellites
GNU General Public License v3.0

Performance issue when running on arm64 #416

Closed · kng closed this 1 year ago

kng commented 1 year ago

I've been struggling with my docker images that I prepared for satnogs-client and gr-satellites lately. Everything looked like it was working fine, but then I noticed quite high CPU load on some observations (BPSK 9k6), and also no demods on those.

Packaged release for satnogs here. I also built it from source (branch maint-3.8, since it's Debian Bullseye) with no change.

It complains about too much UDP data and drops packets ("gr::log :WARN: udp_source0 - Too much data; dropping packet"), but I think this is due to the load/hang, so it's a symptom. We solved that issue earlier with udp.conf.

We did some experiments, tried running volk_profile and messing around quite a bit, but without success. I did stumble on a repeatable process to test it: fetch the audio from https://network.satnogs.org/observations/6746907/ and run:

sox satnogs_6746907_2022-11-15T15-31-31.ogg satnogs_6746907_2022-11-15T15-31-31.wav
time gr_satellites 43199 --rawint16 satnogs_6746907_2022-11-15T15-31-31.wav --samp_rate 48e3
time gr_satellites 43199 --wavfile satnogs_6746907_2022-11-15T15-31-31.wav --samp_rate 48e3

On amd64 both complete in a few seconds. On arm64 the --wavfile run just stalls; I left it for 10h with no progress, while --rawint16 completes in about 20s on a rpi4. Tested with v3.15 and v5.1.0 on amd64 and v3.15 on arm64. (Technically --samp_rate is not needed for --wavfile, but it didn't seem to do any harm.)

I also tested running with --dump_path on arm64 for both --rawint16 and --wavfile; the latter just grew and grew in size.

rawint16/:
total 445920
-rw-r--r-- 1 999 docker 101491376 Nov 16 23:26 agc_in.c64
-rw-r--r-- 1 999 docker 101491376 Nov 16 23:26 agc_out.c64
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 clock_recovery_err.f32
-rw-r--r-- 1 999 docker  20269688 Nov 16 23:26 clock_recovery_out.c64
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 clock_recovery_T_avg.f32
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 clock_recovery_T_inst.f32
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 costas_error.f32
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 costas_frequency.f32
-rw-r--r-- 1 999 docker  20269688 Nov 16 23:26 costas_out.c64
-rw-r--r-- 1 999 docker  10134844 Nov 16 23:26 costas_phase.f32
-rw-r--r-- 1 999 docker  50745688 Nov 16 23:26 fll_error.f32
-rw-r--r-- 1 999 docker  50745688 Nov 16 23:26 fll_freq.f32
-rw-r--r-- 1 999 docker  50745688 Nov 16 23:26 fll_phase.f32

wavfile/:
total 1146960
-rw-r--r-- 1 999 docker    196144 Nov 16 23:26 agc_in.c64
-rw-r--r-- 1 999 docker    130616 Nov 16 23:26 agc_out.c64
-rw-r--r-- 1 999 docker 117423008 Nov 16 23:27 clock_recovery_err.f32
-rw-r--r-- 1 999 docker 234846016 Nov 16 23:27 clock_recovery_out.c64
-rw-r--r-- 1 999 docker 117423008 Nov 16 23:27 clock_recovery_T_avg.f32
-rw-r--r-- 1 999 docker 117423008 Nov 16 23:27 clock_recovery_T_inst.f32
-rw-r--r-- 1 999 docker 117328808 Nov 16 23:27 costas_error.f32
-rw-r--r-- 1 999 docker 117390244 Nov 16 23:27 costas_frequency.f32
-rw-r--r-- 1 999 docker 234780488 Nov 16 23:27 costas_out.c64
-rw-r--r-- 1 999 docker 117390244 Nov 16 23:27 costas_phase.f32
-rw-r--r-- 1 999 docker     32944 Nov 16 23:26 fll_error.f32
-rw-r--r-- 1 999 docker     32944 Nov 16 23:26 fll_freq.f32
-rw-r--r-- 1 999 docker     32944 Nov 16 23:26 fll_phase.f32

htop while running a BPSK 9k6 obs, by70-2 iirc: [screenshot]

htop while running a similar obs on amd64: [screenshot]

Edit: forgot the screenshot of the file run on the rpi4: [screenshot]

daniestevez commented 1 year ago

The --wavfile run never finishing looks like a serious bug. I wonder why the --rawint16 run finishes just fine, as there is not much difference between the two, other than --wavfile using a WAV File Source (which uses libsndfile) and --rawint16 using a File Source.

Given that the output of --dump_path grows indefinitely with --wavfile, I'm inclined to think that there is a bug either in the WAV File Source or in libsndfile that makes the file loop indefinitely rather than being read once and finishing. Can you try a simple flowgraph to check if this is the case? You only need a WAV File Source set to "Repeat: No" with its output sent to a File Sink. Run this flowgraph with your test WAV file as input (you can run it in no-GUI mode) and check whether the output file grows indefinitely.
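
Something like this minimal sketch should do (untested; assumes a mono WAV and a placeholder filename; it is equivalent to generating the two-block flowgraph from GRC in no-GUI mode):

#!/usr/bin/env python3
# Read the WAV once (Repeat: No) and dump the float samples to a file.
# If out.f32 keeps growing past the WAV size, the source is looping.
from gnuradio import gr, blocks

class wav_once(gr.top_block):
    def __init__(self, wav='test.wav'):
        gr.top_block.__init__(self)
        src = blocks.wavfile_source(wav, False)   # False = Repeat: No
        sink = blocks.file_sink(gr.sizeof_float, 'out.f32')
        self.connect(src, sink)

if __name__ == '__main__':
    wav_once().run()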

Unfortunately, I don't think that a potential bug in how the WAV files are read is related to your original problem about high CPU load. Maybe we're dealing with two different problems here, but to me it makes more sense to first deal with the one about the WAV file.

kng commented 1 year ago

Unfortunately all my rpi boxes run headless at the moment, so that's hard to set up. I could build a flowgraph that replays one of the iq_dump_file files from the satnogs observations to UDP, so it can be tested live, so to say. I also don't have gr-satellites installed on the system, only in the container.

Also, the wav/rawint is probably just a trigger for the bug, as in the real usage I'm doing --udp --iq as seen here.

The input path in gr_satellites is a bit different for all these, more than just the source type.

It actually runs just fine with other satellites/modes; I have narrowed it down to a few that run BPSK-style flowgraphs, including lilacsat-2, by70-2 and a few more.

daniestevez commented 1 year ago

You can run this test on a headless system. You just need gnuradio-companion on a desktop machine (if possible with a similar version of GNU Radio, to avoid problems). No need to have gr-satellites installed, since the test flowgraph I proposed only uses the in-tree "WAV File Source" and "File Sink" blocks. Set this flowgraph to "No GUI" in the generate options, generate the Python script, move the script to the rpi and run it there.

It actually runs just fine with other satellites/modes; I have narrowed it down to a few that run BPSK-style flowgraphs, including lilacsat-2, by70-2 and a few more.

Have you tried supplying the input by UDP to the flowgraph at a slower rate than real time (say at x0.1)? It might be that there is a performance bottleneck for some particular demodulators or deframers that makes the embedded system unable to keep up in real time.

kng commented 1 year ago

So I made a small flowgraph to replay a satnogs IQ dump, here, named iqdump2udp.py. Perhaps temporary: link to a sample IQ dump.

Run these in two terminals on the same host/container:

./iqdump2udp.py -f iq_6751281_76800.raw -s 76800
gr_satellites 45857 --samp_rate 76800 --iq --udp --udp_raw --udp_port 57356

This seems to work fine on an amd64 system, but struggles as expected on arm64. Even outputting a few seconds at 768 sps will make gr_satellites consume all CPU seemingly forever. [screenshot]

daniestevez commented 1 year ago

Can you add --dump_path to that and check if the output size grows too much? You can compare the output size between running on amd64 and arm64. It should be very similar (perhaps not exactly the same, because the symbol clock recovery might behave differently due to numerical errors or other factors). If you get a much larger output on arm64, or if it seems to grow way past the size produced on amd64, something is definitely wrong here. I have the impression that more data than we feed in is getting produced/duplicated somehow, yet I still don't understand how this happens.

kng commented 1 year ago

Repeated the previous test, feeding a small amount of samples and then terminating the feed. Letting gr_satellites run, it didn't complain about any dropped frames. Letting --dump_path fill up to about 1GB gives the following result:

total 1.1G
-rw-r--r-- 1 pi pi  32K Nov 18 00:12 agc_in.c64
-rw-r--r-- 1 pi pi  32K Nov 18 00:12 agc_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_err.f32
-rw-r--r-- 1 pi pi 211M Nov 18 00:13 clock_recovery_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_T_avg.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_T_inst.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_error.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_frequency.f32
-rw-r--r-- 1 pi pi 211M Nov 18 00:13 costas_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_phase.f32
-rw-r--r-- 1 pi pi  16K Nov 18 00:12 fll_error.f32
-rw-r--r-- 1 pi pi  16K Nov 18 00:12 fll_freq.f32
-rw-r--r-- 1 pi pi  16K Nov 18 00:12 fll_phase.f32

This is running on an up-to-date rpi4 with Debian Bullseye, with gr-satellites 3.15 from the satnogs repo, no docker.

daniestevez commented 1 year ago

That listing is very useful, thanks. We can see that the AGC and FLL outputs are small (probably the same size as those obtained in amd64), while the clock recovery (Symbol Sync block) and Costas loop outputs are very large (and presumably would keep growing indefinitely).

The Costas loop is downstream of the Symbol Sync block, so I bet the culprit here is the Symbol Sync block. It seems to be producing an infinite amount of output, and the Costas loop simply processes the infinite input it's given. Since the Symbol Sync performs "variable rate resampling", it's not impossible that if it's terribly confused then it could generate an infinite amount of output from a finite input (for example if it believes that to get the next symbol it's okay to step zero input samples).

This starts to seem more like a GNU Radio bug (probably an architecture-dependent bug), rather than a gr-satellites bug. I would suggest trying to produce a minimal flowgraph that replicates the issue, and then open up an issue in the gnuradio repo. Unfortunately I don't have an arm64 system at hand, so I can't help much.

You mentioned that the FSK satellites seem to work and that the problem is with the BPSK satellites. This makes sense to me. The FSK Demodulator and the BPSK Demodulator components both use the Symbol Sync block, but with different configurations. The FSK Demodulator uses a Gardner TED and an 8-tap MMSE resampler. The BPSK Demodulator uses an ML TED and a polyphase MF resampler. The latter configuration is probably the one that is broken.

I could try to make a minimal flowgraph that replicates the problem: generate a finite input (either zeros or random noise), send that into a Symbol Sync configured as the BPSK Demodulator does, and then send the output to a file. But you'll need to test this for me.

Something else: GNU Radio 3.8 is pretty much end-of-life these days, so folks will probably not fix this bug in GNU Radio 3.8. Can you try to see if the problem also happens with GNU Radio 3.10? If it doesn't, then problem solved, I guess. If the problem still happens, then we have much better chances of getting this fixed.

daniestevez commented 1 year ago

More ideas to debug: in the past I have found issues with the symbol recovery blocks when there are NaN's (and perhaps also infinities or other weird floats) in their input. The NaN's propagate through all the loop calculations, and eventually the loop error is cast to an int to compute how many samples or polyphase filter arms to advance. Casting NaN to an int gives an absurd value, and this typically causes a segfault when the int is used as an index into an array.
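
As a small illustration of that failure mode (numpy sketch; the exact value is platform-dependent, and in C the cast is undefined behaviour):

import numpy as np

# Casting NaN to an int gives an absurd value, typically INT_MIN,
# which then gets used as an index into an array.
err = np.array([np.nan], dtype=np.float32)
print(err.astype(np.int32))   # e.g. [-2147483648] (may emit a RuntimeWarning)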

I'm thinking that this is the kind of thing that could be architecture-dependent in subtle ways. So maybe it's worth checking that there are no NaN's or other weird floats at the input of the Symbol Sync.

To do this in an easy way, it's better to disable the FLL by using --disable_fll. Then the AGC will be directly connected to the Symbol Sync, so the input to the Symbol Sync will be the agc_out.c64 file. It might happen that the problem disappears when you use --disable_fll; this would strongly point to the FLL as the culprit. But probably the problem will still persist with the FLL disabled. In that case, can you either check that there are no NaN's, infinities, etc. in agc_out.c64 (it's a complex64 file), or alternatively attach the file here so that I can check it?
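
A quick way to do that check (small numpy sketch, assuming the dump file is in the current directory):

import numpy as np

agc_out = np.fromfile('agc_out.c64', dtype='complex64')
print('samples:', agc_out.size,
      'NaN:', int(np.isnan(agc_out).sum()),
      'non-finite:', int((~np.isfinite(agc_out)).sum()))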

kng commented 1 year ago

Tested running gr_satellites 45857 --samp_rate 76800 --iq --udp --udp_raw --udp_port 57356 --disable_fll --dump_path dump with the same short burst of UDP data. It resulted in pretty much the same load scenario; the dump path filled up to about 1GB before the run was terminated.

total 1.1G
-rw-r--r-- 1 pi pi  32K Nov 18 21:56 agc_in.c64
-rw-r--r-- 1 pi pi  32K Nov 18 21:56 agc_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_err.f32
-rw-r--r-- 1 pi pi 221M Nov 18 21:56 clock_recovery_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_T_avg.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_T_inst.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_error.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_frequency.f32
-rw-r--r-- 1 pi pi 221M Nov 18 21:56 costas_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_phase.f32

[screenshot]

daniestevez commented 1 year ago

Great. Can you share the agc_out.c64 file here?

kng commented 1 year ago

agc_inout.zip (had to compress it to attach); both in/out files included.

daniestevez commented 1 year ago
In [1]: import numpy as np

In [2]: agc_out = np.fromfile('agc_out.c64', 'complex64')

In [3]: agc_out = np.fromfile('agc_out.c64', 'complex64')

In [4]: agc_out
Out[4]: 
array([        nan       +nanj,         nan       +nanj,
               nan       +nanj, ..., -0.07814392-0.05673159j,
       -0.1532042 +0.32974455j, -0.45565328+1.1761963j ], dtype=complex64)

These nan don't look good.

daniestevez commented 1 year ago

It's not going to be a GNU Radio bug, really (well, arguably the Symbol Sync block shouldn't do silly things if it gets nan's in the input). This is a huge déjà vu of a bug in the gr-satellites AGC that I solved many years ago.

In [10]: agc_in[:20]
Out[10]: 
array([0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
       2.7911843e-22-1.8607895e-22j, 9.8180180e-08-6.5453456e-08j],
      dtype=complex64)

In [11]: agc_out[:20]
Out[11]: 
array([      nan      +nanj,       nan      +nanj,       nan      +nanj,
             nan      +nanj,       nan      +nanj,       nan      +nanj,
             nan      +nanj,       nan      +nanj,       nan      +nanj,
             nan      +nanj,       nan      +nanj,       nan      +nanj,
             nan      +nanj,       nan      +nanj,       nan      +nanj,
             nan      +nanj,       nan      +nanj,       nan      +nanj,
            -inf      +infj, 16.641005-11.094004j], dtype=complex64)

The problem is that the input starts with 18 zeros, for which the AGC produces nan as output. This basically happens because of 0/0. The AGC adds a tiny number to the denominator to avoid 0/0, but maybe that number is too tiny for arm64, or something is built with -ffast-math. I'm going to investigate a bit more to try to understand how this can happen on arm64 (I'm confident that it doesn't happen on amd64, and I'm going to check that now).
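
To illustrate the failure mode (toy numpy sketch, not the actual AGC code):

import numpy as np

x = np.zeros(4, dtype='complex64')        # an all-zero input burst
rms = np.zeros(4, dtype='float32')        # RMS of an all-zero input is 0
print(x / (rms + np.float32(1e-20)))      # guard keeps it finite: [0.+0.j ...]
with np.errstate(invalid='ignore'):
    print(x / rms)                        # without the guard: 0/0 -> [nan+nanj ...]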

kng commented 1 year ago

OK, sounds like it's possible to test, even with larger tiny numbers. Please advise how to modify and test this.

daniestevez commented 1 year ago

Here is a test flowgraph. It feeds 16 zeros to the RMS AGC block from gr-satellites and then prints the output on the screen.

agc_test.zip

On my amd64 machine this prints:

$ ./agc_test.py 
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
agc_test_output = [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]

I'm doing this in GNU Radio 3.10, but hopefully it will work in your GNU Radio 3.8 without modifications. You can scp it to the rpi4 and run grcc agc_test.grc to compile it to a Python script.

Can you test if this gives nan in the rpi4?

kng commented 1 year ago

grcc agc_test.grc

>>> Warning: vocoder_codec2_decode_ps - option_attributes are for enums only, ignoring
>>> Warning: vocoder_codec2_encode_sp - option_attributes are for enums only, ignoring
<<< Welcome to GNU Radio Companion Compiler 3.8.2.0 >>>

Block paths:
        /usr/share/gnuradio/grc/blocks

>>> Loading: /home/pi/iqdump/agc_test.grc
>>> Generating: /home/pi/iqdump/agc_test.py
>>> Warning: This flow graph may not have flow control: no audio or RF hardware blocks found. Add a Misc->Throttle block to your flow graph to avoid CPU congestion.

./agc_test.py

agc_test_output = ((nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj))
daniestevez commented 1 year ago

Great. As expected, arm64 gives nan. We're narrowing down the problem.

Here is another simple test.

add_const_test.zip

This generates 16 zeros and adds the small number 1e-20 to them using the Add Const block. This is what would happen in the RMS AGC block to avoid getting exactly zero in the denominator of the AGC.

On my amd64 machine this gives almost exactly 1e-20, as expected.

$ ./add_const_test.py
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
add_const_test_output = [9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21]

Can you test on the rpi4? I wouldn't be surprised if this gives exactly zero instead of 1e-20.
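
As an aside, the odd-looking 9.999999682655225e-21 is simply 1e-20 rounded to the nearest float32; a one-liner shows it (numpy sketch):

import numpy as np
print(float(np.float32(1e-20)))   # should print 9.999999682655225e-21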

kng commented 1 year ago

./add_const_test.py

add_const_test_output = (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)
daniestevez commented 1 year ago

In case this becomes relevant: I've just noticed that in GNU Radio 3.8 Add Const doesn't use Volk (it uses a partially unrolled C++ loop), while in GNU Radio 3.10 it uses the volk_32f_x2_add_32f kernel.

daniestevez commented 1 year ago
add_const_test_output = (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)

Ok. That's slightly surprising. I'm going to do a better test that captures and prints all the intermediate calculations of the RMS AGC. Stand by a few minutes.

daniestevez commented 1 year ago

Here is the test:

rms_agc_test.zip

And this is what I get on my amd64 machine:

$ ./rms_agc_test.py 
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
in_data [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]
rms_out [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
multiply_const_out [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
add_const_out [9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21]
float_to_complex_out [(9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j)]
divide_out [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]
kng commented 1 year ago

./rms_agc_test.py

in_data (0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j)
rms_out (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
multiply_const_out (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
add_const_out (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)
float_to_complex_out ((9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j))
divide_out ((nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj))
daniestevez commented 1 year ago

Okay. So it's the Divide block that is broken. It should be computing 0j / (9.999999682655225e-21+0j), which is 0, but it's giving nan instead. I'm going to take a look at the code to see where to put the blame.

Meanwhile, if you want to play, you can change the number 1e-20 in the last test (locate it in the .py or the .grc) and make it larger until the divide_out doesn't give nan. Maybe knowing where this breaks will give us a useful clue.

daniestevez commented 1 year ago

The Divide block for gr_complex data hasn't changed at all between 3.8 and 3.10. In both cases it uses volk_32fc_x2_divide_32fc.

Can you paste your $HOME/.volk/volk_config here? Also, which version of Volk are you running?

kng commented 1 year ago

self.blocks_add_const_vxx_0 = blocks.add_const_ff(5.5e-20) produces divide_out (0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j), and 5.4e-20 gives NaN.

kng commented 1 year ago
ii  libvolk2-bin       2.4.1-2      arm64        vector optimized runtime tools
ii  libvolk2-dev:arm64 2.4.1-2      arm64        vector optimized function headers
un  libvolk2-doc       <none>       <none>       (no description available)
ii  libvolk2.4:arm64   2.4.1-2      arm64        vector optimized functions

I thought I had just run volk_profile, but that might have been on another system. Running it now...

kng commented 1 year ago

volk_config.txt

daniestevez commented 1 year ago

If you didn't have $HOME/.volk/volk_config I think that Volk just uses the generic kernel.

Rather than running volk_profile (which could take quite a while, at least from what I'm used to in slower arm32 boards), I prefer that you edit your volk_config by hand and enter the following contents (these are the sole contents of the file):

volk_32fc_x2_divide_32fc generic

Then run the test.

Then replace the volk_config by

volk_32fc_x2_divide_32fc neon

With these we will check the generic and the NEON kernels, and see if they make any difference.

daniestevez commented 1 year ago

volk_config.txt

That was fast! Nice rpi4.

I see the following:

volk_32fc_x2_divide_32fc neon neon

This makes me realize that I messed up the syntax in my previous comment. We need to list 2 kernels: The one for aligned data and the one for unaligned data.

Can you run the rms_agc_test.py again to ensure that it fails when using the NEON kernel and then modify that line in volk_config to

volk_32fc_x2_divide_32fc generic generic

and run rms_agc_test.py to check what happens with the generic kernel?

kng commented 1 year ago

volk_profile takes ~5 min to plow through; not sure if it's bound by iterations or time. I tried all four combinations and they produce identical results according to diff (with 1e-20).

daniestevez commented 1 year ago

Since the generic test (which is plain old division of complex numbers in C) fails, as a sanity check, take the following simple program:

Edit: disregard this program. It is so simple that gcc already knows that the answer is 0 and doesn't even bother to do the math. I'll now paste something more sophisticated to force it.

#include <complex.h>
#include <stdio.h>

int main(int argc, char **argv) {
    float complex z = 0.0f;
    float complex w = 1e-20f;
    float complex x = z / w;
    printf("%e + 1j %e\n", creal(x), cimag(x));    
    return 0;
}

Can you build and test this? It should print 0.000000e+00 + 1j 0.000000e+00. For me it does, both on an amd64 machine and on a Beaglebone Black (arm32).

Can you test this on the rpi4? Try building it with -O3, and maybe -ffast-math to try to get it to fail.

kng commented 1 year ago

as you said, it'll produce 0.000000e+00 + 1j 0.000000e+00 regardless...

daniestevez commented 1 year ago

Awesome. I now have a test that breaks even on amd64 with -ffast-math (without -ffast-math it does work).

#include <complex.h>
#include <stdio.h>

float complex __attribute__ ((noinline)) divide(float complex z, float complex w) {
    return z / w;
}

int main(int argc, char **argv) {
    float complex z = 0.0f;
    float complex w = 1e-20f;
    float complex x = divide(z, w);
    printf("%e + 1j %e\n", creal(x), cimag(x));    
    return 0;
}

Let me know if this works with -O3 and with -O3 -ffast-math.

daniestevez commented 1 year ago

Same results on the Beaglebone Black. With -O3 -ffast-math it gives nan, while with -O3 only it gives zero as expected.

kng commented 1 year ago

yes, -ffast-math gives nan, the others ~0

daniestevez commented 1 year ago

Could it be that your libvolk2-bin debian package was built with -ffast-math, I wonder? Where does this package come from exactly? If I'm able to grab that exact .deb package, I might look at the assembly code and compare it to what I get with and without -ffast-math on arm32.

daniestevez commented 1 year ago

For the record, on arm32, this is the divide function built with -O3 -ffast-math:

0000062c <divide>:
 62c:   ee20 7a21   vmul.f32    s14, s0, s3
 630:   ee61 7aa1   vmul.f32    s15, s3, s3
 634:   ee60 1aa1   vmul.f32    s3, s1, s3
 638:   ee41 7a01   vmla.f32    s15, s2, s2
 63c:   ee10 7a81   vnmls.f32   s14, s1, s2
 640:   ee40 1a01   vmla.f32    s3, s0, s2
 644:   eec7 0a27   vdiv.f32    s1, s14, s15
 648:   ee81 0aa7   vdiv.f32    s0, s3, s15
 64c:   4770        bx  lr
 64e:   bf00        nop

And this is with -O3 only:

00000614 <divide>:
 614:   f000 b800   b.w 618 <__divsc3>

00000618 <__divsc3>:
 618:   eeb0 7ac1   vabs.f32    s14, s2
 61c:   eef0 7ae1   vabs.f32    s15, s3
 620:   eeb4 7ae7   vcmpe.f32   s14, s15
 624:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 628:   d526        bpl.n   678 <__divsc3+0x60>
 62a:   eec1 7a21   vdiv.f32    s15, s2, s3
 62e:   eef0 6a61   vmov.f32    s13, s3
 632:   eef0 5a60   vmov.f32    s11, s1
 636:   eeb0 6a40   vmov.f32    s12, s0
 63a:   ee41 6a27   vmla.f32    s13, s2, s15
 63e:   ee47 5a80   vmla.f32    s11, s15, s0
 642:   ee17 6aa0   vnmls.f32   s12, s15, s1
 646:   ee85 7aa6   vdiv.f32    s14, s11, s13
 64a:   eec6 7a26   vdiv.f32    s15, s12, s13
 64e:   eef4 7a67   vcmp.f32    s15, s15
 652:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 656:   eeb4 7a47   vcmp.f32    s14, s14
 65a:   bf14        ite ne
 65c:   2301        movne   r3, #1
 65e:   2300        moveq   r3, #0
 660:   f003 0301   and.w   r3, r3, #1
 664:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 668:   bf08        it  eq
 66a:   2300        moveq   r3, #0
 66c:   b9bb        cbnz    r3, 69e <__divsc3+0x86>
 66e:   eeb0 0a47   vmov.f32    s0, s14
 672:   eef0 0a67   vmov.f32    s1, s15
 676:   4770        bx  lr
 678:   eec1 7a81   vdiv.f32    s15, s3, s2
 67c:   eef0 6a41   vmov.f32    s13, s2
 680:   eef0 5a40   vmov.f32    s11, s0
 684:   eeb0 6a60   vmov.f32    s12, s1
 688:   ee41 6aa7   vmla.f32    s13, s3, s15
 68c:   ee40 5aa7   vmla.f32    s11, s1, s15
 690:   ee00 6a67   vmls.f32    s12, s0, s15
 694:   ee85 7aa6   vdiv.f32    s14, s11, s13
 698:   eec6 7a26   vdiv.f32    s15, s12, s13
 69c:   e7d7        b.n 64e <__divsc3+0x36>
 69e:   eeb5 1a40   vcmp.f32    s2, #0.0
 6a2:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 6a6:   eef5 1a40   vcmp.f32    s3, #0.0
 6aa:   bf0c        ite eq
 6ac:   2301        moveq   r3, #1
 6ae:   2300        movne   r3, #0
 6b0:   f003 0301   and.w   r3, r3, #1
 6b4:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 6b8:   bf18        it  ne
 6ba:   2300        movne   r3, #0
 6bc:   b1c3        cbz r3, 6f0 <__divsc3+0xd8>
 6be:   eef4 0a60   vcmp.f32    s1, s1
 6c2:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 6c6:   d004        beq.n   6d2 <__divsc3+0xba>
 6c8:   eeb4 0a40   vcmp.f32    s0, s0
 6cc:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 6d0:   d10e        bne.n   6f0 <__divsc3+0xd8>
 6d2:   ee11 3a10   vmov    r3, s2
 6d6:   ed9f 7a73   vldr    s14, [pc, #460] ; 8a4 <__divsc3+0x28c>
 6da:   eddf 7a73   vldr    s15, [pc, #460] ; 8a8 <__divsc3+0x290>
 6de:   2b00        cmp r3, #0
 6e0:   bfb8        it  lt
 6e2:   eef0 7a47   vmovlt.f32  s15, s14
 6e6:   ee20 7a27   vmul.f32    s14, s0, s15
 6ea:   ee60 7aa7   vmul.f32    s15, s1, s15
 6ee:   e7be        b.n 66e <__divsc3+0x56>
 6f0:   ee30 6a40   vsub.f32    s12, s0, s0
 6f4:   eeb4 0a40   vcmp.f32    s0, s0
 6f8:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 6fc:   eeb4 6a46   vcmp.f32    s12, s12
 700:   bf0c        ite eq
 702:   2301        moveq   r3, #1
 704:   2300        movne   r3, #0
 706:   f003 0301   and.w   r3, r3, #1
 70a:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 70e:   bf08        it  eq
 710:   2300        moveq   r3, #0
 712:   2b00        cmp r3, #0
 714:   d17c        bne.n   810 <__divsc3+0x1f8>
 716:   ee70 6ae0   vsub.f32    s13, s1, s1
 71a:   eef4 0a60   vcmp.f32    s1, s1
 71e:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 722:   eef4 6a66   vcmp.f32    s13, s13
 726:   bf0c        ite eq
 728:   2201        moveq   r2, #1
 72a:   2200        movne   r2, #0
 72c:   f002 0201   and.w   r2, r2, #1
 730:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 734:   bf08        it  eq
 736:   2200        moveq   r2, #0
 738:   2a00        cmp r2, #0
 73a:   d169        bne.n   810 <__divsc3+0x1f8>
 73c:   ee71 6a41   vsub.f32    s13, s2, s2
 740:   eeb4 1a41   vcmp.f32    s2, s2
 744:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 748:   eef4 6a66   vcmp.f32    s13, s13
 74c:   bf0c        ite eq
 74e:   2301        moveq   r3, #1
 750:   2300        movne   r3, #0
 752:   f003 0301   and.w   r3, r3, #1
 756:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 75a:   bf08        it  eq
 75c:   2300        moveq   r3, #0
 75e:   b9a3        cbnz    r3, 78a <__divsc3+0x172>
 760:   ee71 6ae1   vsub.f32    s13, s3, s3
 764:   eef4 1a61   vcmp.f32    s3, s3
 768:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 76c:   eef4 6a66   vcmp.f32    s13, s13
 770:   bf0c        ite eq
 772:   2301        moveq   r3, #1
 774:   2300        movne   r3, #0
 776:   f003 0301   and.w   r3, r3, #1
 77a:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 77e:   bf08        it  eq
 780:   2300        moveq   r3, #0
 782:   2b00        cmp r3, #0
 784:   f43f af73   beq.w   66e <__divsc3+0x56>
 788:   2300        movs    r3, #0
 78a:   eeb4 6a46   vcmp.f32    s12, s12
 78e:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 792:   f47f af6c   bne.w   66e <__divsc3+0x56>
 796:   ee70 6ae0   vsub.f32    s13, s1, s1
 79a:   eef4 6a66   vcmp.f32    s13, s13
 79e:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 7a2:   f47f af64   bne.w   66e <__divsc3+0x56>
 7a6:   2b00        cmp r3, #0
 7a8:   ee11 3a10   vmov    r3, s2
 7ac:   eef7 7a00   vmov.f32    s15, #112   ; 0x3f800000  1.0
 7b0:   eddf 6a3e   vldr    s13, [pc, #248] ; 8ac <__divsc3+0x294>
 7b4:   bf18        it  ne
 7b6:   eef0 6a67   vmovne.f32  s13, s15
 7ba:   eef4 1a61   vcmp.f32    s3, s3
 7be:   eef0 6ae6   vabs.f32    s13, s13
 7c2:   2b00        cmp r3, #0
 7c4:   ee71 7ae1   vsub.f32    s15, s3, s3
 7c8:   bfb8        it  lt
 7ca:   eef1 6a66   vneglt.f32  s13, s13
 7ce:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 7d2:   d104        bne.n   7de <__divsc3+0x1c6>
 7d4:   eef4 7a67   vcmp.f32    s15, s15
 7d8:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 7dc:   d15b        bne.n   896 <__divsc3+0x27e>
 7de:   eddf 7a33   vldr    s15, [pc, #204] ; 8ac <__divsc3+0x294>
 7e2:   ee11 3a90   vmov    r3, s3
 7e6:   eef0 7ae7   vabs.f32    s15, s15
 7ea:   ed9f 6a30   vldr    s12, [pc, #192] ; 8ac <__divsc3+0x294>
 7ee:   2b00        cmp r3, #0
 7f0:   bfb8        it  lt
 7f2:   eef1 7a67   vneglt.f32  s15, s15
 7f6:   ee20 7aa7   vmul.f32    s14, s1, s15
 7fa:   ee60 7a27   vmul.f32    s15, s0, s15
 7fe:   ee00 7a26   vmla.f32    s14, s0, s13
 802:   ee50 7aa6   vnmls.f32   s15, s1, s13
 806:   ee27 7a06   vmul.f32    s14, s14, s12
 80a:   ee67 7a86   vmul.f32    s15, s15, s12
 80e:   e72e        b.n 66e <__divsc3+0x56>
 810:   ee71 6a41   vsub.f32    s13, s2, s2
 814:   eef4 6a66   vcmp.f32    s13, s13
 818:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 81c:   d190        bne.n   740 <__divsc3+0x128>
 81e:   ee71 6ae1   vsub.f32    s13, s3, s3
 822:   eef4 6a66   vcmp.f32    s13, s13
 826:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 82a:   d19b        bne.n   764 <__divsc3+0x14c>
 82c:   2b00        cmp r3, #0
 82e:   ee10 3a10   vmov    r3, s0
 832:   eef7 7a00   vmov.f32    s15, #112   ; 0x3f800000  1.0
 836:   eddf 6a1d   vldr    s13, [pc, #116] ; 8ac <__divsc3+0x294>
 83a:   bf18        it  ne
 83c:   eef0 6a67   vmovne.f32  s13, s15
 840:   eef4 0a60   vcmp.f32    s1, s1
 844:   eef0 6ae6   vabs.f32    s13, s13
 848:   2b00        cmp r3, #0
 84a:   ee70 7ae0   vsub.f32    s15, s1, s1
 84e:   bfb8        it  lt
 850:   eef1 6a66   vneglt.f32  s13, s13
 854:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 858:   d104        bne.n   864 <__divsc3+0x24c>
 85a:   eef4 7a67   vcmp.f32    s15, s15
 85e:   eef1 fa10   vmrs    APSR_nzcv, fpscr
 862:   d11b        bne.n   89c <__divsc3+0x284>
 864:   ed9f 6a11   vldr    s12, [pc, #68]  ; 8ac <__divsc3+0x294>
 868:   ee10 3a90   vmov    r3, s1
 86c:   eeb0 6ac6   vabs.f32    s12, s12
 870:   eddf 5a0d   vldr    s11, [pc, #52]  ; 8a8 <__divsc3+0x290>
 874:   ee61 7aa6   vmul.f32    s15, s3, s13
 878:   2b00        cmp r3, #0
 87a:   bfb8        it  lt
 87c:   eeb1 6a46   vneglt.f32  s12, s12
 880:   ee21 7a86   vmul.f32    s14, s3, s12
 884:   ee51 7a06   vnmls.f32   s15, s2, s12
 888:   ee01 7a26   vmla.f32    s14, s2, s13
 88c:   ee67 7aa5   vmul.f32    s15, s15, s11
 890:   ee27 7a25   vmul.f32    s14, s14, s11
 894:   e6eb        b.n 66e <__divsc3+0x56>
 896:   eef7 7a00   vmov.f32    s15, #112   ; 0x3f800000  1.0
 89a:   e7a2        b.n 7e2 <__divsc3+0x1ca>
 89c:   eeb7 6a00   vmov.f32    s12, #112   ; 0x3f800000  1.0
 8a0:   e7e2        b.n 868 <__divsc3+0x250>
 8a2:   bf00        nop
 8a4:   ff800000    .word   0xff800000
 8a8:   7f800000    .word   0x7f800000
 8ac:   00000000    .word   0x00000000

(I'm surprised by the fact that the non-fast-math function is huge).

kng commented 1 year ago

Not sure where I got it, but I found this in the log:

Selecting previously unselected package libvolk2.4:arm64.
Preparing to unpack .../106-libvolk2.4_2.4.1-2_arm64.deb ...
Unpacking libvolk2.4:arm64 (2.4.1-2) ...
Selecting previously unselected package libvolk2-bin.
Preparing to unpack .../107-libvolk2-bin_2.4.1-2_arm64.deb ...
Unpacking libvolk2-bin (2.4.1-2) ...
Selecting previously unselected package libvolk2-dev:arm64.
Preparing to unpack .../220-libvolk2-dev_2.4.1-2_arm64.deb ...
Unpacking libvolk2-dev:arm64 (2.4.1-2) ...
Setting up libvolk2.4:arm64 (2.4.1-2) ...
Setting up libvolk2-bin (2.4.1-2) ...
Setting up libvolk2-dev:arm64 (2.4.1-2) ...

and that is from these sources

deb http://deb.debian.org/debian bullseye main contrib non-free
deb http://security.debian.org/debian-security bullseye-security main contrib non-free
deb http://deb.debian.org/debian bullseye-updates main contrib non-free
deb http://archive.raspberrypi.org/debian/ bullseye main
deb https://download.docker.com/linux/debian bullseye stable
kng commented 1 year ago

Looks like fast math is cheating :D That's quite the difference. And yeah, no wonder edge cases get thrown out the window...

daniestevez commented 1 year ago

I think you're using this deb package, and downloaded the version for arm64. Can you check if the libvolk.so I got matches the one in your rpi4?

$ sha256sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4 
a0a6a47dd20f2435519a8cc225206dd002e3645908d6aea72b2e578a63ec0d8e  usr/lib/aarch64-linux-gnu/libvolk.so.2.4
kng commented 1 year ago
$ sha256sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4
a0a6a47dd20f2435519a8cc225206dd002e3645908d6aea72b2e578a63ec0d8e  usr/lib/aarch64-linux-gnu/libvolk.so.2.4
$ md5sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4
db9bf2ef4d40b677fd5f2dd0f2448dae  usr/lib/aarch64-linux-gnu/libvolk.so.2.4
daniestevez commented 1 year ago

Perfect. If you want to join the disassembly fun (I don't even have an objdump for aarch64 around, so I'm using this online tool for the first time).

kng commented 1 year ago

For the kids following along at home ... me eating a candle :D

daniestevez commented 1 year ago

I'm not doing any better xD. I don't quite understand how libvolk is built, and where the kernels go in the machine code (there's volk_32fc_x2_divide_32fc_manual and volk_32fc_x2_divide_32fc_get_func_desc, but I don't know the roles they play).

Pragmatically, I think you can solve your original problem by locating the rms_agc.py file in your gr-satellites install and replacing the 1e-20 with something larger that works (I think you said 5.5e-20).

Personally, I'd like to understand better why this problem happens, to check whether something else outside gr-satellites should be fixed as well, and whether 5.5e-20 is a safe value or could haunt us again. But I might stop soon and continue with this another day, given how late it is now.

kng commented 1 year ago

Agreed on the late hour, it's been a pleasure nonetheless and I'll try the 'fix' tomorrow and report back (:

daniestevez commented 1 year ago

Thanks!

If you also want to do another test with the small C program, try building it with -O3 -fcx-limited-range. I've found that Volk is built with this flag, and the CMakeLists.txt includes the comment "Disable complex math NaN/INFO range checking for performance".

I've tested -fcx-limited-range on my arm32, and it works (it gives zero rather than nan), but maybe in arm64 it does something different.

kng commented 1 year ago
gcc -o test test.c -O3 -fcx-limited-range
./test
0.000000e+00 + 1j 0.000000e+00
Aang23 commented 1 year ago

@daniestevez I would perhaps suggest checking for slight differences between the arm64/amd64/generic kernels in Volk. I've had to deal with a similar problem recently, caused by slightly different precision between implementations. Nothing breaking the DSP, but enough to cause some code to get stuck on NaN. I ended up fixing it by writing the code in a way that would just discard the NaN on the next run. I haven't tested this kernel personally, but I would say it's not unlikely.

@kng can you perhaps try reproducing the issue with the offending kernel set to "generic generic" in ~/.volk/volk_config? This will make it use the generic implementation, instead of anything architecture-specific.

daniestevez commented 1 year ago

Just so I don't lose it: the same kind of division test, (0.0 + 1j 0.0) / (1e-20 + 1j 0.0), but using the Volk kernel:

#include <volk/volk.h>
#include <complex.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    size_t alignment = volk_get_alignment();
    float complex *z = volk_malloc(512 * sizeof(float complex), alignment);
    float complex *w = volk_malloc(512 * sizeof(float complex), alignment);
    float complex *x = volk_malloc(512 * sizeof(float complex), alignment);
    memset(z, 0, 512 * sizeof(float complex));
    memset(w, 0, 512 * sizeof(float complex));
    z[0] = 0.0f;
    w[0] = 1e-20f;
    volk_32fc_x2_divide_32fc(x, z, w, 512);
    volk_32fc_x2_divide_32fc(x, z, w, 512);
    printf("%e + 1j %e\n", creal(x[0]), cimag(x[0]));
    return 0;
}

Build with gcc -O3 -Wall -o test_divide_volk test_divide_volk.c -lvolk.

For me it works both on amd64 and on arm32 (on arm32 I'm using a really old version of volk: 1.3, from Debian stretch).

kng commented 1 year ago
gcc -O3 -Wall -o test_divide_volk test_divide_volk.c -lvolk
./test_divide_volk
nan + 1j nan