kng closed this issue 1 year ago.
Having the `--wavfile` run never finish looks like a serious bug. I wonder why the run with `--rawint16` finishes just fine, as there is not much difference between the two, other than `--wavfile` using a WAV File Source (which uses libsndfile) and `--rawint16` using a File Source.
Given that the output of `--dump_path` grows indefinitely with `--wavfile`, I'm inclined to think that there is a bug either in the WAV File Source or in libsndfile that is making the file loop indefinitely rather than reading it once and finishing. Can you try a simple flowgraph to check if this is the case? You only need a WAV File Source set to "Repeat: No", with its output sent to a File Sink. Run this flowgraph using your test WAV file as input (you can run it in no-GUI mode) and check if the output file grows indefinitely.
Unfortunately, I don't think that a potential bug in how the WAV files are read is related to your original problem about high CPU load. Maybe we're dealing with two different problems here, but to me it makes more sense to first deal with the one about the WAV file.
Unfortunately all my rpis run headless at the moment, so that's hard to set up. I could build a flowgraph that replays one of the iq_dump files from the satnogs observations to UDP, so it can be tested live, so to say. I also don't have gr-satellites installed on the system, only in a container.
Also, the wav/rawint is probably just a trigger for the bug, as in real usage I'm doing `--udp --iq`, as seen here. The input path in gr_satellites is a bit different for all of these, more than just the source type.
It actually runs just fine with other satellites/modes; I have narrowed it down to a few that run bpsk-style flowgraphs, including lilacsat-2, by70-2 and a few more.
You can run this test on a headless system. You just need gnuradio-companion on a Desktop machine (if possible, a similar version of GNU Radio, to avoid problems). No need to have gr-satellites installed, since the test flowgraph I proposed only uses the in-tree "WAV File Source" and "File Sink" blocks. Set this flowgraph to "No GUI" in the generate options, generate the Python script, move the script to the rpi and run it there.
> It actually runs just fine with other satellites/modes; I have narrowed it down to a few that run bpsk-style flowgraphs, including lilacsat-2, by70-2 and a few more.
Have you tried supplying the input by UDP to the flowgraph at a slower rate than real time (say at x0.1)? It might be that there is a performance bottleneck for some particular demodulators or deframers that makes the embedded system unable to keep up in real time.
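Pacing the UDP input below real time only takes a few lines of Python. A minimal sketch (not the actual iqdump2udp.py; the function name and parameters are mine, and it assumes raw complex64 samples of 8 bytes each):

```python
import socket
import time

def replay_udp(path, samp_rate, port, speed=1.0, chunk=1024, host="127.0.0.1"):
    """Send a raw IQ dump over UDP, paced at `speed` times real time.

    speed=0.1 replays ten times slower than real time.
    Assumes complex64 samples (8 bytes per sample).
    """
    delay = chunk / (samp_rate * speed)  # seconds between datagrams
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk * 8)  # read `chunk` complex64 samples
            if not data:
                break
            sock.sendto(data, (host, port))
            time.sleep(delay)
    sock.close()
```

Something like `replay_udp('iq_dump.raw', 76800, 57356, speed=0.1)` would then feed the flowgraph at x0.1.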
So I made a small flowgraph to replay a satnogs iq dump, here named iqdump2udp.py. Perhaps temporary; link to a sample iq dump.
Run these in two terminals on the same host/container:
./iqdump2udp.py -f iq_6751281_76800.raw -s 76800
gr_satellites 45857 --samp_rate 76800 --iq --udp --udp_raw --udp_port 57356
This seems to work fine on an amd64 system, but struggles as expected on arm64. Even outputting a few seconds at 768 sps will make gr_satellites consume all CPU seemingly forever.
Can you add `--dump_path` to that and check if the output size grows too much? You can compare the output size between running on amd64 and arm64. It should be very similar (perhaps not exactly the same, because the symbol clock recovery might behave differently due to numerical errors or other factors). If you get a much larger output on arm64, or if it seems to grow way past the size produced on amd64, something is definitely wrong here. I have the impression that more data than we feed in is getting produced/duplicated somehow, yet I still don't understand how this happens.
Repeated the previous test, feeding a small amount of samples and then terminating that. Letting gr_satellites run, it didn't complain about any dropped frames. Letting the --dump_path fill up to about 1GB gives the following result:
total 1.1G
-rw-r--r-- 1 pi pi 32K Nov 18 00:12 agc_in.c64
-rw-r--r-- 1 pi pi 32K Nov 18 00:12 agc_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_err.f32
-rw-r--r-- 1 pi pi 211M Nov 18 00:13 clock_recovery_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_T_avg.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 clock_recovery_T_inst.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_error.f32
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_frequency.f32
-rw-r--r-- 1 pi pi 211M Nov 18 00:13 costas_out.c64
-rw-r--r-- 1 pi pi 106M Nov 18 00:13 costas_phase.f32
-rw-r--r-- 1 pi pi 16K Nov 18 00:12 fll_error.f32
-rw-r--r-- 1 pi pi 16K Nov 18 00:12 fll_freq.f32
-rw-r--r-- 1 pi pi 16K Nov 18 00:12 fll_phase.f32
This is running on an up-to-date rpi4 with debian bullseye, with gr-satellites 3.15 from the satnogs repo, no docker.
That listing is very useful, thanks. We can see that the AGC and FLL outputs are small (probably the same size as those obtained in amd64), while the clock recovery (Symbol Sync block) and Costas loop outputs are very large (and presumably would keep growing indefinitely).
The Costas loop is downstream of the Symbol Sync block, so I bet the culprit here is the Symbol Sync block. It seems to be producing an infinite amount of output, and the Costas loop simply processes the infinite input it's given. Since the Symbol Sync performs "variable rate resampling", it's not impossible that if it's terribly confused then it could generate an infinite amount of output from a finite input (for example if it believes that to get the next symbol it's okay to step zero input samples).
This starts to seem more like a GNU Radio bug (probably an architecture-dependent bug), rather than a gr-satellites bug. I would suggest trying to produce a minimal flowgraph that replicates the issue, and then open up an issue in the gnuradio repo. Unfortunately I don't have an arm64 system at hand, so I can't help much.
You mentioned that the FSK satellites seem to work and that the problem is with the BPSK satellites. This makes sense to me. The FSK Demodulator and the BPSK Demodulator components both use the Symbol Sync block, but with different configurations. The FSK Demodulator uses a Gardner TED and an 8-tap MMSE resampler. The BPSK Demodulator uses an ML TED and a polyphase MF resampler. The latter configuration is probably the one that is broken.
I could try to make a minimal flowgraph that replicates the problem: generate a finite input (either zeros or random noise), send that into a Symbol Sync configured as the BPSK Demodulator does, and then send the output to a file. But you'll need to test this for me.
Something else: GNU Radio 3.8 is pretty much end-of-life these days, so folks will probably not fix this bug in GNU Radio 3.8. Can you try to see if the problem also happens with GNU Radio 3.10? If it doesn't, then problem solved, I guess. If the problem still happens, then we have much better chances of getting this fixed.
More ideas to debug: in the past I have found issues with the symbol recovery blocks when there are `NaN`s (and perhaps also infinities, or other weird floats) in their input. The `NaN`s propagate through all the loop calculations, and eventually the loop error is cast to an `int` to compute how many samples or polyphase filter arms to advance. Casting `NaN` to an `int` gives an absurd value, and this would typically cause a segfault when the `int` is used as the index into an array.

I'm thinking that this is the kind of thing that could be architecture-dependent in subtle ways. So maybe it's worth checking that there are no `NaN`s or other weird floats at the input of the Symbol Sync.
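A quick numpy illustration of why a single bad sample is so damaging to a tracking loop (the values are illustrative, not the actual Symbol Sync code):

```python
import numpy as np

# One NaN input poisons every later arithmetic step of a loop update
err = np.float32(np.nan)
t_inst = err * np.float32(0.01) + np.float32(10.0)  # still NaN
assert np.isnan(t_inst)

# Comparisons with NaN are always False, so a range clamp such as
# `if t_inst > t_max` silently fails to catch the bad value
assert not (t_inst > 0)
assert not (t_inst < 1e6)
```

Once a NaN like this is cast to an int index, anything can happen.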
To do this in an easy way, it's better to disable the FLL by using `--disable_fll`. Then the AGC will be directly connected to the Symbol Sync, so the input to the Symbol Sync will be the `agc_out.c64` file. It might happen that with `--disable_fll` the problem disappears. This would strongly point to the FLL as the culprit. But probably the problem will still persist with the FLL disabled. In that case, can you either check that there are no NaN's, infinities, etc. in `agc_out.c64` (it's a complex64 file), or alternatively attach the file here so that I can check it?
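Checking the dump for non-finite values takes a couple of lines of numpy. A sketch (the helper name is mine):

```python
import numpy as np

def count_nonfinite(path):
    """Count NaN/inf samples in a complex64 dump file."""
    x = np.fromfile(path, dtype='complex64')
    # np.isfinite() on complex data is True only when both the real
    # and imaginary parts are finite
    return int((~np.isfinite(x)).sum())
```

`count_nonfinite('dump/agc_out.c64')` should return 0 on a healthy run.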
Tested running `gr_satellites 45857 --samp_rate 76800 --iq --udp --udp_raw --udp_port 57356 --disable_fll --dump_path dump` with the same short burst of udp data. It resulted in pretty much the same load scenario, and filled up the dump path to about 1GB before I terminated it.
total 1.1G
-rw-r--r-- 1 pi pi 32K Nov 18 21:56 agc_in.c64
-rw-r--r-- 1 pi pi 32K Nov 18 21:56 agc_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_err.f32
-rw-r--r-- 1 pi pi 221M Nov 18 21:56 clock_recovery_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_T_avg.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 clock_recovery_T_inst.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_error.f32
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_frequency.f32
-rw-r--r-- 1 pi pi 221M Nov 18 21:56 costas_out.c64
-rw-r--r-- 1 pi pi 111M Nov 18 21:56 costas_phase.f32
Great. Can you share the `agc_out.c64` file here?
agc_inout.zip (had to compress it to attach; contains both in/out).
In [1]: import numpy as np
In [2]: agc_out = np.fromfile('agc_out.c64', 'complex64')
In [4]: agc_out
Out[4]:
array([ nan +nanj, nan +nanj,
nan +nanj, ..., -0.07814392-0.05673159j,
-0.1532042 +0.32974455j, -0.45565328+1.1761963j ], dtype=complex64)
These `nan`s don't look good.

It's not going to be a GNU Radio bug, really (well, arguably the Symbol Sync block shouldn't do silly things if it gets `nan`s in the input). This is a huge déjà vu of a bug in the gr-satellites AGC that I solved many years ago.
In [10]: agc_in[:20]
Out[10]:
array([0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
0.0000000e+00+0.0000000e+00j, 0.0000000e+00+0.0000000e+00j,
2.7911843e-22-1.8607895e-22j, 9.8180180e-08-6.5453456e-08j],
dtype=complex64)
In [11]: agc_out[:20]
Out[11]:
array([ nan +nanj, nan +nanj, nan +nanj,
nan +nanj, nan +nanj, nan +nanj,
nan +nanj, nan +nanj, nan +nanj,
nan +nanj, nan +nanj, nan +nanj,
nan +nanj, nan +nanj, nan +nanj,
nan +nanj, nan +nanj, nan +nanj,
-inf +infj, 16.641005-11.094004j], dtype=complex64)
The problem is that the input starts with 18 zeros, for which the AGC produces `nan` as output. This basically happens because of `0/0`. The AGC adds a tiny number to the denominator to avoid `0/0`. But maybe that number is too tiny for arm64. Or something is built with `-ffast-math`. I'm going to investigate a bit more to try to understand how this can happen on arm64 (I'm confident that on amd64 it doesn't happen, and I'm going to check it now).
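For reference, this is roughly the guard in question. Under plain IEEE-754 arithmetic (which numpy uses) the tiny offset is enough, so something beyond ordinary float math must be going on. A sketch of the idea, not the actual rms_agc code:

```python
import numpy as np

x = np.zeros(18, dtype=np.complex64)  # input starts with zeros
rms = np.sqrt(np.mean(np.abs(x) ** 2)).astype(np.float32)  # 0.0 here
denom = rms + np.float32(1e-20)  # tiny offset avoids an exact 0/0
out = x / denom  # 0 / 1e-20 == 0 under IEEE rules, no NaN

assert denom > 0
assert not np.isnan(out).any()
```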
Ok, sounds like it's possible to test, even with larger tiny numbers. Please advise how to modify and test this.
Here is a test flowgraph. It feeds 16 zeros to the RMS AGC block from gr-satellites and then prints the output on the screen.
On my amd64 machine this prints:
$ ./agc_test.py
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
agc_test_output = [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]
I'm doing this in GNU Radio 3.10, but hopefully it will work in your GNU Radio 3.8 without modifications. You can scp it to the rpi4 and run `grcc agc_test.grc` to compile it to a Python script.

Can you test if this gives `nan` on the rpi4?
grcc agc_test.grc
>>> Warning: vocoder_codec2_decode_ps - option_attributes are for enums only, ignoring
>>> Warning: vocoder_codec2_encode_sp - option_attributes are for enums only, ignoring
<<< Welcome to GNU Radio Companion Compiler 3.8.2.0 >>>
Block paths:
/usr/share/gnuradio/grc/blocks
>>> Loading: /home/pi/iqdump/agc_test.grc
>>> Generating: /home/pi/iqdump/agc_test.py
>>> Warning: This flow graph may not have flow control: no audio or RF hardware blocks found. Add a Misc->Throttle block to your flow graph to avoid CPU congestion.
./agc_test.py
agc_test_output = ((nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj))
Great. As expected, arm64 gives nan. We're narrowing the problem.
Here is another simple test.
This generates 16 zeros and adds the small number `1e-20` to them using the Add Const block. This is what would happen in the RMS AGC block to avoid getting exactly zero in the denominator of the AGC.

On my amd64 machine this gives almost `1e-20`, as expected.
$ ./add_const_test.py
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
add_const_test_output = [9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21]
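As an aside, the printed value is just `1e-20` rounded to single precision, so the Add Const block is behaving exactly as expected here:

```python
import numpy as np

# 1e-20 is not exactly representable in float32; the nearest float32,
# converted back to double for printing, is 9.999999682655225e-21
out = np.zeros(16, dtype=np.float32) + np.float32(1e-20)
assert abs(float(out[0]) - 9.999999682655225e-21) < 1e-35
assert out[0] > 0
```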
Can you test on the rpi4? I wouldn't be surprised if this gives exactly zero instead of `1e-20`.
./add_const_test.py
add_const_test_output = (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)
In case this becomes relevant: I've just noticed that in GNU Radio 3.8 Add Const doesn't use Volk (it uses a partially unrolled C++ loop), while in GNU Radio 3.10 it uses the `volk_32f_x2_add_32f` kernel.
add_const_test_output = (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)
Ok. That's slightly surprising. I'm going to do a better test that captures and prints all the intermediate calculations of the RMS AGC. Stand by a few minutes.
Here is the test:
And this is what I get on my amd64 machine:
$ ./rms_agc_test.py
vmcircbuf_sysconfig :info: Using gr::vmcircbuf_sysv_shm_factory
in_data [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]
rms_out [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
multiply_const_out [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
add_const_out [9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21]
float_to_complex_out [(9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j)]
divide_out [0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j]
./rms_agc_test.py
in_data (0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j)
rms_out (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
multiply_const_out (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
add_const_out (9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21, 9.999999682655225e-21)
float_to_complex_out ((9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j), (9.999999682655225e-21+0j))
divide_out ((nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj), (nan+nanj))
Okay. So it's the Divide block that is broken. This should be computing `0j / (9.999999682655225e-21+0j)`, which is `0`, but it's giving `nan` instead. I'm going to take a look at the code to see where to put the blame.

Meanwhile, if you want to play, you can change the number `1e-20` in the last test (locate it in the .py or the .grc) and make it larger until `divide_out` doesn't give `nan`. Maybe knowing where this breaks will give us a useful clue.
The Divide block for `gr_complex` data hasn't changed at all between 3.8 and 3.10. In both cases it uses `volk_32fc_x2_divide_32fc`.

Can you paste your `$HOME/.volk/volk_config` here? Also, which version of Volk are you running?
`self.blocks_add_const_vxx_0 = blocks.add_const_ff(5.5e-20)` produces `divide_out (0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j, 0j)`, and 5.4e-20 gives NaN.
ii libvolk2-bin 2.4.1-2 arm64 vector optimized runtime tools
ii libvolk2-dev:arm64 2.4.1-2 arm64 vector optimized function headers
un libvolk2-doc <none> <none> (no description available)
ii libvolk2.4:arm64 2.4.1-2 arm64 vector optimized functions
I thought I just ran volk_profile, but that might have been on another system. running now...
If you didn't have `$HOME/.volk/volk_config`, I think that Volk just uses the generic kernel.

Rather than running `volk_profile` (which could take quite a while, at least from what I'm used to on slower arm32 boards), I prefer that you edit your `volk_config` by hand and enter the following contents (these are the sole contents of the file):

volk_32fc_x2_divide_32fc generic

Then run the test. Then replace the `volk_config` contents by

volk_32fc_x2_divide_32fc neon

With these we will check the generic and the NEON kernels, and see if they make any difference.
That was fast! Nice rpi4.
I see the following:
volk_32fc_x2_divide_32fc neon neon

This makes me realize that I messed up the syntax in my previous comment. We need to list 2 kernels: the one for aligned data and the one for unaligned data.

Can you run `rms_agc_test.py` again to ensure that it fails when using the NEON kernel, and then modify that line in `volk_config` to

volk_32fc_x2_divide_32fc generic generic

and run `rms_agc_test.py` to check what happens with the generic kernel?
volk_profile takes ~5 min to plow through, not sure if it's iterations or time. I tried all four combinations and they produce identical results according to diff (with 1e-20).
Since the generic test (which is plain old division of complex numbers in C) fails, as a sanity check, take the following simple program:

Edit: Disregard this program. The program is so simple that gcc already knows that the answer is 0 and doesn't even bother to do the math. I'll paste now something more sophisticated to force it.
```c
#include <complex.h>
#include <stdio.h>

int main(int argc, char **argv) {
    float complex z = 0.0f;
    float complex w = 1e-20f;
    float complex x = z / w;
    printf("%e + 1j %e\n", creal(x), cimag(x));
    return 0;
}
```
Can you build and test this? It should print `0.000000e+00 + 1j 0.000000e+00`. For me it does, both on an amd64 machine and on a Beaglebone Black (arm32).

Can you test this on the rpi4? Try building it with `-O3`, and maybe `-ffast-math` to try to get it to fail.
As you said, it'll produce `0.000000e+00 + 1j 0.000000e+00` regardless...
Awesome. I now have a test that breaks even on amd64 with `-ffast-math` (without `-ffast-math` it does work).
```c
#include <complex.h>
#include <stdio.h>

float complex __attribute__ ((noinline)) divide(float complex z, float complex w) {
    return z / w;
}

int main(int argc, char **argv) {
    float complex z = 0.0f;
    float complex w = 1e-20f;
    float complex x = divide(z, w);
    printf("%e + 1j %e\n", creal(x), cimag(x));
    return 0;
}
```
Let me know if this works with `-O3` and with `-O3 -ffast-math`.
Same results on the Beaglebone Black. With `-O3 -ffast-math` it gives `nan`, while with `-O3` only it gives zero, as expected.
Yes, `-ffast-math` gives nan, the others ~0.
Could it be that your `libvolk2-bin` debian package was built with `-ffast-math`, I wonder? Where does this package come from exactly? If I'm able to grab that exact `.deb` package, I might look at the assembler code and compare it to what I get on arm32 with and without `-ffast-math`.
For the record, on arm32, this is the `divide` function built with `-O3 -ffast-math`:
0000062c <divide>:
62c: ee20 7a21 vmul.f32 s14, s0, s3
630: ee61 7aa1 vmul.f32 s15, s3, s3
634: ee60 1aa1 vmul.f32 s3, s1, s3
638: ee41 7a01 vmla.f32 s15, s2, s2
63c: ee10 7a81 vnmls.f32 s14, s1, s2
640: ee40 1a01 vmla.f32 s3, s0, s2
644: eec7 0a27 vdiv.f32 s1, s14, s15
648: ee81 0aa7 vdiv.f32 s0, s3, s15
64c: 4770 bx lr
64e: bf00 nop
And this is with `-O3` only:
00000614 <divide>:
614: f000 b800 b.w 618 <__divsc3>
00000618 <__divsc3>:
618: eeb0 7ac1 vabs.f32 s14, s2
61c: eef0 7ae1 vabs.f32 s15, s3
620: eeb4 7ae7 vcmpe.f32 s14, s15
624: eef1 fa10 vmrs APSR_nzcv, fpscr
628: d526 bpl.n 678 <__divsc3+0x60>
62a: eec1 7a21 vdiv.f32 s15, s2, s3
62e: eef0 6a61 vmov.f32 s13, s3
632: eef0 5a60 vmov.f32 s11, s1
636: eeb0 6a40 vmov.f32 s12, s0
63a: ee41 6a27 vmla.f32 s13, s2, s15
63e: ee47 5a80 vmla.f32 s11, s15, s0
642: ee17 6aa0 vnmls.f32 s12, s15, s1
646: ee85 7aa6 vdiv.f32 s14, s11, s13
64a: eec6 7a26 vdiv.f32 s15, s12, s13
64e: eef4 7a67 vcmp.f32 s15, s15
652: eef1 fa10 vmrs APSR_nzcv, fpscr
656: eeb4 7a47 vcmp.f32 s14, s14
65a: bf14 ite ne
65c: 2301 movne r3, #1
65e: 2300 moveq r3, #0
660: f003 0301 and.w r3, r3, #1
664: eef1 fa10 vmrs APSR_nzcv, fpscr
668: bf08 it eq
66a: 2300 moveq r3, #0
66c: b9bb cbnz r3, 69e <__divsc3+0x86>
66e: eeb0 0a47 vmov.f32 s0, s14
672: eef0 0a67 vmov.f32 s1, s15
676: 4770 bx lr
678: eec1 7a81 vdiv.f32 s15, s3, s2
67c: eef0 6a41 vmov.f32 s13, s2
680: eef0 5a40 vmov.f32 s11, s0
684: eeb0 6a60 vmov.f32 s12, s1
688: ee41 6aa7 vmla.f32 s13, s3, s15
68c: ee40 5aa7 vmla.f32 s11, s1, s15
690: ee00 6a67 vmls.f32 s12, s0, s15
694: ee85 7aa6 vdiv.f32 s14, s11, s13
698: eec6 7a26 vdiv.f32 s15, s12, s13
69c: e7d7 b.n 64e <__divsc3+0x36>
69e: eeb5 1a40 vcmp.f32 s2, #0.0
6a2: eef1 fa10 vmrs APSR_nzcv, fpscr
6a6: eef5 1a40 vcmp.f32 s3, #0.0
6aa: bf0c ite eq
6ac: 2301 moveq r3, #1
6ae: 2300 movne r3, #0
6b0: f003 0301 and.w r3, r3, #1
6b4: eef1 fa10 vmrs APSR_nzcv, fpscr
6b8: bf18 it ne
6ba: 2300 movne r3, #0
6bc: b1c3 cbz r3, 6f0 <__divsc3+0xd8>
6be: eef4 0a60 vcmp.f32 s1, s1
6c2: eef1 fa10 vmrs APSR_nzcv, fpscr
6c6: d004 beq.n 6d2 <__divsc3+0xba>
6c8: eeb4 0a40 vcmp.f32 s0, s0
6cc: eef1 fa10 vmrs APSR_nzcv, fpscr
6d0: d10e bne.n 6f0 <__divsc3+0xd8>
6d2: ee11 3a10 vmov r3, s2
6d6: ed9f 7a73 vldr s14, [pc, #460] ; 8a4 <__divsc3+0x28c>
6da: eddf 7a73 vldr s15, [pc, #460] ; 8a8 <__divsc3+0x290>
6de: 2b00 cmp r3, #0
6e0: bfb8 it lt
6e2: eef0 7a47 vmovlt.f32 s15, s14
6e6: ee20 7a27 vmul.f32 s14, s0, s15
6ea: ee60 7aa7 vmul.f32 s15, s1, s15
6ee: e7be b.n 66e <__divsc3+0x56>
6f0: ee30 6a40 vsub.f32 s12, s0, s0
6f4: eeb4 0a40 vcmp.f32 s0, s0
6f8: eef1 fa10 vmrs APSR_nzcv, fpscr
6fc: eeb4 6a46 vcmp.f32 s12, s12
700: bf0c ite eq
702: 2301 moveq r3, #1
704: 2300 movne r3, #0
706: f003 0301 and.w r3, r3, #1
70a: eef1 fa10 vmrs APSR_nzcv, fpscr
70e: bf08 it eq
710: 2300 moveq r3, #0
712: 2b00 cmp r3, #0
714: d17c bne.n 810 <__divsc3+0x1f8>
716: ee70 6ae0 vsub.f32 s13, s1, s1
71a: eef4 0a60 vcmp.f32 s1, s1
71e: eef1 fa10 vmrs APSR_nzcv, fpscr
722: eef4 6a66 vcmp.f32 s13, s13
726: bf0c ite eq
728: 2201 moveq r2, #1
72a: 2200 movne r2, #0
72c: f002 0201 and.w r2, r2, #1
730: eef1 fa10 vmrs APSR_nzcv, fpscr
734: bf08 it eq
736: 2200 moveq r2, #0
738: 2a00 cmp r2, #0
73a: d169 bne.n 810 <__divsc3+0x1f8>
73c: ee71 6a41 vsub.f32 s13, s2, s2
740: eeb4 1a41 vcmp.f32 s2, s2
744: eef1 fa10 vmrs APSR_nzcv, fpscr
748: eef4 6a66 vcmp.f32 s13, s13
74c: bf0c ite eq
74e: 2301 moveq r3, #1
750: 2300 movne r3, #0
752: f003 0301 and.w r3, r3, #1
756: eef1 fa10 vmrs APSR_nzcv, fpscr
75a: bf08 it eq
75c: 2300 moveq r3, #0
75e: b9a3 cbnz r3, 78a <__divsc3+0x172>
760: ee71 6ae1 vsub.f32 s13, s3, s3
764: eef4 1a61 vcmp.f32 s3, s3
768: eef1 fa10 vmrs APSR_nzcv, fpscr
76c: eef4 6a66 vcmp.f32 s13, s13
770: bf0c ite eq
772: 2301 moveq r3, #1
774: 2300 movne r3, #0
776: f003 0301 and.w r3, r3, #1
77a: eef1 fa10 vmrs APSR_nzcv, fpscr
77e: bf08 it eq
780: 2300 moveq r3, #0
782: 2b00 cmp r3, #0
784: f43f af73 beq.w 66e <__divsc3+0x56>
788: 2300 movs r3, #0
78a: eeb4 6a46 vcmp.f32 s12, s12
78e: eef1 fa10 vmrs APSR_nzcv, fpscr
792: f47f af6c bne.w 66e <__divsc3+0x56>
796: ee70 6ae0 vsub.f32 s13, s1, s1
79a: eef4 6a66 vcmp.f32 s13, s13
79e: eef1 fa10 vmrs APSR_nzcv, fpscr
7a2: f47f af64 bne.w 66e <__divsc3+0x56>
7a6: 2b00 cmp r3, #0
7a8: ee11 3a10 vmov r3, s2
7ac: eef7 7a00 vmov.f32 s15, #112 ; 0x3f800000 1.0
7b0: eddf 6a3e vldr s13, [pc, #248] ; 8ac <__divsc3+0x294>
7b4: bf18 it ne
7b6: eef0 6a67 vmovne.f32 s13, s15
7ba: eef4 1a61 vcmp.f32 s3, s3
7be: eef0 6ae6 vabs.f32 s13, s13
7c2: 2b00 cmp r3, #0
7c4: ee71 7ae1 vsub.f32 s15, s3, s3
7c8: bfb8 it lt
7ca: eef1 6a66 vneglt.f32 s13, s13
7ce: eef1 fa10 vmrs APSR_nzcv, fpscr
7d2: d104 bne.n 7de <__divsc3+0x1c6>
7d4: eef4 7a67 vcmp.f32 s15, s15
7d8: eef1 fa10 vmrs APSR_nzcv, fpscr
7dc: d15b bne.n 896 <__divsc3+0x27e>
7de: eddf 7a33 vldr s15, [pc, #204] ; 8ac <__divsc3+0x294>
7e2: ee11 3a90 vmov r3, s3
7e6: eef0 7ae7 vabs.f32 s15, s15
7ea: ed9f 6a30 vldr s12, [pc, #192] ; 8ac <__divsc3+0x294>
7ee: 2b00 cmp r3, #0
7f0: bfb8 it lt
7f2: eef1 7a67 vneglt.f32 s15, s15
7f6: ee20 7aa7 vmul.f32 s14, s1, s15
7fa: ee60 7a27 vmul.f32 s15, s0, s15
7fe: ee00 7a26 vmla.f32 s14, s0, s13
802: ee50 7aa6 vnmls.f32 s15, s1, s13
806: ee27 7a06 vmul.f32 s14, s14, s12
80a: ee67 7a86 vmul.f32 s15, s15, s12
80e: e72e b.n 66e <__divsc3+0x56>
810: ee71 6a41 vsub.f32 s13, s2, s2
814: eef4 6a66 vcmp.f32 s13, s13
818: eef1 fa10 vmrs APSR_nzcv, fpscr
81c: d190 bne.n 740 <__divsc3+0x128>
81e: ee71 6ae1 vsub.f32 s13, s3, s3
822: eef4 6a66 vcmp.f32 s13, s13
826: eef1 fa10 vmrs APSR_nzcv, fpscr
82a: d19b bne.n 764 <__divsc3+0x14c>
82c: 2b00 cmp r3, #0
82e: ee10 3a10 vmov r3, s0
832: eef7 7a00 vmov.f32 s15, #112 ; 0x3f800000 1.0
836: eddf 6a1d vldr s13, [pc, #116] ; 8ac <__divsc3+0x294>
83a: bf18 it ne
83c: eef0 6a67 vmovne.f32 s13, s15
840: eef4 0a60 vcmp.f32 s1, s1
844: eef0 6ae6 vabs.f32 s13, s13
848: 2b00 cmp r3, #0
84a: ee70 7ae0 vsub.f32 s15, s1, s1
84e: bfb8 it lt
850: eef1 6a66 vneglt.f32 s13, s13
854: eef1 fa10 vmrs APSR_nzcv, fpscr
858: d104 bne.n 864 <__divsc3+0x24c>
85a: eef4 7a67 vcmp.f32 s15, s15
85e: eef1 fa10 vmrs APSR_nzcv, fpscr
862: d11b bne.n 89c <__divsc3+0x284>
864: ed9f 6a11 vldr s12, [pc, #68] ; 8ac <__divsc3+0x294>
868: ee10 3a90 vmov r3, s1
86c: eeb0 6ac6 vabs.f32 s12, s12
870: eddf 5a0d vldr s11, [pc, #52] ; 8a8 <__divsc3+0x290>
874: ee61 7aa6 vmul.f32 s15, s3, s13
878: 2b00 cmp r3, #0
87a: bfb8 it lt
87c: eeb1 6a46 vneglt.f32 s12, s12
880: ee21 7a86 vmul.f32 s14, s3, s12
884: ee51 7a06 vnmls.f32 s15, s2, s12
888: ee01 7a26 vmla.f32 s14, s2, s13
88c: ee67 7aa5 vmul.f32 s15, s15, s11
890: ee27 7a25 vmul.f32 s14, s14, s11
894: e6eb b.n 66e <__divsc3+0x56>
896: eef7 7a00 vmov.f32 s15, #112 ; 0x3f800000 1.0
89a: e7a2 b.n 7e2 <__divsc3+0x1ca>
89c: eeb7 6a00 vmov.f32 s12, #112 ; 0x3f800000 1.0
8a0: e7e2 b.n 868 <__divsc3+0x250>
8a2: bf00 nop
8a4: ff800000 .word 0xff800000
8a8: 7f800000 .word 0x7f800000
8ac: 00000000 .word 0x00000000
(I'm surprised by the fact that the non-fast-math function is huge).
Not sure where I got it, but I found this in the log:
Selecting previously unselected package libvolk2.4:arm64.
Preparing to unpack .../106-libvolk2.4_2.4.1-2_arm64.deb ...
Unpacking libvolk2.4:arm64 (2.4.1-2) ...
Selecting previously unselected package libvolk2-bin.
Preparing to unpack .../107-libvolk2-bin_2.4.1-2_arm64.deb ...
Unpacking libvolk2-bin (2.4.1-2) ...
Selecting previously unselected package libvolk2-dev:arm64.
Preparing to unpack .../220-libvolk2-dev_2.4.1-2_arm64.deb ...
Unpacking libvolk2-dev:arm64 (2.4.1-2) ...
Setting up libvolk2.4:arm64 (2.4.1-2) ...
Setting up libvolk2-bin (2.4.1-2) ...
Setting up libvolk2-dev:arm64 (2.4.1-2) ...
and that is from these sources
deb http://deb.debian.org/debian bullseye main contrib non-free
deb http://security.debian.org/debian-security bullseye-security main contrib non-free
deb http://deb.debian.org/debian bullseye-updates main contrib non-free
deb http://archive.raspberrypi.org/debian/ bullseye main
deb https://download.docker.com/linux/debian bullseye stable
Looks like fast math is cheating :D That's quite the difference. And yeah, no wonder edge cases get thrown out the window...
I think you're using this deb package, with the arm64 version downloaded. Can you check if the `libvolk.so` I got matches the one on your rpi4?
$ sha256sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4
a0a6a47dd20f2435519a8cc225206dd002e3645908d6aea72b2e578a63ec0d8e usr/lib/aarch64-linux-gnu/libvolk.so.2.4
$ sha256sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4
a0a6a47dd20f2435519a8cc225206dd002e3645908d6aea72b2e578a63ec0d8e usr/lib/aarch64-linux-gnu/libvolk.so.2.4
$ md5sum usr/lib/aarch64-linux-gnu/libvolk.so.2.4
db9bf2ef4d40b677fd5f2dd0f2448dae usr/lib/aarch64-linux-gnu/libvolk.so.2.4
Perfect. If you want to join the disassembly fun: I don't even have an `objdump` for `aarch64` around, so I'm using this online tool for the first time.
For the kids following along at home ... me eating a candle :D
I'm not doing any better xD. I don't quite understand how libvolk is built, and where the kernels go in the machine code (there's `volk_32fc_x2_divide_32fc_manual` and `volk_32fc_x2_divide_32fc_get_func_desc`, but I don't know the roles they play).

Pragmatically, I think you can solve your original problem by locating the `rms_agc.py` file in your gr-satellites install and replacing the `1e-20` by something larger that works (I think you said `5.5e-20`).

Personally, I'd like to understand better why this problem happens, to check whether something else outside gr-satellites should be fixed as well, and whether `5.5e-20` is a safe value or could haunt us again. But I might stop soon and continue with this another day, given how late it is now.
Agreed on the late hour, it's been a pleasure nonetheless and I'll try the 'fix' tomorrow and report back (:
Thanks!
If you also want to do another test with the small C program, try building it with `-O3 -fcx-limited-range`. I've found that Volk is built with this flag, and the `CMakeLists.txt` includes the comment "Disable complex math NaN/INFO range checking for performance".

I've tested `-fcx-limited-range` on my arm32, and it works (it gives zero rather than nan), but maybe on arm64 it does something different.
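For context on why this flag matters here: with `-fcx-limited-range` (also implied by `-ffast-math`) the compiler skips the `__divsc3` range checks and emits the textbook formula (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i)/(c^2+d^2). With a denominator of 1e-20, c^2 = 1e-40 is subnormal in float32. Under IEEE rules it stays nonzero and the quotient is still 0, but if the FPU flushes subnormals to zero (as far as I understand, gcc's `-ffast-math` links startup code that sets the flush-to-zero bit), the denominator becomes exactly 0 and 0/0 yields NaN. A numpy sketch of the IEEE side (numpy keeps subnormals):

```python
import numpy as np

a, b = np.float32(0.0), np.float32(0.0)    # numerator 0 + 0j
c, d = np.float32(1e-20), np.float32(0.0)  # denominator 1e-20 + 0j

# Textbook complex division, as emitted under -fcx-limited-range
denom = c * c + d * d          # 1e-40: subnormal in float32, but nonzero
re = (a * c + b * d) / denom   # 0 / 1e-40 == 0 under IEEE
im = (b * c - a * d) / denom

assert denom > 0
assert re == 0 and im == 0
# With flush-to-zero enabled, denom would become exactly 0.0
# and 0/0 would give NaN instead.
```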
gcc -o test test.c -O3 -fcx-limited-range
./test
0.000000e+00 + 1j 0.000000e+00
@daniestevez I would perhaps suggest checking for slight differences between the arm64/amd64/generic kernels in Volk. I've had to deal with a similar problem recently, caused by slightly different precision between implementations. Nothing breaking the DSP, but enough to cause some code to get stuck on NaN. I ended up fixing it by writing the code in a way that would just discard the NaN on the next run. I haven't tested this kernel personally, but I would say it's not unlikely.
@kng can you perhaps try reproducing the issue with the offending kernel set to "generic generic" in ~/.volk/volk_config? This will make it use the generic implementation, instead of anything architecture-specific.
Just so I don't lose it. The same kind of division test of (0.0+1j 0.0) / (1e-20 + 1j 0.0) but using the volk kernel:
```c
#include <volk/volk.h>
#include <complex.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    size_t alignment = volk_get_alignment();
    float complex *z = volk_malloc(512 * sizeof(float complex), alignment);
    float complex *w = volk_malloc(512 * sizeof(float complex), alignment);
    float complex *x = volk_malloc(512 * sizeof(float complex), alignment);
    memset(z, 0, 512 * sizeof(float complex));
    memset(w, 0, 512 * sizeof(float complex));
    z[0] = 0.0f;
    w[0] = 1e-20f;
    volk_32fc_x2_divide_32fc(x, z, w, 512);
    printf("%e + 1j %e\n", creal(x[0]), cimag(x[0]));
    return 0;
}
```
Build with `gcc -O3 -Wall -o test_divide_volk test_divide_volk.c -lvolk`.

For me it works both on amd64 and on arm32 (on arm32 I'm using a really old version of Volk: 1.3, from Debian stretch).
gcc -O3 -Wall -o test_divide_volk test_divide_volk.c -lvolk
./test_divide_volk
nan + 1j nan
I've been struggling with my docker images that I prepared for satnogs-client and gr-satellites lately. It looked like everything was working fine, then I noticed quite high CPU load on some observations (bpsk 9k6), and also no demods on those.
Packaged release for satnogs here. I did build it from source as well, no change; branch maint-3.8, as it's debian bullseye.
It complains about too much udp data and dropping packets, but I think this is due to the load/hang, so it's a symptom.
gr::log :WARN: udp_source0 - Too much data; dropping packet.
We have solved this issue earlier with the udp.conf. We did some experiments, tried running `volk_profile` and messing around quite a bit, but without success.

I did stumble on a repeatable process to test it: fetch audio from https://network.satnogs.org/observations/6746907/

sox satnogs_6746907_2022-11-15T15-31-31.ogg satnogs_6746907_2022-11-15T15-31-31.wav
time gr_satellites 43199 --rawint16 satnogs_6746907_2022-11-15T15-31-31.wav --samp_rate 48e3
time gr_satellites 43199 --wavfile satnogs_6746907_2022-11-15T15-31-31.wav --samp_rate 48e3
On amd64 both complete in a few seconds. On arm64 the `--wavfile` one just stalls; I left it for 10h and no progress. `--rawint16` completes in about 20s on a rpi4. Tested with v3.15 and v5.1.0 on amd64 and v3.15 on arm64. (Technically `--samp_rate` is not needed for `--wavfile`, but it didn't seem to do any harm.)

I also tested running with `--dump_path` on arm64 on both `--rawint16` and `--wavfile`; the latter just grew and grew in size.

htop while running a bpsk 9k6 obs, by70-2 iirc:
htop while running similar obs on amd64:
Edit: forgot the screendump of the file run on the rpi4: