Hi @gonesurfing,
Thanks for your bug report.
First off, good job on solving the first USB buffering issue, although I will note that I don't recall doing this on my system. I'm not sure what my Jetpack version is, but I will check. If I saw this error message, I would try `--num_samp 131072`, halving the number of samples queried from the SDR over USB on each call. I don't think that would solve your buffer filling up issue, though. Let's tackle that, because it will be a real pain in operation if we don't fix it.
You're right that the defaults require tuning. I tuned them on my Jetson Nano 4GB to easily keep the SDR buffers drained, so it's surprising to me that you encountered this warning. Thanks for including debugging output. Can you post the rest of it? If you're shy about posting a lot of output, you can hide it like this:
<details><summary>Details</summary>Looooooong output</details>
I reduced the sample size, but the buffers actually seem to fill up a little quicker than they did with the original value.
$ python effex/effex/effex.py --time 15 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 131072 --resolution 4096 --gain 29.7 --mode spectrum --loglevel=DEBUG
Here is the full output. I changed to 15s so that it's more manageable.
Yep, as expected, `--num_samp 131072` will make the SDR buffers fill up faster: it takes less time to read less data over USB on each call. That would have been a way to address your very first error message, which was a Linux config issue.
I'm suspecting something is throttling your GPU's throughput. If the GPU were processing SDR samples quickly enough, the output buffer would be filling up as well. This is the "Correlation buffer" the log is talking about, where data is held before being written to the output file on 1-second intervals. Let's take the data logging out of the equation temporarily to make sure it's not that, though. After that we can look at GPU performance.
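To make the failure mode concrete, here's a toy producer/consumer (illustrative only, not effex code): a bounded buffer starts dropping data as soon as the consumer runs slower than the producer, which is exactly what the SDR buffer warnings mean.

```python
# Toy sketch, not effex code: a bounded queue (the "SDR buffer") fed faster
# than it is drained eventually fills and starts dropping blocks.
import queue
import threading
import time

buf = queue.Queue(maxsize=10)               # stand-in for one SDR buffer

def sdr_producer(n_blocks=60):
    """Pretend SDR: delivers one block of samples every ~0.1 s."""
    for _ in range(n_blocks):
        try:
            buf.put_nowait('block of IQ samples')
        except queue.Full:
            print('SDR buffer filled up. Data may have been lost!')
        time.sleep(0.1)

threading.Thread(target=sdr_producer, daemon=True).start()

for _ in range(15):                          # pretend GPU: ~0.3 s per block
    buf.get()
    time.sleep(0.3)
    print('SDR buffer size:', buf.qsize())
```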
Comment out these lines, `effex.py:451-452`. It should look like this:
# output_thread.start()
# self.logger.debug('Starting output buffering thread.')
Run this with the option `--omit_plot`, too, since there will be no data in the output .csv for it to plot. It may still throw a warning about that, but you can ignore it.
Same result, except the output buffer never drains.
FYI, I am using a micro USB power supply, but the OS does not seem to be complaining about overcurrent and throttling. The first couple of power supplies I tried did in fact result in throttling, with lots of messages in dmesg.
Here's a snapshot of tegrastats while effex is running.
RAM 1847/3964MB (lfb 5x4MB) SWAP 0/1982MB (cached 0MB) CPU [8%@1479,19%@1479,35%@1479,40%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@31.5C CPU@35C PMIC@50C GPU@32C AO@40.5C thermal@33.25C
I've also installed jtop, and can see that it's running in MAXN power mode. I tried running `jetson_clocks` to see if it helped, and no difference was noted. Here is the command I ran:
python effex.py --time 15 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 131072 --resolution 4096 --gain 29.7 --mode spectrum --loglevel=DEBUG --omit_plot=True
Can you try `sudo tegrastats --interval 100` while `effex` is running, and send more lines? That 0% number next to GR3D is suspicious: either <1% of the GPU is actually in use, or you just caught it when it wasn't processing any data.
I'm looking for GR3D_FREQ, something like `GR3D_FREQ 0%@921`, to check both utilization and GPU clock frequency, and VDD_CPU and VDD_GPU to check power consumption in mW. This should tell us how your GPU is doing. Your beefier USB power supply may be good enough to avoid brown-outs and keep things in dmesg happy, but I know this thing has more than one power mode. I think the higher 10W mode is the default, but there is also a 5W mode; being in 5W mode will throttle the CPU/GPU clock speeds for sure.
For what it's worth, my Jetson is powered via the barrel jack by a 5V 5A power supply, although I doubt it's drawing that many amps under load. I have not yet run into any issues running the software and powering 2x Nooelec low noise amplifiers via the RTL-SDR bias-tees.
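If it helps to sift through the output, a throwaway parser along these lines (a hypothetical helper, not part of effex) pulls out just the GPU fields:

```python
# Hypothetical helper, not part of effex: pull GPU utilization/clock and the
# power rails out of tegrastats lines piped in on stdin. Rail names here are
# the Nano's (POM_5V_*); other Jetsons report VDD_* instead.
import re
import sys

for line in sys.stdin:
    gr3d = re.search(r'GR3D_FREQ (\d+)%(?:@(\d+))?', line)
    gpu_mw = re.search(r'POM_5V_GPU (\d+)/\d+', line)
    cpu_mw = re.search(r'POM_5V_CPU (\d+)/\d+', line)
    if gr3d:
        util = gr3d.group(1)
        clk = gr3d.group(2) or '?'
        print(f"GPU {util}% @ {clk} MHz, "
              f"GPU {gpu_mw.group(1) if gpu_mw else '?'} mW, "
              f"CPU {cpu_mw.group(1) if cpu_mw else '?'} mW")
```

Usage would be something like `sudo tegrastats --interval 100 | python3 parse_tegrastats.py` (the script name is made up).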
Ok, thanks for checking the power mode and trying `jetson_clocks`. Barring unhappy results from `tegrastats`, I see no reason why you shouldn't also have good performance. What make/model are the SDR devices, out of curiosity?
I ordered a barrel jack power supply to test, but the Jetson is definitely running in 10W mode (MAXN) with no brownouts.
Just to make sure there wasn't something silly going on with having no input signal, I put a noise source with a y-splitter on both RTL-SDRv3 inputs. No noticeable change.
There is a little GPU activity, but it's very sporadic: maybe once every second it spikes to somewhere between 10% and 90% usage. It seems like a thread is hogging CPU time, since each stats frame has one CPU nearly maxed out. Just speculating, but maybe there is a dependency that is running on the CPU instead of the GPU?
I'm going to try and dig through the code this weekend, and see if I can identify the function(s) that are bogging things down.
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [90%@1479,16%@1479,0%@1479,9%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 7%@768 APE 25 PLL@28C CPU@32C PMIC@50C GPU@28.5C AO@37C thermal@30.25C POM_5V_IN 2746/2911 POM_5V_GPU 159/161 POM_5V_CPU 796/989
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,20%@1479,80%@1479,9%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@32C PMIC@50C GPU@29C AO@36.5C thermal@30.25C POM_5V_IN 2980/2911 POM_5V_GPU 238/162 POM_5V_CPU 912/989
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [9%@1479,0%@1479,70%@1479,30%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@31.5C PMIC@50C GPU@29C AO@37C thermal@30.25C POM_5V_IN 2631/2910 POM_5V_GPU 119/161 POM_5V_CPU 756/988
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,0%@1479,0%@1479,100%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@31.5C PMIC@50C GPU@29C AO@36.5C thermal@30.25C POM_5V_IN 2631/2909 POM_5V_GPU 119/161 POM_5V_CPU 757/987
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [16%@1479,10%@1479,72%@1479,27%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@31.5C PMIC@50C GPU@29C AO@36.5C thermal@30.25C POM_5V_IN 2941/2909 POM_5V_GPU 158/161 POM_5V_CPU 1111/988
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,27%@1479,18%@1479,70%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28.5C CPU@32C PMIC@50C GPU@29.5C AO@37C thermal@30.25C POM_5V_IN 3055/2910 POM_5V_GPU 277/162 POM_5V_CPU 952/988
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [9%@1479,9%@1479,0%@1479,83%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@31.5C PMIC@50C GPU@28.5C AO@36.5C thermal@30.25C POM_5V_IN 2591/2909 POM_5V_GPU 119/162 POM_5V_CPU 717/987
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [9%@1479,100%@1479,0%@1479,0%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28C CPU@33C PMIC@50C GPU@29.5C AO@36.5C thermal@30.25C POM_5V_IN 2671/2908 POM_5V_GPU 119/161 POM_5V_CPU 796/986
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,60%@1479,9%@1479,30%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 99%@768 APE 25 PLL@28C CPU@32C PMIC@50C GPU@29C AO@37C thermal@30.25C POM_5V_IN 3282/2909 POM_5V_GPU 474/162 POM_5V_CPU 909/986
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [9%@1479,0%@1479,0%@1479,100%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28.5C CPU@31.5C PMIC@50C GPU@29.5C AO@36.5C thermal@30.25C POM_5V_IN 2706/2908 POM_5V_GPU 159/162 POM_5V_CPU 796/985
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [0%@1479,0%@1479,81%@1479,20%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 0%@768 APE 25 PLL@28.5C CPU@31.5C PMIC@50C GPU@29C AO@37C thermal@30.25C POM_5V_IN 2706/2908 POM_5V_GPU 119/162 POM_5V_CPU 835/985
RAM 1464/3964MB (lfb 214x4MB) SWAP 0/1982MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [18%@1479,16%@1479,70%@1479,18%@1479] EMC_FREQ 6%@1600 GR3D_FREQ 7%@768 APE 25 PLL@28C CPU@31.5C PMIC@50C GPU@29C AO@36.5C thermal@30.25C POM_5V_IN 2941/2908 POM_5V_GPU 158/162 POM_5V_CPU 990/985
Thanks for the `tegrastats` dump. This is odd behavior indeed. I have thought about (and implemented in other projects, not this one) conditional imports based on whether the user has `cusignal` available or not. This software won't even run if `cusignal` is not found.
It's possible that the scheduler on the OS (due to some different defaults in a newer Jetpack image?) is kneecapping the program by not giving the correlator tasks full access to their own processors. I haven't updated my Jetpack image in some time, because I haven't needed to, but that's certainly something I can try if necessary to reproduce your issue. I'd just have to back up my current SD card image.
I encourage you to have a poke around. Maybe see `_startup_task()`, where I start up everything that needs to happen concurrently:
- a `multiprocessing.Process` for asking for samples from each SDR using pyrtlsdr
- `asyncio` functions that watch for keyboard input (`_get_kbd()`); this is likely your culprit for 100% CPU usage on one or another core. The OS scheduler probably hands it off to one or another core, so it moves around. But this should not be an issue, and it's for a good cause; this could be disabled entirely, since we automatically go through the CALIBRATE state on startup, but then you wouldn't be able to recalibrate on demand later.
- the GPU processing task, which is called on each loop through the main body of the program, in `run_state_machine()`.
I have not made any efforts to assign or pin any of these to a specific processor core, leaving that management up to the OS. It's possible that they could benefit from more babysitting, but that's starting to get fairly involved, and I'd prefer to keep it as a last resort.
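If it helps to reason about, the overall layout is roughly like this toy sketch (illustrative only; names are simplified, and a plain thread stands in for the real asyncio keyboard watcher):

```python
# Toy sketch of the concurrency layout described above, not effex itself.
# Names are illustrative. The real keyboard watcher is an asyncio coroutine
# (_get_kbd()); a plain thread stands in for it here.
import multiprocessing as mp
import sys
import threading
import time

def sdr_producer(buf, num_samp, stop):
    """Stand-in for the pyrtlsdr reader process: push fake sample blocks."""
    while not stop.is_set():
        block = [0.0] * num_samp            # placeholder for sdr.read_samples()
        try:
            buf.put(block, timeout=0.1)     # bounded queue: full means data is dropped
        except Exception:
            pass

def kbd_watcher(recalibrate):
    """Stand-in for _get_kbd(): block on stdin instead of polling."""
    for line in sys.stdin:
        if line.strip().lower() == 'c':
            recalibrate.set()

if __name__ == '__main__':
    stop = mp.Event()
    recalibrate = threading.Event()
    bufs = [mp.Queue(maxsize=8) for _ in range(2)]   # one per SDR
    procs = [mp.Process(target=sdr_producer, args=(b, 2**18, stop), daemon=True)
             for b in bufs]
    for p in procs:
        p.start()
    threading.Thread(target=kbd_watcher, args=(recalibrate,), daemon=True).start()

    t_end = time.time() + 5.0               # stand-in for --time 5
    while time.time() < t_end:              # the run_state_machine() loop
        iq_0, iq_1 = bufs[0].get(), bufs[1].get()
        # ...GPU processing of (iq_0, iq_1) would happen here each pass...
        if recalibrate.is_set():
            recalibrate.clear()             # would re-enter CALIBRATE
    stop.set()
```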
Re: this
Just to make sure there wasn't something silly going on with having no input signal, I put a noise source with a y-splitter on both RTL-SDRv3 inputs. No noticeable change.
You can be assured this isn't a problem. I run without a splitter, just open SDR inputs all the time when I'm testing portions of the code unrelated to cross-correlation. The only difference it makes is that the cross-correlation of both channels is garbage, since they are independent and uncorrelated noisy samples from the open SDR inputs. It doesn't affect the speed of execution or processing load in any way.
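If you want to convince yourself numerically, a quick check like this (just an illustration, not effex code) shows the averaged cross-power of two independent noise streams sitting near zero, while a component shared by both channels survives the averaging:

```python
# Illustration only: the averaged cross-power of independent noise is ~0,
# while a component common to both channels survives the averaging.
import numpy as np

rng = np.random.default_rng(0)
n = 2**18
a = rng.normal(size=n) + 1j * rng.normal(size=n)       # open input 0
b = rng.normal(size=n) + 1j * rng.normal(size=n)       # open input 1
common = rng.normal(size=n) + 1j * rng.normal(size=n)  # shared noise source

def mean_xpower(x, y):
    return np.mean(np.fft.fft(x) * np.conj(np.fft.fft(y))) / n

print('independent inputs: ', abs(mean_xpower(a, b)))                     # ~0
print('shared noise source:', abs(mean_xpower(a + common, b + common)))   # ~2, i.e. E[|common|^2]
```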
Querying my Jetpack version info:
(effex-dev) evanmayer@evanjetson:~/github/effex$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 3.1, GCID: 18186506, BOARD: t210ref, EABI: aarch64, DATE: Tue Dec 10 06:58:34 UTC 2019
https://forums.developer.nvidia.com/t/jetpack-version-check/221308
Looks like I'm on L4T 32, rev. 3.1. So if I'm reading this right, that would be Jetpack 4.3.
Output from running `tegrastats` alongside on my machine:
You could also try some of the cuSignal examples as a 0th-order check to see whether performance is in line with their expectations.
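If running the notebooks over ssh is a pain, a bare-bones script version of that check could look like this (my own sketch, not from the cuSignal repo):

```python
# Standalone sketch of the notebook's %%timeit FFT check; only needs cupy.
import time
import cupy as cp

N = 2**18                                   # same size as --num_samp 262144
gpu_signal = cp.random.random(N) + 1j * cp.random.random(N)

cp.fft.fft(gpu_signal)                      # warm-up: builds the FFT plan
cp.cuda.Stream.null.synchronize()

n_loops = 1000
t0 = time.perf_counter()
for _ in range(n_loops):
    gpu_fft = cp.fft.fft(gpu_signal)
cp.cuda.Stream.null.synchronize()           # wait for the GPU before stopping the clock
print(f'{(time.perf_counter() - t0) / n_loops * 1e3:.3f} ms per FFT')
```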
I have pushed a commit (f151932540c15dd61883585594db93fa3ff9bd9f) with some additional timing measurements around important parts of the state machine. When you have a chance, can you `git pull` and run:
python effex/effex/effex.py --time 5 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 262144 --resolution 4096 --gain 29.7 --mode spectrum --omit_plot True --loglevel=DEBUG
Thanks! It looks like my GPU task is taking at least an order of magnitude longer. Seems like a lot of mine are around a whole 1/3 of a second!
python effex/effex.py --time 5 --bandwidth 2.4e6 --num_samp 262144 --resolution 4096 --gain 29.7 --mode spectrum --omit_plot True --loglevel=DEBUG
I apologize for the formatting. I don't know if it's because I'm copying from an ssh session running in the Mac terminal, or if I'm missing something in the details tag.
I tried a few of the cusignal Jupyter notebook examples, and the GPU seems to be a little slower than the published results. However, I don't know what hardware they were using.
Published:
%%timeit
gpu_fft = cp.fft.fft(gpu_signal)
315 µs ± 8.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Mine:
688 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
You can format it as a code block by adding ``` (3 backticks) above and below the block of text.
Executing the cell right before "Allocating FFT Plan before Invocation"
%%timeit
gpu_fft = cp.fft.fft(gpu_signal)
681 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We have the same performance. Head scratcher.
What if `N = 2**18`?
%%timeit
gpu_fft = cp.fft.fft(gpu_signal)
6.33 ms ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With N = 2**18, I'm right there with you.
6.32 ms ± 462 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
There must be an issue with one of the cusignal function calls. Try running this for comparison:
pytest -vv --durations=0 -k test_channelizepoly_gpu
<details>
<summary>channelize_poly</summary>
platform linux -- Python 3.10.8, pytest-7.2.1, pluggy-1.0.0 -- /home/ewthornton/miniforge3/envs/cusignal-dev/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=25 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=True warmup_iterations=10)
rootdir: /home/ewthornton/cusignal/python, configfile: setup.cfg
plugins: benchmark-4.0.0, anyio-3.6.2
collected 1265 items / 1253 deselected / 12 selected
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float32] PASSED [ 8%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float64] PASSED [ 16%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex64] PASSED [ 25%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex128] PASSED [ 33%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float32] PASSED [ 41%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float64] PASSED [ 50%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex64] PASSED [ 58%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex128] PASSED [ 66%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float32] PASSED [ 75%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float64] PASSED [ 83%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex64] PASSED [ 91%]
cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex128] PASSED [100%]
================================================== slowest durations ===================================================
2.45s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float32]
0.13s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex128]
0.13s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex128]
0.12s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex64]
0.12s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex64]
0.12s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex128]
0.12s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex64]
0.11s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float64]
0.10s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float64]
0.10s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float64]
0.10s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float32]
0.09s call cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float32]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float32]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float64]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex128]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex64]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex128]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex64]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex128]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float64]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float64]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float32]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float32]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex128]
0.00s setup cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float32]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-float32]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float32]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-float64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[256-2048-4096-complex64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-float64]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex128]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[64-2048-4096-complex128]
0.00s teardown cusignal/test/test_filtering.py::TestFilter::TestChannelizePoly::test_channelizepoly_gpu[128-2048-4096-complex64]
========================================= 12 passed, 1253 deselected in 7.91s ==========================================
</details>
I'm on this version of cusignal.
git status
On branch branch-23.02
Your branch is up to date with 'origin/branch-23.02'.
I'm on this one:
(base) evanmayer@evanjetson:~/github/cusignal$ git status
On branch branch-0.16
Your branch is up to date with 'origin/branch-0.16'.
It hasn't been updated in a while. I would be upset if this were a regression related to that. I'll report back when I've been able to update fully - it's going to require CUDA 10.0 -> 11.8, which requires Ubuntu 18.04 -> 20.04, so it will be a bit before I can re-create the latest cuSignal conda env.
You shouldn't have to go to 20.04. I'm still on 18.04. Maybe that is part of the issue? Cusignal built fine.
Ok, thanks. Installing the latest `cusignal-dev` conda env failed for me during the `cupy` install, throwing an error about me being on CUDA 10.0 < 10.2. I'll see if I can find a way to satisfy that without the above steps, then. The easiest answer might be to flash a more recent Jetpack version and try the cuSignal install after that.
Edit: unless this happens to work... https://docs.nvidia.com/jetson/jetpack/install-jetpack/index.html#package-management-tool
I ran apt update/upgrade after installing the 4.6.1 SD card image, and it upgraded me to 4.6.3. It may work.
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0
Ok! Fresh Jetpack flashed onto the SD card.
(effex-dev) evanmayer@evanjetson:~/github/effex$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0
Running
python effex/effex.py --time 5 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 262144 --resolution 4096 --gain 29.7 --mode spectrum --loglevel=DEBUG
replicates your very first crash, so I fixed the usbfs buffer size like you did:
sudo sh -c 'echo 0 > /sys/module/usbcore/parameters/usbfs_memory_mb'
Then
python effex/effex.py --time 5 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 262144 --resolution 4096 --gain 29.7 --mode spectrum --omit_plot=True --loglevel=DEBUG
yields:
I have replicated your poor performance. The SDR buffers are definitely filling up, although this run isn't long enough to throw the warning.
I profiled `_run_task()` with `nvprof` like this. Sorry, it's a hell of a command, but you don't have to run it! Read on:
sudo env "PATH=$PATH" /usr/local/cuda/bin/nvprof --profile-from-start off -o profile.out python3 effex/effex.py --time 5 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 262144 --resolution 4096 --gain 29.7 --mode spectrum --omit_plot True --loglevel=DEBUG
Yielding:
It's not in order of hottest to coldest, but chronological. I'm not familiar with this profiler... it's a little hard to read, but `_cupy_channelizer_8x8_complex128_complex128` is obviously the longest execution time, at ~10 ms per call. There are two of these per call to `_run_task()`, so that brings us to ~0.02 s. I was a little surprised the rest of the functions could add up to ~10x that, so I looked further and put in some more rough manual timing:
# Threading to take FFTs using polyphase filterbank
t0 = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as iq_processor:
    t1 = time.time()
    future_0 = iq_processor.submit(self._spectrometer_poly, *(cp.array(self.gpu_iq_0), self.ntaps, self.nbins, self.window))
    future_1 = iq_processor.submit(self._spectrometer_poly, *(cp.array(self.gpu_iq_1), self.ntaps, self.nbins, self.window))
    f0 = future_0.result()
    f1 = future_1.result()
    t2 = time.time()
t3 = time.time()
#f0 = self._spectrometer_poly(cp.array(self.gpu_iq_0), self.ntaps, self.nbins, self.window)
#f1 = self._spectrometer_poly(cp.array(self.gpu_iq_1), self.ntaps, self.nbins, self.window)
# http://www.gmrt.ncra.tifr.res.in/doc/WEBLF/LFRA/node70.html
# Implemented according to Thompson, Moran, Swenson's Interferometry and
# Synthesis in Radio Astronomy, 3rd ed., p. 364: Fractional Sample Delay
# Correction
freqs = cp.fft.fftfreq(f0.shape[-1], d=1/self.bandwidth) + self.frequency
# Calculate cross-power spectrum and apply FSTC by a phase gradient
rot = cp.exp(-2j * cp.pi * freqs * (-self.calibrated_delay))
xpower_spec = f0 * cp.conj(f1 * rot)
xpower_spec = cp.fft.fftshift(xpower_spec.mean(axis=0))
if self.mode in ['CONTINUUM', 'TEST']: # don't save spectral information
    vis = xpower_spec.mean(axis=0) / self.bandwidth # a visibility amplitude estimate
else:
    vis = xpower_spec
t4 = time.time()
self.logger.debug(f'Time to start threadpool: {t1 - t0}')
self.logger.debug(f'Time to do actual GPU processing: {t2 - t1}')
self.logger.debug(f'Time to close threadpool: {t3 - t2}')
self.logger.debug(f'Time to complete other _run_task() tasks: {t4 - t3}')
return vis
It's taking 0.3 s to exit a ThreadPool context! I'm not sure why this is so much worse on this new Jetpack image. After some thought, I am OK with getting rid of this. There is only 1 GPU, so multithreaded concurrency gets us no performance gains here, because the channelizer calls occur sequentially anyway. I have pushed a commit that ditches this ThreadPool, and performance should be much improved.
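For reference, the processing body without the ThreadPool looks roughly like this (a sketch based on the commented-out direct calls above; the pushed commit is the authoritative version):

```python
# Sketch of _run_task() without the ThreadPoolExecutor: the two channelizer
# calls just run back to back, since there is only one GPU to feed anyway.
f0 = self._spectrometer_poly(cp.array(self.gpu_iq_0), self.ntaps, self.nbins, self.window)
f1 = self._spectrometer_poly(cp.array(self.gpu_iq_1), self.ntaps, self.nbins, self.window)

# Fractional sample delay correction by a phase gradient, as before
freqs = cp.fft.fftfreq(f0.shape[-1], d=1/self.bandwidth) + self.frequency
rot = cp.exp(-2j * cp.pi * freqs * (-self.calibrated_delay))
xpower_spec = f0 * cp.conj(f1 * rot)
xpower_spec = cp.fft.fftshift(xpower_spec.mean(axis=0))

if self.mode in ['CONTINUUM', 'TEST']:      # don't save spectral information
    vis = xpower_spec.mean(axis=0) / self.bandwidth
else:
    vis = xpower_spec
return vis
```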
Please give it a try and let me know how it works.
That appears to have fixed it! I tried running up to 60s, and had no buffer overflows.
Thanks for diving into it. It would have taken me 10x longer to figure it out.
First off, excellent work on this software. I have a remote sensing application in mind, but I'm first trying to get a dry run going.
I'm using a 4 GB Jetson Nano with Jetpack 4.6.1.
Probably unrelated, but I first get a crash. Seems the kernel defaults need tuning.
Allocating 15 zero-copy buffers
Allocating 15 zero-copy buffers
Failed to submit transfer 1
Please increase your allowed usbfs buffer size with the following command:
echo 0 > /sys/module/usbcore/parameters/usbfs_memory_mb
Failed to submit transfer 0
Please increase your allowed usbfs buffer size with the following command:
echo 0 > /sys/module/usbcore/parameters/usbfs_memory_mb
Running this takes care of the error:
sudo sh -c 'echo 0 > /sys/module/usbcore/parameters/usbfs_memory_mb'
Next, I run this from your README.
python effex/effex/effex.py --time 60 --bandwidth 2.4e6 --frequency 1.4204e9 --num_samp 262144 --resolution 1024 --gain 29.7 --mode spectrum --loglevel=DEBUG
Next run starts out ok, but the correlator can't seem to keep up. The queue sizes keep increasing on the SDR buffers. Not sure why this setup would be different than yours, other than I don't have antennas hooked up yet.
2023-01-18 21:33:51,348 - main - DEBUG - State transition: STARTUP to CALIBRATE
2023-01-18 21:33:51,349 - main - DEBUG - SDR buffer 0 size: 0
2023-01-18 21:33:51,349 - main - DEBUG - SDR buffer 1 size: 0
2023-01-18 21:33:51,349 - main - DEBUG - Correlation buffer size: 0
Allocating 15 zero-copy buffers
Allocating 15 zero-copy buffers
2023-01-18 21:33:52,537 - main - DEBUG - Starting calibration
2023-01-18 21:33:54,062 - main - INFO - Estimated delay (us): 16253.33074552434
2023-01-18 21:33:54,062 - main - DEBUG - State transition: CALIBRATE to RUN
2023-01-18 21:33:54,063 - main - DEBUG - SDR buffer 0 size: 14
2023-01-18 21:33:54,063 - main - DEBUG - SDR buffer 1 size: 14
2023-01-18 21:33:54,063 - main - DEBUG - Correlation buffer size: 0
2023-01-18 21:33:54,530 - main - DEBUG - SDR buffer 0 size: 18
2023-01-18 21:33:54,531 - main - DEBUG - SDR buffer 1 size: 18
2023-01-18 21:33:54,531 - main - DEBUG - Correlation buffer size: 1
...
2023-01-18 21:34:00,690 - main - DEBUG - SDR buffer 0 size: 59
2023-01-18 21:34:00,691 - main - DEBUG - SDR buffer 1 size: 59
2023-01-18 21:34:00,698 - main - DEBUG - Correlation buffer size: 2
2023-01-18 21:34:00,721 - main - WARNING - SDR buffer 0 filled up. Data may have been lost!
2023-01-18 21:34:00,722 - main - WARNING - SDR buffer 1 filled up. Data may have been lost!
2023-01-18 21:34:01,151 - main - DEBUG - SDR buffer 0 size: 59
2023-01-18 21:34:01,151 - main - DEBUG - SDR buffer 1 size: 59
2023-01-18 21:34:01,152 - main - DEBUG - Correlation buffer size: 1
2023-01-18 21:34:01,153 - main - WARNING - SDR buffer 0 filled up. Data may have been lost!
2023-01-18 21:34:01,153 - main - WARNING - SDR buffer 1 filled up. Data may have been lost!