bondagit / aes67-linux-daemon

AES67 Linux Daemon with configuration WebUI
GNU General Public License v3.0
353 stars 81 forks source link

occasional drops detected in output RTP stream #17

Closed bondagit closed 3 years ago

bondagit commented 4 years ago

As reported by @jtc-dolby we have an issue with the ALSA RAVENNA driver causing occasional drops or stream realignment in output RTP stream. It's possible to detect the problem by playing on the RAVENNA device a low frequency e.g. 100Hz sine wave and checking the resulting recording on another device. The issue can be reproduced with both memory mapped and read/write interleaved modes.

See enclosed pictures:

drop_mmap drop_no_mmap

According to my tests the problem doesn't show up on the recording side and it seems to affect the playback only.

After a number of tries I could associate the issue to the following debug prints logged by the driver (see lines in bold):

[432559.316518] entering mr_alsa_audio_pcm_trigger (substream name=subdevice #0 #0) ... [432559.316521] mr_alsa_audio_pcm_trigger(Start), rate=48000 format=32 channels=2 period_size=48 [432559.316523] starting playback I/O [432559.323410] LastTICCounter = 2636220 ui64TICCounter = 2636227 (Timer period = 69 [100us]) [432559.323427] Timer period out of range: 0 [ms]. Target period = 1 [432559.323428] next_wakeup: 432561021200000 now: 432561021145163 [432560.597372] LastTICCounter = 2637506 ui64TICCounter = 2637501 (Timer period = 9 [100us]) [432660.226216] LastTICCounter = 2737128 ui64TICCounter = 2737130 (Timer period = 18 [100us]) [432660.227235] LastTICCounter = 2737131 ui64TICCounter = 2737131 (Timer period = 0 [100us]) [432719.026270] Timer period out of range: 0 [ms]. Target period = 1 [432719.026276] next_wakeup: 432720725500000 now: 432720725422743

bondagit commented 4 years ago

Hi @jtc-dolby, the problem doesn't seem to show up anymore if I recompile the Linux kernel with the timer frequency set to 100Hz: CONFIG_HZ=100 In the kernel menu config you can change the timer frequency in: Kernel features -> Timer frequency Can you try this and report if it works for you ? Thanks

bondagit commented 4 years ago

Also this a frequency of 250Hz the problem doesn't show up.

jtc-dolby commented 3 years ago

You said that the problem doesn't show up with 250Hz but this is the default frequency in kernel config file. If this is true then why would expect a difference in behaviour with 100Hz? I will try 100Hz anyway and see if it changes anything.

jtc-dolby commented 3 years ago

I tested with 100Hz. Not good news. I got drops using aplay both with and without the -M option (Memory mapped mode). (Before I got drops only with the -M option). The drops were much more frequent when using the -M option: 43 vs 5 drops over a 45min period.

drtechno1971 commented 3 years ago

Try: CONFIG_HZ=1000 CONFIG_NO_HZ_FULL=y Problem with this driver it uses a webui instead of plain C console echo, but this might fair better. Also, It might be better to write a driver extension to Alsa instead of using a 3rd party virtual driver to lower latency.

drtechno1971 commented 3 years ago

Also you might want to try passing idle=mwait at boot time so the kernal doesn't reschedule the timing due to others using adaptive timing ticks.

bondagit commented 3 years ago

I have finally found some time to proceed on the debugging of this issue and I have just pushed on the branch test_issue_17 the files required to run a test on the loopback network interface and to verify the integrity of the wav file recorded. In the test I configure a single 48Khz L24 stereo source and the corresponding sink to playback and record on the loopback network interface a sample wav file created artificially to enable simple detection of corruptions on the recording. To run the test and create the recording:

./run_test.sh

To verify that the recording is correct:

cd test
c++ check.cc -o check
./check

The expected output is : "ok", otherwise the program prints out the file position where the corruption was detected. I have run the test successfully many times on both ARM and x86_64 platforms and I never had a single invalid recording. In my tests I have tried to use both mermory mapped and read/write interleaved modes.

jtc-dolby commented 3 years ago

Thanks for the new information. It’s been a while since I looked at it so I can see if it is still an issue on my end. The way I was testing was to use ‘aplay’ to transmit a known sequence and then recording the output. I was seeing errors with ‘-M’ switch i.e. memory mapped mode but not using regular mode. I can repeat this test for you if that will help.

From: Andrea Bondavalli notifications@github.com Reply-To: bondagit/aes67-linux-daemon reply@reply.github.com Date: Tuesday, January 19, 2021 at 11:44 AM To: bondagit/aes67-linux-daemon aes67-linux-daemon@noreply.github.com Cc: James Cowdery JTC@dolby.com, Mention mention@noreply.github.com Subject: Re: [bondagit/aes67-linux-daemon] occasional drops or stream realignment detected in output RTP stream (#17)

I have finally found some time to proceed on the debugging of this issue and I have just pushed on the branch test_issue_17 the files required to run a test on the loopback network interface and to verify the integrity of the wav file recorded. In the test I configure a single 48Khz L24 stereo source and the corresponding sink to playback and record on the loopback network interface a sample wav file created artificially to enable simple detection of corruptions on the recording. To run the test and create the recording:

./run_test.sh

To verify that the recording is correct:

cd test

c++ check.cc -o check

./check

The expected output is : "ok", otherwise the program prints out the file position where the corruption was detected. I have run the test successfully many times on both ARM and x86_64 platforms and I never had a single invalid recording. In my tests I have tried to use both mermory mapped and read/write interleaved modes.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_bondagit_aes67-2Dlinux-2Ddaemon_issues_17-23issuecomment-2D763084458&d=DwMCaQ&c=lI8Zb6TzM3d1tX4iEu7bpg&r=AHk4BVn6sW_wKrFU1lW6bg&m=tGCTaPPW8se_0Y_E_IWspaq-FAhkrgvXTC4i8uynEko&s=hh1xX3NnpYEYcMK-xDaZcdMJf1s0rL1rgFBYqpU6wq8&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AFPBWXCWVY46IZ7JZ4CJ3YLS2XOJZANCNFSM4PSSEYQQ&d=DwMCaQ&c=lI8Zb6TzM3d1tX4iEu7bpg&r=AHk4BVn6sW_wKrFU1lW6bg&m=tGCTaPPW8se_0Y_E_IWspaq-FAhkrgvXTC4i8uynEko&s=tt8yWVREud5SsIW7lPHvw_8T9EjLVCyGsQstPTm0_1c&e=.

bondagit commented 3 years ago

When I have originally opened the issue I was running the tests using the daemon to playback a 48Khz L24 stereo file and the Audinate Dante USB to perform the recording. Apparently this device has strict latency constrains and when it receives a late packet (that is a packet exceeding the 2ms delay in this configuration) it replaces the late samples with zeros and this leads to such recordings. I enclosed a picture showing such late packets as reported by the Dante device latency tab: Dante latency

bondagit commented 3 years ago

hi @jtc-dolby , great to hear you again ;-) Can you just try to run the simple test on the test_issue_17 branch ? I'd like to hear the results. Thanks. Anyway this is not conclusive analysis and I am going to perform more device to device (not loopback) tests.

bondagit commented 3 years ago

At present I suspect that the issue is actually related to these late packets. At the moment I am not sure why I have such late packets when I run tests on the X86_64 platform but I have none when using some ARM boards I tested here. In my opinion the issue is not related to the daemon or the Merging driver but it might come from the Linux kernel or the network device or driver in use. I enclosed a picture with the latency measurement of a stream sent by the daemon running on NanoPi2 with Linux nanopineo2 5.8.8-sunxi64 kenrl. As you can see I have no late packets and the peak latency is 1.4ms Dante latency NanoPi2 On this board the latency keeps stable also on a longer period of time and I don't have a single late packet, see picture below: Dante latency NanoPi2 -2

jtc-dolby commented 3 years ago

Thanks for the kind words. Yes I’m still here☺ I’ll try running the test as you suggest. For your reference, I use a Lawo receiver with 10ms latency.

From: Andrea Bondavalli notifications@github.com Reply-To: bondagit/aes67-linux-daemon reply@reply.github.com Date: Tuesday, January 19, 2021 at 12:03 PM To: bondagit/aes67-linux-daemon aes67-linux-daemon@noreply.github.com Cc: James Cowdery JTC@dolby.com, Mention mention@noreply.github.com Subject: Re: [bondagit/aes67-linux-daemon] occasional drops or stream realignment detected in output RTP stream (#17)

hi @jtc-dolbyhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_jtc-2Ddolby&d=DwMCaQ&c=lI8Zb6TzM3d1tX4iEu7bpg&r=AHk4BVn6sW_wKrFU1lW6bg&m=Rvv-7mj2Kbqz2zNnEnmEKwUHOanv6WaM5ZDqccIP_vM&s=c1VozeJYaRRotBGXclFJOcOfmXQLfV8XRere4-5E-dQ&e= , great to hear you again ;-) Can you just try to run the simple test on the test_issue_17 branch ? I'd like to hear the results. Thanks. Anyway this is not conclusive analysis and I am going to perform more device to device (not loopback) tests.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_bondagit_aes67-2Dlinux-2Ddaemon_issues_17-23issuecomment-2D763095075&d=DwMCaQ&c=lI8Zb6TzM3d1tX4iEu7bpg&r=AHk4BVn6sW_wKrFU1lW6bg&m=Rvv-7mj2Kbqz2zNnEnmEKwUHOanv6WaM5ZDqccIP_vM&s=tEsntt_Pq0ty_JtNBAlCX9m-Ec2KEmoe0pLs2kJWnb4&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AFPBWXHFZZXDCAUWJUCBU2DS2XQSRANCNFSM4PSSEYQQ&d=DwMCaQ&c=lI8Zb6TzM3d1tX4iEu7bpg&r=AHk4BVn6sW_wKrFU1lW6bg&m=Rvv-7mj2Kbqz2zNnEnmEKwUHOanv6WaM5ZDqccIP_vM&s=mqnuOw_aOoYevp8zAVYrYIs1FWf1tImCOP_Fdu4vNFY&e=.

bondagit commented 3 years ago

After additional tests on the X86 platform I can reproduce a corruption of the output file by changing the driver base timer period to 48 ("tic_frame_size_at_1fs": 48) in the daemon configuration file used by the test (test/daemon.conf). This was set to 192 in my commit. So I repeated the tests changing the "tic_frame_size_at_1fs" parameter to 48 (1 ms), 96 (2 ms) and 192 (4 ms). The result is that while on the ARM platforms (I tested so far) the output file is always OK for all the values, on my X86_64 I can systematically reproduce a corruption with 48 only and it always works with 96 and 192. On this platform my attempts to use a low latency or preemptive kernel didn’t solve the issue. The root cause of the issue is unknown to me but the test I implemented can be used to verify if a specific platform is able to serve or not at the specified base period. In case of a negative result (corruption of the output file) my suggestion is to switch to another.

jtc-dolby commented 3 years ago

I have executed the 'run_test.sh' on the issue 17 branch several times without making any changes and the checker returns OK. Should I try again with tic set to 48 instead of 192?

jtc-dolby commented 3 years ago

I ran the tests again with TIC=48 three times and it was OK everytime.

gugaguga24678 commented 3 years ago

When I have originally opened the issue I was running the tests using the daemon to playback a 48Khz L24 stereo file and the Audinate Dante USB to perform the recording. Apparently this device has strict latency constrains and when it receives a late packet (that is a packet exceeding the 2ms delay in this configuration) it replaces the late samples with zeros and this leads to such recordings. I enclosed a picture showing such late packets as reported by the Dante device latency tab: Dante latency

Hi @bondagit ,is it possible to change the delay setting from 2ms to 5ms? I have the same problem. When setting tic_frame_size_at_1fs to 48, the Dante Controller shows more than 50% of the packets are late. The drops were much more frequent when setting tic_frame_size_at_1fs to 96 or 192. I ran the run_test.sh and check several times, the program prints out "error at position:" every time.

bondagit commented 3 years ago

Hi @bondagit ,is it possible to change the delay setting from 2ms to 5ms?

I don't know how to do this, check with Audinate.

I have the same problem. When setting tic_frame_size_at_1fs to 48, the Dante Controller shows more than 50% of the packets are late. The drops were much more frequent when setting tic_frame_size_at_1fs to 96 or 192. I ran the run_test.sh and check several times, the program prints out "error at position:" every time.

What platform are you using for the tests ? What Linux Kernel version ? The parameter "tic_frame_size_at_1fs" sets the driver base period. At each tick of this period the driver schedule the outbound RTP packets for sending and processes the inbound packets. So if your Dante is using a delay setting of 2ms you have to set tic_frame_size_at_1fs to 48 (1ms), any larger setting cannot work. Despite this as you can see you still have late packets and also the loopback test fails so your platform doesn't seem to support this setting. My suggestion is to try with another board.

bondagit commented 3 years ago

Hi @jtc-dolby

I ran the tests again with TIC=48 three times and it was OK everytime.

So your platform looks ok, or at least it's ok with a 4 mins recording. Probably we should test it for a longer period of time. But, if I remember it correctly, you reported drops every minute in case of playback in memory mapped mode and this is weird. Do you have a Dante AVIO USB device ? can you monitor for late packets and measure the avg and the peak latency of the output stream ?

jtc-dolby commented 3 years ago

I did see drops fairly frequently (every minute or so). I don't have a Dante AVIO device but the Lawo devices I do have gather statistics regarding late packets. I can use the same build on the same machine as I just ran the tests and see what happens. Is it possible there are differences between stream type or codec? My focus was the AM824 because this is carrying data and that is covered by a CRC so corruption is very apparent. I can put data inside L16/L24 for testing purposes.

Aika0 commented 3 years ago

Testing issue 17 with "tic_frame_size_at_1fs": 48 on various Raspberry Pis

Device A (10.1.14.120) Raspberry Pi 4 Model B Rev 1.1 Hardware: BCM2711 Revision: a03111 Kernel: 5.4.83-v7l+ Result 1: ok Result 2: ok Result 3: ok

Device B (10.1.14.77) Raspberry Pi 3 Model B Plus Rev 1.3 Hardware: BCM2845 Revision: a020d3 Kernel: 5.4.83-v7+ Result 1: ok Result 2: ok Result 3: ok

Device C (10.1.14.55) Raspberry Pi 3 Model B Rev 1.2 Hardware: BCM2835 Revision: a02082 Kernel: 5.4.83-v7+ Result 1: ok Result 2: ok Result 3: ok

Device D (10.1.14.92) Raspberry Pi 2 Model B Rev 1.1 Hardware: BCM2835 Revision: a21041 Kernel: 5.4.83-v7+ Result 1: ok Result 2: ok Result 3: ok

bondagit commented 3 years ago

Testing issue 17 with "tic_frame_size_at_1fs": 48 on various Raspberry Pis ....

Great job, thank you. The results were as expected as I was able to get it to work correctly even on a board with an ARM SOC older than the ones you tested. My idea was to create a new page in the documentation with a list of compatible HWs. Anyway at the moment the test validates only the configuration with a single L24 / stereo / 48Khz source and sink which is however one of the most used.

bondagit commented 3 years ago

I did see drops fairly frequently (every minute or so). I don't have a Dante AVIO device but the Lawo devices I do have gather statistics regarding late packets. I can use the same build on the same machine as I just ran the tests and see what happens.

Thanks, I think this should help to narrow down the issue: If you have late packets it can be that your receiver device is dropping them (replacing the late samples with silence).

Is it possible there are differences between stream type or codec? My focus was the AM824 because this is carrying data and that is covered by a CRC so corruption is very apparent. I can put data inside L16/L24 for testing purposes.

Yes, a different codec can make a difference. For this reason I was also thinking to enhance this test suite to allow for different configurations. In the meantime you could try modifying my test to play an AM824 (L32) file and record an L32 wav file. Afterwards you should be able to use the CRC to detect registration errors instead of relying on the samples sequence as I did. This way we can test using the loopback interface and we exclude any network related issue.

bondagit commented 3 years ago

Hi @jtc-dolby , first of all please include a couple of new patches I have just released for the driver. Apart from this I enhanced the test suite to support multiple configurations and I re-executed some tests on the platforms I have here with the results reported below.

On X86_64 with tic_frame_size_at_1fs set to 48 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> fails ./run_test.sh S24_3LE 48000 2 5 -> fails ./run_test.sh S32_LE 48000 2 5 -> fails On X86_64 with the tic_frame_size_at_1fs set to 96 and 192 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> OK ./run_test.sh S24_3LE 48000 2 5 -> OK ./run_test.sh S32_LE 48000 2 5 -> OK On ARM with tic_frame_size_at_1fs set to 48, 96, 192 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> OK ./run_test.sh S24_3LE 48000 2 5 -> OK ./run_test.sh S32_LE 48000 2 5 -> OK

So, according to my tests the sample format doesn't affect the test result. One of the possible reasons is that the samples size of driver internal playback and capture buffers is always set to 32 bit.

Can you repeat these 3 tests using your platform ? Thanks.

bondagit commented 3 years ago

I have just found another issue in the driver. This time it's related to the playback in read-write interleaved mode and it affects the initial samples of the output stream. The issue is solved by 77cf7bd. This patch has anyway no effect on the ./run_test.sh results reported above. The new problem can be easly reproduced by removing the option (-M) to aplay in run_test.sh script.

bondagit commented 3 years ago

Hi @jtc-dolby, did you have the chance to run the new tests? Thanks

jtc-dolby commented 3 years ago

I ran the test. Results below. I didn't reinstall the driver between tests but just edited the conf file and reran the script. Is that correct?

Here are my results from the tests: On X86_64 - Ubuntu 19.10 with tic_frame_size_at_1fs set to 48 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> fails ./run_test.sh S24_3LE 48000 2 5 -> fails ./run_test.sh S32_LE 48000 2 5 -> fails with the tic_frame_size_at_1fs set to 96 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> fails ./run_test.sh S24_3LE 48000 2 5 -> fails ./run_test.sh S32_LE 48000 2 5 -> OK with the tic_frame_size_at_1fs set to 192 in test/daemon.conf, I have the following results: ./run_test.sh S16_LE 48000 2 5 -> OK ./run_test.sh S24_3LE 48000 2 5 -> fails ./run_test.sh S32_LE 48000 2 5 -> OK

bondagit commented 3 years ago

Yes, that's correct because the last test script uninstall and reinstall the driver at every run. It's hard to draw a conclusion here but what I can certainly say is that you cannot achieve 1ms latency on the platform you are using, or better saying you cannot achieve it with a good level of reliabilty. And this is also consistent with the results you had using your final setup. Please consider that the test result is very much affected by contention among the drivers & processes running on your platform and it provides only an indication of the latency you can achieve: the whole solution is not thought to offer a reliable communication channel but it's meant to keep the latency under control. Anyway it can still be that the problem depends on some race condition or bug in the driver code. At present I can achieve the best results using ARM boards running a minimal Linux distro dedicated to AES67 only.

jtc-dolby commented 3 years ago

OK. I think that is a reasonable theory. My machine is pretty old and 1ms is quite tight. I'd like to prove it by increasing the latency and making the problems go away. 192 (4ms) certainly looks better. Can I go higher? How about 480 (10ms)?

bondagit commented 3 years ago

yes, I have succesfully tried to raise it up 480 (10ms) included. If think even more should be possible. I have just noticed that the last patch I released may create troubles. Please repeat the test leaving out the folloiwing patch for the driver:

ravenna-alsa-fix-playback-rw-mode.patch

This was not included at the time I ran the tests my side. I will revise it as soon as I have time.

bondagit commented 3 years ago

Hi @jtc-dolby, I have reworked the patch for the read-write interleaved mode and also simplified and unified the driver PCM interface. See commit 8a56006 The previous patch was not always working properly and causing troubles also in memory mapped mode. After this new patch I re-tested the configurations above multiple times on both X86_64 and ARM and I consistently have the same results already reported (by me) with memory mapped and read-write interleaved modes. Could you re-try the tests your side using the same configurations you used previously and the latest build ? Thanks.

keygee commented 3 years ago

Hi Andrea, I made some tests and I would like to share with you my results:

tests made:

tic_frame_size_at_1fs set to 48:

tic_frame_size_at_1fs set to 96:

tic_frame_size_at_1fs set to 192:

I made these tests on:

Machine 1:

Machine 2

Machine 3

All results have been OK in all machines, except if I make "stressing work" during the test (compiling heavy code or transcoding videos...). Repeating the tests in a idle state resulted always OK.

If could be of any interest, I can share my .config ( or part of it ) of my kernels Another info that could (or not) be of any interest, is that I compiled all without clang, because I haven't in my machines.

If you need more info or tests, please let me know.

Cheers Guido

bondagit commented 3 years ago

Hi Guido , thanks a lot for taking the time to run these tests. Your timely contribution is very much appreciated ;-) and your tests also show that there is no open issue on the X86_64 platform. I have also run more tests my side and the driver situation looks pretty stable now. Of course as you experienced the tests results may be affected by other loads running on the CPU at the same time and this has been discussed already and the problem could be mitigated by setting high priority for aplay & arecord or just by taking care to stop additional loads during the tests ... I will move the test suite to the master branch as replacement of the demo, add some documentation and prepare the repo for release 1.0