DJ2LS / FreeDATA

A free, open-source, multi-platform application for sending files and messages, using the codec2 HF modems
https://wiki.freedata.app/
GNU General Public License v3.0
144 stars 18 forks source link

File transfer failure #129

Closed pe1rrr closed 2 years ago

pe1rrr commented 2 years ago

I have managed to successfully set up FreeData on two raspberry Pi4s after upgrading them both to Bullseye and removing all Hamlib usage from the equation, that's for another issue.

In this issue the TNC receiving a file returns a trace back during the initial stages of a file transfer, here is the trace:

RX_N_FRAME_OF_BURST-0 RX_N_FRAMES_PER_BURST-1 2022-01-22 13:27:52 [debug ] [TNC] static.RX_BURST_BUFFER buffer=[b"BOF2rx\x9c\xec\xbaW\xcf\xac\xc0\xda%\xf6_\xbe\xdb\x1ey\x93\x1b\xc6W\xd0\xe4\x9c\x93\xde\x1bR\x93s\xc6\x7f\xde\xf5\x9es\xbe\xcb#Y\xb2\xe4\x99\x8b]R7\r\rE\xd5\x93\xd6Z\x05\xff\xc7\xbf\xe5\xdb\xbf\xfd\xf7\x7f\xfb\xfe\xdb\x7f\xfb\xb7\xef\x00~\xa4\xe9\xfa\xbf5S\xf9\xbb\xfb{\xbc\xee\x93\xb2\xf8\xf9\xd3L\xc5\xef\xa1\x1c\x1c\xc9\x93-\xf9\xef\xff\xe5\xf0\xff\x9e&kA\xff\xed\xe7\x0f\xd5\xfc\xfc\xc1h\xdar\xdaX\xb6K\x9a\xa1-\x8e\x8eh\xa6\xa4i\x16\xfca\xb3\x8e\x1dVS\x0cv]\x0f\xa2\x15\xb0\xa5\xe9\xdf\xb3\xb87MK\xbf;\\\xea2--\xe5\x13\xe8\xe3w\x9f\x01\x1f\xc9\x99\xee\x1c\xfc\xf5\xbb\xff{\x92u\xeecF\xd3\xd9\xef\xb5\xda\xefA\xf3\x1f\xfd\xd0\x19\xf1\xaf\xfe\xfe\xb6\xbf\xedo\xfb\xdb\xfe\xb6\xbf\xedo\xfb\xdb\xfe\xb6\xbf\xed\x7f\xf1\xc6\xb7\xb4\xf6O\xea\xca\x00\xde\xcb|\xa2\xf6\x97\xff\xfe\xe7\xfe\xdaN\xdc\xbf\xa8\xadv\xeb\xdbN\xd2\xffe\x9f\xa8N\x9a9i\xfaC\x83/\xba\xfe\x07/\xa6\x99g\xfe\xdd|\xfe\xe7\xcc\xe9o\xfb\xdb\xfe\xb6\xbf\xedo\xfb\xdb\xfe\xb6\xbf\xedo\xfb\xdb\xfe_7\xad\xa1/\xed\x19O\xdd\x1do\xcd\xa1O\x8d\x05[v|\xb4\x93\x06[\xee1\x1a\x1a6\x1a\xe9\x92X\xfa\x04[\xf0\xd1\x9e\xdf\xebR\x0b\x10\xe1w\xed\xf8w\xc1\xd9Z\x83D\xb4\xa1\x8c\x1d\x0f\x15\xc1\x1f\xb5\xe7\xdb\x14\x91;\xb5\xd7\x8f\xd4\xa1\xb0(\xa4\x0f\xcd\xc1N\xf5\xa4I\x13\xad\xce(\xd0\x978\xb4\xca\xa8\xf7\x87$\xc0(\xf9\xf5z\x1fwV&\x81E\xc9p\x06\xbb.\xa3g\x9c\xde%B'\x15\x83\xdcy\xe8d\xf8\x82N$\x88\x86\xc7\x9f\xec\xe7\x8fYB\x8a)\x96D!\xc0g\x1a\xf8P\xe40X\x1a\\{\xf6L\x98\xe9\xc8U,Pu\xecN\xbf\xfb[\x1c\xda\x95zK/S\x94\xdb\xb8\x99\x1c\x9b\x8bJp\xdd\x9a\x0e\x1a\x91\xf5v\x0f\xce\x1fs\xd1>\x8d\x9a<r4G\xd5!\x03"] 2022-01-22 13:27:52 [info ] [TNC] ARQ | RX | ACK 2022-01-22 13:27:52 [debug ] [TNC] mod out shorter than audio buffer delta=3840 2022-01-22 13:27:53 [debug ] ACK RECEIVED.... Exception in thread worker thread receive: Traceback (most recent call last): File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner self.run() File "/usr/lib/python3.9/threading.py", line 892, in run self._target(*self._args, **self._kwargs) File "/mnt/disk1/SuperBoxStuff/git/FreeDATA/tnc/data_handler.py", line 115, in worker_receive self.process_data(bytes_out=data[0],freedv=data[1],bytes_per_frame=data[2]) File "/mnt/disk1/SuperBoxStuff/git/FreeDATA/tnc/data_handler.py", line 152, in process_data self.burst_ack_received(bytes_out[:-2]) File "/mnt/disk1/SuperBoxStuff/git/FreeDATA/tnc/data_handler.py", line 634, in burst_ack_received self.speed_level += 1 AttributeError: 'DATA' object has no attribute 'speed_level'

Here is a video of the setup producing this:

(Audio is from iOS RTLSDR app running in background so there's an audio delay)

https://youtu.be/94nLPF_2f7Q

DJ2LS commented 2 years ago

Important line:

"/mnt/disk1/SuperBoxStuff/git/FreeDATA/tnc/data_handler.py", line 634, in burst_ack_received self.speed_level += 1 AttributeError: 'DATA' object has no attribute 'speed_level'
DJ2LS commented 2 years ago

This needs to be fixed with #128

DJ2LS commented 2 years ago

Bug fix: Adding self.speed_level to class init

DJ2LS commented 2 years ago

Should have been fixed with latest commit. Needs some testing from branch ls-leave-prototype

pe1rrr commented 2 years ago

Switched to branch 'ls-leave-prototype'

Your branch is up to date with 'origin/ls-leave-prototype'.

pi@superbox:~/git/FreeDATA/tnc $ python3.9 daemon.py

[C2 ] Codec2 library found - libcodec2.so 2022-01-22 23:42.15 [info ] [DMN] Starting TCP/IP socket port=3001

Exception in thread UPDATE_SERIAL_DEVICES:

Traceback (most recent call last):

File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner self.run() File "/usr/lib/python3.9/threading.py", line 892, in run self._target(*self._args, **self._kwargs) File "/mnt/disk1/SuperBoxStuff/git/FreeDATA/tnc/daemon.py", line 56, in update_serial_devices crc_hwid = crc_algorithm(bytes(hwid, encoding='utf-8')) NameError: name 'crc_algorithm' is not defined

DJ2LS commented 2 years ago

Thanks for the feedback. This branch is highly experimental because of some huge changes! But your feedback really helps!

pe1rrr commented 2 years ago

Thanks for the feedback. This branch is highly experimental because of some huge changes! But your feedback really helps!

I am fly by the seat of my pants when it comes to testing and doing things with 'git', I wasn't sure if the message was directed at me before as it just arrives in my email inbox as-is. Happy to help test, as we can see the TNC daemon on this branch doesn't start up at all here before throwing the error.

I am testing HF path, also along with G8BPQ as we are on the chat node together.

DJ2LS commented 2 years ago

@pe1rrr Great hearing this! I'm also in contact with him, so this really helpfully! Btw I fixed this issue. Ready for the next one :smile: .

DJ2LS commented 2 years ago

@pe1rrr feel free commenting directly within the pull request https://github.com/DJ2LS/FreeDATA/pull/128

pe1rrr commented 2 years ago

Bit more testing done on the ls-leave-prototype branch with sending a file over a wobbly HF path, this seems to throw the TNC off big time! Here is a screen recording demonstrating this, the issue seem to be that the receiving TNC might be reconstructing already received frames and appending them to the incoming file making the progress bar complete further than it should be.

The TXed and RXed bytes do not match up either.

The TXing in this situation is on a radio with mic compression which sometimes causes the signal to have to drop back to slower mode, sort of an incidental HF path quality simulation (the rig is remote and that setting isnt switchable hi hi), also there are some audio buffer drop outs that contribute to this switching back and forth.

There also appears to be a problem with my remote system trying to access a jackd audio device, i do no have jack installed on that system so I am not sure why it is trying to poll the shared Memory thingy (see recording).

https://youtu.be/ilnRSSXVDS4

(Give 10 mins after this post time for the HD video to be available)

drowe67 commented 2 years ago

The TXing in this situation is on a radio with mic compression which sometimes causes the signal to have to drop back to slower mode, sort of an incidental HF path quality simulation (the rig is remote and that setting isnt switchable hi hi), also there are some audio buffer drop outs that contribute to this switching back and forth.

It's important to have a very clean Tx path - no filters or compression, and adjust Tx drive so the ALC is just tickled. More power is not better. This is also a common problem with FreeDV set up - please see the FreeDV User Manual.

pe1rrr commented 2 years ago

The TXing in this situation is on a radio with mic compression which sometimes causes the signal to have to drop back to slower mode, sort of an incidental HF path quality simulation (the rig is remote and that setting isnt switchable hi hi), also there are some audio buffer drop outs that contribute to this switching back and forth.

It's important to have a very clean Tx path - no filters or compression, and adjust Tx drive so the ALC is just tickled. More power is not better. This is also a common problem with FreeDV set up - please see the FreeDV User Manual.

Sure, I am aware of this, as I clearly pointed out in the part you quoted!

DJ2LS commented 2 years ago

@pe1rrr I pushed a hot fix for rigctld! It should now run more stable, so you can hopefully test a little bit better

DJ2LS commented 2 years ago

Bit more testing done on the ls-leave-prototype branch with sending a file over a wobbly HF path, this seems to throw the TNC off big time! Here is a screen recording demonstrating this, the issue seem to be that the receiving TNC might be reconstructing already received frames and appending them to the incoming file making the progress bar complete further than it should be.

The reason for the not matching progress-bar could be, that the statistics will be updated as soon as a file has been transmitted (RX side ) or a transmission of a frame starts ( TX ). I could change this, so updating the stats on TX will be done when a transmission of a frame finished.,

The TXed and RXed bytes do not match up either.

Can you reproduce this? If you are sending the file two or three times, will the transmission still fail? If so, could you send me the file + the error log ( within tnc folder ), so I can test it as well?

There also appears to be a problem with my remote system trying to access a jackd audio device, i do no have jack installed on that system so I am not sure why it is trying to poll the shared Memory thingy (see recording).

Yes, I'm not sure, why portaudio ( the audio library I'm using ) is trying to access JACK. This also causes problems on Suse Linux. But it seems normally not affecting usability. However, we need to keep this in mind for a possible later fix, if its not affecting testing too much.

pe1rrr commented 2 years ago

Ok - Starting fresh from a new pull of ls-leave-prototype

First as an aside- I am chasing a segfault, I have managed to see both TNCs segfault after some idling, but I will open an issue with the backtrace when I catch it again, trying to reproduce it is challenging, but its lurking out there. Heads up :)

I have found that using my test file: a highly compressed executable, a piece of self-extracting-and-playing music I created some time ago. This file is sent in one direction from one TNC to another after initialising them freshly.

Upon sending the same file (from the received directory) back to the originating TNC, the file fails to transfer and leaves the TNCs in a blue-bar situation. The progress bar sticks at 100%, the TNC on the transmitting side reckons it is complete, but the receiving side never gets the file.

I have screen recorded this session so hopefully it is of some help.

https://youtu.be/pEpttp1_ZlU RedHeat_-_Gravity_Assist.zip

Attached: The executable may trigger a virus warning, its compressed for a competition in the demoscene using crinkler , its not a virus.

DJ2LS commented 2 years ago

Thanks, I will have a closer look tomorrow. I recommend not using "auto" mode for sending data on the raspberry pi4 yet, as it seems it has problems with processing speed. This is a problem, we may need to optimise some parts of codec2 ( see the discussion on GitHub 'vectorisation' for further information ) and maybe the FreeDATA protocol as well. When running in auto mode, the modem listens to several modes in parallel, which causes an audio buffer overflow on weak machines. For this reason you should select one of the data channels - they are less cpu consuming. This should make it easier for me debugging problems and you are hopefully not running into buffer overflows and therefore packet loss. Focusing on protocol and modem issues is another part.

DJ2LS commented 2 years ago

okay, I found the origin of the problem ( its a network problem on the gui side ) I would suggest, that we need to differ between a tnc problem and a gui problem for future bug reporting. If a file transfer fails, then we should check if it was successfull or not by checking the CLI. There is also a log file within the FreeDATA/tnc folder.

Thanks for the bug report. This one is helping me, improving the network command stack and network handling! :smiley:

DJ2LS commented 2 years ago

With the latest commit, the problems mentioned above should be solved.

pe1rrr commented 2 years ago

With the latest commit, the problems mentioned above should be solved.

There is a problem with the communication between the GUI and the TNC after sending a successful file transfer it has problems in the receiving role, as when in the receiving role, the GUI does not write the file to disk, nor does it display any status messages during the file transfer.

Lets say from FRED to LENNY is successful, then sending back the file, from LENNY to FRED, FRED's GUI doesn't show "Incoming File..." or "Receiving File..." nor does it write the file to disk after the TNCs have transferred it.

Doing this the opposite way around also reproduces this issue. Whichever TNC sent the file first has a problem with TNC file transfer status communication when switching to the receiving role.

What may be odd is the GUI does still show the bitrate and progress bar moving.

Video: https://youtu.be/eaiMA4HeJMY

pe1rrr commented 2 years ago

Additional observation: While writing the previous comment report, I returned to the desktop to find that the TNC that would have been receiving and writing the file to disk (but didn't) had segfaulted, however the TNC's thread displaying messages to the console was still running, until I killed it.

image

This could be the root of the segfaults I have been coming across recently.

DJ2LS commented 2 years ago

Additional observation: While writing the previous comment report, I returned to the desktop to find that the TNC that would have been receiving and writing the file to disk (but didn't) had segfaulted, however the TNC's thread displaying messages to the console was still running, until I killed it.

image

This could be the root of the segfaults I have been coming across recently.

I'm currently looking for a raspberry pi 4, so I can do some additional testing with this device. But this needs some more time as they are rare and (too) expensive. Devices with fair prices are hard to find.

pe1rrr commented 2 years ago

Additional observation: While writing the previous comment report, I returned to the desktop to find that the TNC that would have been receiving and writing the file to disk (but didn't) had segfaulted, however the TNC's thread displaying messages to the console was still running, until I killed it.

image

This could be the root of the segfaults I have been coming across recently.

I'm currently looking for a raspberry pi 4, so I can do some additional testing with this device. But this needs some more time as they are rare and (too) expensive. Devices with fair prices are hard to find.

Are you implying that the issue I reported (not the jackd issue thats in the screenshot) is a problem specifically with Raspberry Pi4s? I am not sure what to make of your comment, honestly. 😅 Erm.. What are you addressing specifically that is related to the RPI?

DJ2LS commented 2 years ago

Sorry for my misleading reply 🙂 The tnc itself isn't writing files to the disk. It just holds them in a memory buffer. Files will be written to the disk buy gui. These errors can't occur within the tnc cli.

The jack audio errors occures mainly on raspberry pi and sometimes on Suse Linux. But I wasn't able reproducing this, yet. It may be possible, portaudio ran into segfault for some reason ( maybe because of jack? ), but the other threads are still running ( as you already figured out ). So the problem is possibly portaudio combined with jack audio 🤔 but I'm not sure as long as I can't reproduce it.

DJ2LS commented 2 years ago

However, the files not beeing written to disk is gui related. I will have a look at this. I changed some network related things and this seem to be the origin. So in this case we have to problems. Jack/Portaudio on tnc side and files not beeing written to disk on gui side

DJ2LS commented 2 years ago

@pe1rrr I did a short test with a virtual raspberry pi OS and it seems I'm also getting jack errors as well. So I have a way working on this without owning an raspberry pi ( however, the cpu platform is still not ARM. For a native test on a raspberry pi I need to order one ).

DJ2LS commented 2 years ago

With the latest commit, the problems mentioned above should be solved.

There is a problem with the communication between the GUI and the TNC after sending a successful file transfer it has problems in the receiving role, as when in the receiving role, the GUI does not write the file to disk, nor does it display any status messages during the file transfer.

Lets say from FRED to LENNY is successful, then sending back the file, from LENNY to FRED, FRED's GUI doesn't show "Incoming File..." or "Receiving File..." nor does it write the file to disk after the TNCs have transferred it.

Doing this the opposite way around also reproduces this issue. Whichever TNC sent the file first has a problem with TNC file transfer status communication when switching to the receiving role.

What may be odd is the GUI does still show the bitrate and progress bar moving.

Video: https://youtu.be/eaiMA4HeJMY

I did a test here several times and can't reproduce the problem :thinking: We should keep this in mind, maybe it will occur again.

DJ2LS commented 2 years ago

Minor protocol change (not finished yet) with latest commit.

pe1rrr commented 2 years ago

Minor protocol change (not finished yet) with latest commit.

Wow great improvement in the latest commit, very impressive update- thanks for the added feature options with the waterfall. I have been able to transfer files from and to an instance in quick succession without running into code-related problems.

What I am running into is probably specific to the hardware here as you know, and very likely because my machines are quite busy.

With a big file, A retry loops occurs and the receiving TNC seems to go "deaf" to the packets being re-sent, so the retry counter on the TX side increases until it hits max-retries, and the RX side sending an ACK sometimes may key over the retry. Its probably a buffer issue.

Here is a snippet of it happening in the video: https://youtu.be/EV3FexoqGzI

Other than that, the whole thing is really coming together. We are QRV on 14.1022 USB with ax25 at 1100hz and 2000hz with the VARA/ARDOP/etc fixed at 1500hz, I hope FreeDATA will be able to fit as well in that space. Hopefully we can do some OTA DX testing with each other soon when condx improve.

Edit: here is a visual and why its that specific frequency: https://eindhoven.space/2021/12/24/where-to-find-packet-on-20-meter-band/

DJ2LS commented 2 years ago

@pe1rrr Thanks for your feedback shortly after pushing the commit 👍 Yes, the problem with the cpu and stuck audio is still a problem. But its hard to debug, but I came closer to the source of the problem. Maybe its not the CPU, but on slower CPUs several circumstances come together so the problem occurs more often. However, the circle around the bug is getting closer 🎯 The problem with the overlapping ACK/NACK and data may have its origin in timeouts for the new protocol. I didn't test this yet on slow cpu devices, so the timeouts may be too short. But I ordered a raspberry pi 4 now and hopefully this will help me in increasing support for this platform. You can also adjust the timeouts by yourself if you want and giving me feedback: https://github.com/DJ2LS/FreeDATA/blob/8384bf8d12e35ac4a3ad8ea1423d5af060f8463c/tnc/data_handler.py#L53

If I understand you correctly, then theres a need for adjusting the frequency range where we want to transmit so different modes can run on the same center frequency? I didn't work in this yet, so this might be another feature which is/could be required!

pe1rrr commented 2 years ago

@pe1rrr Thanks for your feedback shortly after pushing the commit 👍

My pleasure !

Yes, the problem with the cpu and stuck audio is still a problem. But its hard to debug, but I came closer to the source of the problem. Maybe its not the CPU, but on slower CPUs several circumstances come together so the problem occurs more often. However, the circle around the bug is getting closer 🎯

👍

The problem with the overlapping ACK/NACK and data may have its origin in timeouts for the new protocol. I didn't test this yet on slow cpu devices, so the timeouts may be too short. But I ordered a raspberry pi 4 now and hopefully this will help me in increasing support for this platform.

You can also adjust the timeouts by yourself if you want and giving me feedback:

https://github.com/DJ2LS/FreeDATA/blob/8384bf8d12e35ac4a3ad8ea1423d5af060f8463c/tnc/data_handler.py#L53

I will give that a try and see how/if it has an effect.

If I understand you correctly, then theres a need for adjusting the frequency range where we want to transmit so different modes can run on the same center frequency? I didn't work in this yet, so this might be another feature which is/could be required!

We are hoping that it will be at least centered on 1500hz, and hopefully be able to go as narrow at 500hz as that is the max possible in this configuration without directly QRMing the ax25 channels, VARA500, ARDOP500 and RPR300 I think fit ok so hopefully FreeDATA might too.

DJ2LS commented 2 years ago

Thanks for another use case! We might create another issue for this so we don't forget it. This is also a topic for the documentation

DJ2LS commented 2 years ago

Ah and I'm actually working on a 500Hz mode as well for circumstances where only small bandwidth are allowed or usable.

drowe67 commented 2 years ago

Ah and I'm actually working on a 500Hz mode as well for circumstances where only small bandwidth are allowed or usable.

Is there a SNR or bit rate requirement for the 500 Hz mode? For example using QAM it could be higher SNR/ higher bit rate.

DJ2LS commented 2 years ago

@pe1rrr Id like to close this issue, as we have mixed topics in this issue and some are outdated. So we keep things a little bit more structured.

pe1rrr commented 2 years ago

I was thinking the same thing! 😎 Let's do this.