SOTAmat / SOTAcat

CAT control for Elecraft KX radios and SOTAMAT
https://sotamat.com
Other
9 stars 2 forks source link

Add locks to fix #25 #26

Closed poynting closed 3 months ago

poynting commented 4 months ago

Add locks in handler_prepareft8_post. Verify lock status in get_from_kx and put_to_kx commands, since they are called outside of kx_commands.

jeffkowalski commented 4 months ago

First off, thank you 100% for jumping on this. You're a star. Your addition of locks in the preparation code is spot-on. We also need locks while transmitting and for cleanup (which were present). The lock()/unlock() style, though is clumsy, so let's instead use the stack-allocated-initialized-and-released form, which provides extra safety as the locker goes out of scope. I'll commit a similar fix, with that style, and you can test to see if it passes muster on your system as it does on mine.

Second, your error messages about non-locked port while we're calling kx functions is very well-intentioned! I think it's redundant though. The present error messages occur specifically around all accesses to the uart - the critical resource, but your proposal brings up an interesting opportunity to move the warnings exclusively to the "api" level in kx-commands, rather than at the lowest level. Perhaps that's more sane. I'll write up an issue for that, but leave the errors where they are for now.

Also for review - you add CommandInProgress = false; to the end of the prep code. I'll keep that, but I think the entire logic of "we have an FT8 going on" is suspect, and needs reconsideration. Too many globals and threads, imho. I'll write up a separate issue for that, but presume your addition is beneficial.

poynting commented 4 months ago

I'll admit I didn't fully track the whole call stack in the time I had available. It seemed the lock/unlock method and the on-stack method were used at different times, and I wasn't sure what the design intent was so I tried to follow along what else was closest in the code. If there wasn't any specific intent for the different use cases, that helps clear up my understanding, but if there was, I'd be curious to hear what it was.

As far as CommandInProgress = false; this was necessary for proper operation. My first few attempts at locking at the lower level using the on-stack method resulted in unexpected "user timeout, going to sleep" events. I ultimately realized that the flag wasn't being reset prior to function return and this was causing the unexpected timeout, though I'm still not totally sure of the method. It seemed the blinks value was getting set to a very large number >4 and causing an instant shutdown. I guess this could happen if the now time was before the action time due to a threading issue.

Since we've moved to C++ there are likely some design patterns that can help with this, but ultimately it seems like we want all user actions to have access to the same watchdog and kick it when an action occurs. If the only user actions are http accesses, there may be an LWIP function that we can define to be called pre- or post-request handling, which may help make this more generic.

jeffkowalski commented 4 months ago

We are on exactly the same page. Please test if this sole-file change of mine works for you. If not, let's refine it per the preferred pattern. And above all - thank you for your interest and contribution!!!

poynting commented 4 months ago

It doesn't appear this resolved the issue, but it did change the behavior. It now goes to sleep upon finishing the FT8 setup. I saw this in my testing as well.

If I disable the deep sleep block:

if(blinks>4)
   ...
   enter_deep_sleep()

then it does send the messages, but I'm left with a rapidly and continuously blinking LED.

V (58455) sc:webserve: trace: my_http_request_handler() with URI: /api/v1/prepareft8?messageText=STM%20K5EM%2FY1GS&timeNow=1709796062032&rfFrequency=21074000&audioFrequency=1553
V (58465) sc:hdl_ft8.: trace: handler_prepareft8_post()
V (58465) sc:hdl_ft8.: request buffer[93] = "messageText=STM%20K5EM%2FY1GS&timeNow=1709796062032&rfFrequency=21074000&audioFrequency=1553"
I (58465) sc:kx_cmds.: locking radio
I (58465) sc:kx_cmds.: radio LOCKED --
V (58465) sc:kx_cmds.: trace: get_kx_state()
V (58465) sc:kx_cmds.: trace: get_from_kx()
V (58465) sc:kx_cmds.: trace: uart_get_command()
I (58475) sc:kx_cmds.: command 'FT;' returned 'FT0' after 10.522 ms
I (58475) sc:kx_cmds.: kx command 'FT' returns 0
V (58475) sc:kx_cmds.: trace: get_from_kx()
V (58475) sc:kx_cmds.: trace: uart_get_command()
I (58495) sc:kx_cmds.: command 'FA;' returned 'FA00021062000' after 23.630 ms
I (58495) sc:kx_cmds.: kx command 'FA' returns 21062000
V (58495) sc:kx_cmds.: trace: get_from_kx_menu_item()
V (58495) sc:kx_cmds.: put_to_kx(MN) attempting value 58
V (58495) sc:kx_cmds.: trace: get_from_kx()
V (58495) sc:kx_cmds.: trace: uart_get_command()
I (58545) sc:kx_cmds.: command 'MN;' returned 'MN058' after 45.915 ms
I (58545) sc:kx_cmds.: kx command 'MN' returns 58
I (58545) sc:kx_cmds.: command 'MN' successful; value = 58
V (58545) sc:kx_cmds.: trace: get_from_kx()
V (58545) sc:kx_cmds.: trace: uart_get_command()
I (58565) sc:kx_cmds.: command 'MP;' returned 'MP020' after 13.835 ms
I (58565) sc:kx_cmds.: kx command 'MP' returns 20
V (58565) sc:kx_cmds.: put_to_kx(MN) attempting value 255
V (58565) sc:kx_cmds.: trace: get_from_kx()
V (58565) sc:kx_cmds.: trace: uart_get_command()
I (58585) sc:kx_cmds.: command 'MN;' returned 'MN255' after 21.022 ms
I (58585) sc:kx_cmds.: kx command 'MN' returns 255
I (58585) sc:kx_cmds.: command 'MN' successful; value = 255
V (58585) sc:kx_cmds.: trace: get_from_kx()
V (58585) sc:kx_cmds.: trace: uart_get_command()
I (58595) sc:kx_cmds.: command 'MD;' returned 'MD3' after 8.077 ms
I (58595) sc:kx_cmds.: kx command 'MD' returns 3
V (58595) sc:kx_cmds.: put_to_kx(MD) attempting value 3
V (58595) sc:kx_cmds.: trace: get_from_kx()
V (58595) sc:kx_cmds.: trace: uart_get_command()
I (58965) sc:kx_cmds.: command 'MD;' returned 'MD3' after 373.325 ms
I (58965) sc:kx_cmds.: kx command 'MD' returns 3
I (58965) sc:kx_cmds.: command 'MD' successful; value = 3
V (58965) sc:kx_cmds.: trace: get_from_kx()
V (58965) sc:kx_cmds.: trace: uart_get_command()
I (58975) sc:kx_cmds.: command 'AP;' returned 'AP0' after 8.768 ms
I (58975) sc:kx_cmds.: kx command 'AP' returns 0
V (58975) sc:kx_cmds.: put_to_kx(MD) attempting value 3
V (58975) sc:kx_cmds.: trace: get_from_kx()
V (58975) sc:kx_cmds.: trace: uart_get_command()
I (59355) sc:kx_cmds.: command 'MD;' returned 'MD3' after 372.999 ms
I (59355) sc:kx_cmds.: kx command 'MD' returns 3
I (59355) sc:kx_cmds.: command 'MD' successful; value = 3
V (59355) sc:kx_cmds.: put_to_kx(FR) attempting value 0
V (59355) sc:kx_cmds.: trace: get_from_kx()
V (59355) sc:kx_cmds.: trace: uart_get_command()
I (59555) sc:kx_cmds.: command 'FR;' returned 'FR0' after 201.627 ms
I (59555) sc:kx_cmds.: kx command 'FR' returns 0
I (59555) sc:kx_cmds.: command 'FR' successful; value = 0
V (59555) sc:kx_cmds.: put_to_kx(FT) attempting value 0
V (59555) sc:kx_cmds.: trace: get_from_kx()
V (59555) sc:kx_cmds.: trace: uart_get_command()
I (59755) sc:kx_cmds.: command 'FT;' returned 'FT0' after 202.288 ms
I (59765) sc:kx_cmds.: kx command 'FT' returns 0
I (59765) sc:kx_cmds.: command 'FT' successful; value = 0
V (59765) sc:kx_cmds.: put_to_kx(FA) attempting value 21075553
V (59765) sc:kx_cmds.: trace: get_from_kx()
V (59765) sc:kx_cmds.: trace: uart_get_command()
I (59795) sc:kx_cmds.: command 'FA;' returned 'FA00021075550' after 34.390 ms
I (59795) sc:kx_cmds.: kx command 'FA' returns 21075550
I (59795) sc:kx_cmds.: command 'FA' successful; value = 21075550
V (59795) sc:kx_cmds.: put_to_kx(MD) attempting value 3
V (59795) sc:kx_cmds.: trace: get_from_kx()
V (59795) sc:kx_cmds.: trace: uart_get_command()
I (60175) sc:kx_cmds.: command 'MD;' returned 'MD3' after 374.940 ms
I (60175) sc:kx_cmds.: kx command 'MD' returns 3
I (60175) sc:kx_cmds.: command 'MD' successful; value = 3
V (60175) sc:kx_cmds.: put_to_kx(AP) attempting value 1
V (60175) sc:kx_cmds.: trace: get_from_kx()
V (60175) sc:kx_cmds.: trace: uart_get_command()
V (60295) sc:hdl_batt: battery voltage: 4.161 V
V (60295) sc:idletask: heap: 288916 (used 84312, free 204604) [bytes]
V (60295) sc:idletask: blinks 3799547
I (60385) sc:kx_cmds.: command 'AP;' returned 'AP1' after 209.384 ms
I (60385) sc:kx_cmds.: kx command 'AP' returns 1
I (60385) sc:kx_cmds.: command 'AP' successful; value = 1
I (60385) sc:kx_cmds.: -- radio UNLOCKED
V (60385) sc:hdl_ft8.: trace: cleanup_ft8_task()
I (60385) sc:hdl_ft8.: successful preparation
I (60665) sc:idletask: powering off due to inactivity
V (60665) sc:sleep...: trace: enter_deep_sleep()
I (60665) sc:sleep...: preparing for deep sleep:
I (60665) wifi:state: run -> init (0)
I (60665) wifi:pm stop, total sleep time: 43112222 us / 55270976 us

I (60665) wifi:<ba-del>idx:0, tid:0
I (60665) wifi:<ba-del>idx:1, tid:7
I (60665) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
V (60665) sc:wifi....: trace: wifi_event_handler()
I (60665) sc:wifi....: wifi event handler called with event_base: 'WIFI_EVENT', event_id: 5
I (60665) sc:wifi....: disconnected from WiFi network
V (60665) sc:wifi....: trace: wifi_event_handler()
I (60665) sc:wifi....: wifi event handler called with event_base: 'WIFI_EVENT', event_id: 3
W (60665) httpd_txrx: httpd_sock_err: error in recv : 113
I (60725) wifi:flush txq
I (60725) wifi:stop sw txq
I (60725) wifi:lmac stop hw txq
I (60725) sc:sleep...: wifi is stopped.
V (60725) sc:setupadc: trace: shutdown_adc()
I (60725) sc:setupadc: adc deregister curve fitting calibration scheme
I (60725) sc:sleep...: adc is shutdown.
I (60725) sc:sleep...: all gpio pins off and isolated.
I (60725) sc:sleep...: entering deep sleep...
I (60725) sc:sleep...: goodnight!
jeffkowalski commented 4 months ago

V (60295) sc:idletask: blinks 3799547 That is suspect.
Blinks should be <=4 according to idle_status_task.cpp:

        int blinks = ceil((now - LastUserActivityUnixTime) / (AUTO_SHUTDOWN_TIME_SECONDS / 4.0));
        ESP_LOGV(TAG8, "blinks %d", blinks);
        if (blinks > 4)
        {
//// shutdown

There's corruption somewhere, I think.

jeffkowalski commented 4 months ago

Looking at your wrapping lock()/unlock() in handler_prepare_ft8_post(), versus my scoped lock in the same region, I see no material difference?

poynting commented 4 months ago

I don't see a difference either. Adding some telemetry via some extra log messages seems to have "fixed" it, but I also inadvertently did a full rebuild in-between.

poynting commented 4 months ago

After a complete clean and rebuild, I have different behavior. I see a lock message warning per below just after boot/radio connection, but I don't see the sleep or rapid blue LED blink behavior again.

The FT8 sending does work now.

I'll keep an eye out, but at the moment it's not reproducible.

I (6875) wifi:<ba-add>idx:1 (ifx:0, 8a:51:dd:ca:18:c6), tid:7, ssn:2, winSize:64
V (6885) sc:webserve: trace: my_http_request_handler() with URI: /api/v1/frequency
V (6885) sc:hdl_freq: trace: handler_frequency_get()
I (6885) sc:kx_cmds.: locking radio
V (6895) sc:uartconn: received 9 bytes: RVR99.99;
I (6895) sc:uartconn: correct baud rate found: 38400
I (6895) sc:kx_cmds.: radio LOCKED --
V (6895) sc:kx_cmds.: trace: get_from_kx()
V (6895) sc:kx_cmds.: trace: uart_get_command()
I (6895) sc:kx_cmds.: -- radio UNLOCKED
I (6895) sc:setup...: radio connection established
I (6895) sc:kx_cmds.: locking radio
E (6915) sc:kx_cmds.: bad result from command 'FA;' after 22.539 ms, returned bytes=14, out_buff=;FA000...
I (6915) sc:kx_cmds.: Retrying...
V (6915) sc:kx_cmds.: trace: empty_kx_input_buffer()
E (6915) sc:kx_cmds.: RADIO PORT NOT LOCKED! (coding error in caller)
V (7915) sc:kx_cmds.: empty_kx_input_buffer() called, ate 1 bytes in 1000 ms with chars: ;
V (7915) sc:kx_cmds.: trace: uart_get_command()
E (7915) sc:kx_cmds.: RADIO PORT NOT LOCKED! (coding error in caller)
I (7935) sc:kx_cmds.: command 'FA;' returned 'FA00007062500' after 20.662 ms
I (7935) sc:kx_cmds.: kx command 'FA' returns 7062500
I (7935) sc:kx_cmds.: -- radio UNLOCKED
I (7935) sc:hdl_freq: returning frequency: 7062500
I (7935) sc:kx_cmds.: radio LOCKED --
V (7935) sc:kx_cmds.: trace: empty_kx_input_buffer()
V (7975) sc:webserve: trace: my_http_request_handler() with URI: /api/v1/rxBandwidth
V (7975) sc:hdl_mode: trace: handler_rxBandwidth_get()
V (7975) sc:hdl_mode: trace: get_radio_mode()
I (7975) sc:kx_cmds.: locking radio
V (8535) sc:kx_cmds.: empty_kx_input_buffer() called, ate 0 bytes in 600 ms with chars: 
jeffkowalski commented 4 months ago

weird that what should be initial radio connection established, and subsequent empty_kx_input_buffer messages would appear after handler_frequency_get() Maybe that's a some leftover thing running to make that get call before we've really started? I guess we might need some trivial rejection code sprinkled about? What do you think?

jeffkowalski commented 4 months ago

This reads somewhat line two threads outputting messages. I think the uart_connect isn't locking the uart.

Yes. I think uart_connect should really be part of the kx_commands. I'm thinking kx_commands are better as a class with public Radio::connect (i.e. uart-connect), Radio::get, and Radio::put methods.

poynting commented 4 months ago

Agree. Having a single radio interfacing class seems like a good idea.

I deleted my previous because I realized uart_connect was calling a lock at the very top, but something is fishy. A poorly timed request for connection status from the already running http server could potentially want to interrupt that process, but the lock should have prevented that.

It seems we need to have one class with a thread with sole access to the UART, and the kx class seems to be the right place. KX commands can then generate messages to send and handle reception. There still needs to be a lock on the whole radio interface since other classes may need to execute a whole set of commands that can't be interrupted. This kx class can also maintain a heartbeat to the radio so we can detect radio disconnect as well.

On Thu, Mar 7, 2024, 12:53 AM Jeff Kowalski @.***> wrote:

This reads somewhat line two threads outputting messages. I think the uart_connect isn't locking the uart.

Yes. I think uart_connect should really be part of the kx_commands. I'm thinking kx_commands are better as a class with public Radio::connect (i.e. uart-connect), Radio::get, and Radio::put methods.

— Reply to this email directly, view it on GitHub https://github.com/SOTAmat/SOTAcat/pull/26#issuecomment-1982999902, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABSNZUYYJNSPRYYUPSOPOCLYXATJRAVCNFSM6AAAAABEIPOZTGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBSHE4TSOJQGI . You are receiving this because you authored the thread.Message ID: @.***>

jeffkowalski commented 3 months ago

resolving this pull request now that sufficient locks are in place. @poynting - your initial work was just fine. The subsequent implementation also seems to address the problem.

Related, but not dependent: Note that #29 still aims to warn about insufficient locks at a higher level in the API, perhaps alerting more precisely about improper locking.