esprfid / esp-rfid

ESP8266 RFID (RC522, PN532, Wiegand, RDM6300) Access Control system featuring WebSocket, JSON, NTP Client, Javascript, SPIFFS
MIT License

Still cannot restore users with over 100 #557

Closed windy54 closed 10 months ago

windy54 commented 1 year ago

I have re-opened this issue, see 448.

Background: we currently use the software with an RC522 reader and have 120 Hackspace members. A member has just donated several Wiegand readers and I have been investigating what we have to do. For information, it is a 32-bit Wiegand reader and I have had to modify the Wiegand library. The RC522 reader outputs the UID in little-endian format, the Wiegand reader in big-endian format. So I have created a script to read in the existing database, convert the UIDs and write it back out. When I try to restore it using the web interface, the system crashes. I have not tried it yet, but I have been able to update this number of users in the past over MQTT.
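For illustration, the conversion amounts to reversing the byte order of each UID. This is a minimal sketch, not the script mentioned above: it assumes UIDs are stored as hex strings made of whole bytes (two hex digits per byte), and the function name `swapUidByteOrder` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Reverse the byte order of a hex-encoded UID (little endian <-> big endian).
// Assumption: two hex digits per byte. Odd-length input is rejected rather
// than padded, because the correct padding rule depends on how it was stored.
std::string swapUidByteOrder(const std::string& uidHex) {
    if (uidHex.size() % 2 != 0) {
        throw std::invalid_argument("UID must have an even number of hex digits");
    }
    std::string swapped;
    swapped.reserve(uidHex.size());
    // Walk the input from the last byte to the first, copying two digits at a time.
    for (std::size_t i = uidHex.size(); i > 0; i -= 2) {
        swapped.append(uidHex, i - 2, 2);
    }
    return swapped;
}
```

Under that assumption, `swapUidByteOrder("a1b2c3d4")` gives `"d4c3b2a1"`. UIDs with an odd number of digits need a decision about zero padding that this sketch deliberately does not make.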

So how to proceed?

I have investigated this in the past and there seems to be some handshaking going on, i.e. a user is read from the file and the next one is transmitted only when a response is received.

matjack1 commented 1 year ago

@windy54 are you using the backup/restore functionality? I've never tried it, I'll try with a test file of around 100 users and I'll let you know

windy54 commented 1 year ago

Yes, I have tried it from a Raspberry Pi and a Mac, using Chromium, Chrome and Safari.

Cheers Steve Gale


windy54 commented 1 year ago

So I am assuming this has not been fixed in the latest demo build, apologies if it has.

I am investigating this myself and after loading (say) 10 users an exception (3) is generated, stack overflow.

`[ DEBUG ] userfile received {"command":"userfile","uid":"f479340","user":"Phillip Hayward No.45","acctype":1,"validuntil":2145916800}
[ DEBUG ] userfile saved

Exception (3): epc1=0x40101199 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4000f230 depc=0x00000000`

I will try re-building with an increased stack size, once I find out where it is set :)

Steve

matjack1 commented 1 year ago

@windy54 I'm working on a PR that might fix this! More details here: https://github.com/esprfid/esp-rfid/discussions/572

Please stay tuned, I might be able to publish it this week :)

matjack1 commented 1 year ago

@windy54 I think this PR: https://github.com/esprfid/esp-rfid/pull/577 should fix your issue.

I'm going to close this, but please reopen if it's not fixed. Thank you very much!

windy54 commented 1 year ago

Hi Matt,

I will download the latest and try it out.

Cheers Steve Gale


windy54 commented 1 year ago

Hi Matt, just downloaded the demo binary and it worked first time, 150 users restored.

Thanks a lot

Steve


matjack1 commented 1 year ago

Great! Thank you for the feedback :)

windy54 commented 1 year ago

I might have been too quick, tried to show members at the Hackspace and it failed. Same symptoms as before, stopped after a few members and then reset. Could not monitor the serial to see what was happening, so I will do that and get back to you. Cheers

matjack1 commented 1 year ago

ahaha, I was surprised that it went so smoothly :)

Yes, if you can get me some logs using the debug build and the stack trace, I'll check.

windy54 commented 1 year ago

Hi Matteo,

Another error log.

This time I have downloaded the source and re-built it. Edited config.esp to set config.pinrequested = false.

I logged in and clicked on users and the system crashed and re-booted.

Captured the following:

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1392, room 16
tail 0
chksum 0xd0
csum 0xd0
v3d128e5c
~ld

[ INFO ] ESP RFID v2.0.0-dev
Flash real id: 001640E0
Flash real size: 4194304

Flash ide size: 4194304
Flash ide speed: 40000000
Flash ide mode: DIO
Flash Chip configuration ok.

[ INFO ] Config file found
[ INFO ] Trying to setup RFID Hardware
[ INFO ] RFID SS_PIN: 15 and Gain Factor: 32
[ INFO ] MFRC522 Version: 0x12 (unknown)
[ INFO ] Configuration done.
[ INFO ] ESP-RFID is running in AP Mode
[ INFO ] Configuring access point... Ready
[ INFO ] AP IP address: 192.168.4.1
[ INFO ] AP SSID: esp-rfid
[ INFO ] sys | System setup completed, running |
[ INFO ] door | Door Closed |

It appears to me that the watchdog is re-setting.

Steve


matjack1 commented 1 year ago

Hey Steve, thank you for helping out.

Just to double check: you are using the code from PR #577, not dev, right?

Then, can you please paste the stack trace and what you see in the logs before the reboot?

From what you are sharing, it looks like a watchdog reset; maybe I need to try with more users to replicate what you see. I couldn't hit the watchdog anymore with my change.

I'll try with more users and I'll report back!

windy54 commented 1 year ago

I will confirm I am using 557 and get back to you; it might be the weekend or early next week. For the bin file I definitely used 557; I am not sure where I loaded the source from.

Cheers Steve Gale


windy54 commented 1 year ago

I have downloaded the source code from the dev branch fix_websockets; I could not see any code under 557, only the binaries which I originally used.

` ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1392, room 16
tail 0
chksum 0xd0
csum 0xd0
v3d128e5c
~ld

[ INFO ] ESP RFID v2.0.0-dev
Flash real id: 001640E0
Flash real size: 4194304

Flash ide size: 4194304
Flash ide speed: 40000000
Flash ide mode: DIO
Flash Chip configuration ok. `

So I am getting the same error message, which looks like the watchdog.

If I can spot where this is set up I will try changing it.

Steve

matjack1 commented 1 year ago

hey @windy54 sorry, but just to be extra sure, I've increased the version number in my test. Can you please get the build from here: https://github.com/esprfid/esp-rfid/actions/runs/4028030821 and try again? It should show ESP RFID v2.0.0-dev.1 as version number.

About the watchdog: it's the system watchdog that gets triggered, and you cannot do anything about it. The problem with the websocket is that it runs in a callback, meaning that it's outside of the main loop. This causes problems when the watchdog fires, because it can mess with memory, and if the code is outside of the loop there's no guarantee about what happens. That's why it sometimes breaks and the stack trace is always different.

My solution of moving the logic from the callback to the main loop should fix this problem; I think it's the only way to solve the issue properly. If there's still a problem, it might be something else that I haven't caught yet.
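To make the idea concrete, here is a rough sketch of deferring work from a callback to the main loop. It is a host-side illustration in plain C++, not the code from the PR: the names `onWsMessage`, `processPending` and `handleCommand` are hypothetical, and the queue limit of 8 is an arbitrary assumption (on the ESP8266 some cap is needed because heap is scarce).

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Messages received in the callback wait here until the main loop picks them up.
static std::deque<std::string> pendingMessages;
// Stand-in for real work (parsing JSON, writing a user file); records what ran.
static std::vector<std::string> handled;

// Hypothetical handler: all heavy processing happens here, in loop context.
void handleCommand(const std::string& msg) {
    handled.push_back(msg);
}

// Called from the async callback: only enqueue, never do heavy work here.
// Returns false when the queue is full so the caller can reject the message.
bool onWsMessage(const std::string& msg, std::size_t maxQueued = 8) {
    if (pendingMessages.size() >= maxQueued) {
        return false;
    }
    pendingMessages.push_back(msg);
    return true;
}

// Called once per iteration of the main loop: handle at most one message,
// so each pass stays short and control returns to the system promptly.
void processPending() {
    if (pendingMessages.empty()) {
        return;
    }
    std::string msg = pendingMessages.front();
    pendingMessages.pop_front();
    handleCommand(msg);
}
```

The callback stays cheap and predictable, and everything that touches the filesystem or allocates memory runs from the loop, one message per pass.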

Let me know if this build still breaks and please keep sharing the logs as you've done before, it's very helpful. Possibly share also the full stacktrace that you get when it breaks. Now I'm going to test with 150+ users and I'll report how it goes.

matjack1 commented 1 year ago

actually, wait, I've been able to reproduce! Thank you :) I'll try to understand more about the error and hopefully fix it :)

windy54 commented 1 year ago

I will leave you to it then 🙂

Cheers Steve Gale


matjack1 commented 1 year ago

(I've linked the wrong build before, the right one is this one: https://github.com/esprfid/esp-rfid/actions/runs/4028030821)

I think now it's not the watchdog anymore, it's something else. Which is good news, but it means it might take a while.

matjack1 commented 1 year ago

hey @windy54 I've changed the pagination system so that it now fetches one page at a time when loading the users (still to be done for logs). Can you please check if that works? Here's the build: https://github.com/esprfid/esp-rfid/actions/runs/4033629622

I have ideas to make it more robust if you click too quickly and queue too many requests, but for now if you use it normally it should work... Hopefully! It works well for me with more users, but let me know if that works for you too!
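As a rough illustration of what fetching one page at a time involves, the core of it is computing which slice of the user list a request covers. This is a minimal sketch only; the struct and function names are hypothetical and it does not show the actual message format used by esp-rfid.

```cpp
#include <algorithm>
#include <cstddef>

// Half-open range [first, last) of user indexes that belong to one page.
struct PageRange {
    std::size_t first;
    std::size_t last;
};

// Number of pages needed for totalUsers entries; returns 0 if pageSize is 0.
std::size_t pageCount(std::size_t totalUsers, std::size_t pageSize) {
    if (pageSize == 0) {
        return 0;
    }
    return (totalUsers + pageSize - 1) / pageSize;  // ceiling division
}

// Range for a 1-based page number; an out-of-range page yields an empty range.
PageRange pageRange(std::size_t page, std::size_t totalUsers, std::size_t pageSize) {
    if (page == 0 || pageSize == 0) {
        return {0, 0};
    }
    std::size_t first = std::min((page - 1) * pageSize, totalUsers);
    std::size_t last = std::min(first + pageSize, totalUsers);
    return {first, last};
}
```

With 140 users and 10 per page, `pageCount` returns 14 and `pageRange(14, 140, 10)` covers indexes 130 to 139, so only ten records have to be serialised per request.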

windy54 commented 1 year ago

I will try as soon as I can.

Cheers Steve Gale


windy54 commented 1 year ago

Hi Matt,

First time it worked, second time it has locked up.

So I did the following:

1) used MQTT to delete all users

2) restored a file with 140 users and it worked.

3) used MQTT to delete all users

4) tried to restore the same file; I am logging the serial output as well.

5) I have currently got the "please wait while restoring user data" box displayed.

6) On the serial output it has displayed the first line of the userdata file and stopped

7) [INFO] Mqtt Publish messages are being displayed, so far two have been output with an uptime of 546 and 727

8) I am connected to an RC522 reader and it read a card, i.e. serial output "[INFO] PICC's UID .... "

9) just had another MQTT message.

10) I have clicked on the restoring data window and it has closed.

11) When I select Users it displays the one record it read in.

12) Interestingly, I tried reading the same card and got no output on the serial line

13) tried restoring user data again and it appears to have read in the next record (just checked the file and this is the case)

14) still is not reading a card

15) every time I try to restore the userfile it reads the next record.

Just so you know, I am trying this on my Raspberry Pi, running 64-bit Raspberry Pi OS, from a Chromium browser.

Actually I will just try from my Mac before sending this.

Tried this on my Mac and the same problem, one record at a time, using Google Chrome.

Also I have tried using a "normal" window and an incognito window.

I think I have covered everything I can think of; if there are any more tests you want me to do, let me know.

Steve


matjack1 commented 1 year ago

thank you @windy54 I can reproduce it some of the time :( I'll get back here as soon as I have news!

matjack1 commented 1 year ago

unfortunately I have bad news for now :( From some digging I'm pretty sure that the problem here is in the websocket implementation of ESPAsyncWebServer, a dependency of this project that has some instability issues with a somewhat heavy usage of websockets.

I've managed to import a long list of users by adding some delay between one socket message and another, but it's really annoying :( And still is not a real solution as sometimes it breaks doing something else.

Having said that, I don't know if there's a real solution with the current stack, we can mitigate the problem as much as possible, but the crashes and resets are still going to happen when using the web UI.

On a positive note, MQTT seems pretty stable instead, so I would recommend moving as much logic as possible to MQTT in order to minimise the restarts.

I'm going to ship some mitigations here and there in the near future, but still I don't think I'm going to change the library anytime soon, and the development there seems stalled.

windy54 commented 1 year ago

Hi Matteo,

Thanks for the update, I was starting to get the MQTT side of things sorted.

Cheers Steve Gale


Renstec commented 1 year ago

Hello,

Thanks for the cool project. I also encountered the user file restore issue. To further isolate the problem I switched to a fork of ESPAsyncWebServer; this fork contains many improvements.

After the switch I immediately noticed that the free heap increased from 31,032 bytes to 38,984 bytes. In addition, the stability of sending the user list to the WebSocket client improved. Despite these improvements, however, I encountered some limitations when trying to restore many users. Occasionally, messages sent from the ESP were not received by the socket client, causing the restore process to stall. In addition, the ESP would occasionally experience crashes, as before the change.

To solve this problem (for the user restore), I implemented a ticker that sends the WebSocket messages to retrieve the next entry outside of the asynchronous context. With these changes, I was able to successfully restore about 200 users.

tickerGetNextUserEntry.once_ms_scheduled(5,[]() {       
    ws.textAll("{\"command\":\"result\",\"resultof\":\"userfile\",\"result\": true}");
});

Please note that if you want to switch to the fork, you will need to make changes to the SPIFFSEditor.cpp file. You have to comment out lines 10 and 12 to get it to compile.

//#ifdef ESP32 
 #define fullName(x) name(x)
//#endif

Stability has improved, but crashes do still occur. I hope this info helps.

Best regards Renstec

matjack1 commented 1 year ago

hey @Renstec thank you very much for the feedback. I've tried implementing your changes, which improved a bit, and together with my latest changes here: https://github.com/esprfid/esp-rfid/pull/577 I think I'm happy with how it works.

Now if the ESP breaks, the browser waits 5 seconds and then tries again, sending the last message. This should help with long imports.
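The retry behaviour lives in the browser, but the logic can be shown in isolation. The following is a sketch in plain C++ with simulated time, not the project's actual code: the class name, method names and the way acknowledgements arrive are all assumptions for illustration, and records are assumed to be non-empty strings. Only the 5 second wait comes from the description above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sends records one at a time and resends the last one if no acknowledgement
// arrives within timeoutMs. Time is passed in so the logic can be tested.
class RestoreSender {
public:
    RestoreSender(std::vector<std::string> records, unsigned long timeoutMs)
        : records_(std::move(records)), timeoutMs_(timeoutMs) {}

    // Call periodically with the current time. Returns the record to transmit
    // now, or an empty string when nothing needs sending.
    std::string poll(unsigned long nowMs) {
        if (done()) return "";
        if (!waiting_) {                        // nothing in flight: send next
            waiting_ = true;
            sentAtMs_ = nowMs;
            return records_[next_];
        }
        if (nowMs - sentAtMs_ >= timeoutMs_) {  // no ack in time: resend
            sentAtMs_ = nowMs;
            ++retries_;
            return records_[next_];
        }
        return "";                              // still waiting for the ack
    }

    // Call when the device confirms the last record was saved.
    void onAck() {
        if (waiting_) {
            waiting_ = false;
            ++next_;
        }
    }

    bool done() const { return next_ >= records_.size(); }
    std::size_t retries() const { return retries_; }

private:
    std::vector<std::string> records_;
    unsigned long timeoutMs_;
    std::size_t next_ = 0;
    std::size_t retries_ = 0;
    unsigned long sentAtMs_ = 0;
    bool waiting_ = false;
};
```

Constructed with a timeout of 5000, the sender transmits a record, stays quiet while waiting, and repeats the same record once five seconds pass without an acknowledgement; the next record goes out only after `onAck()`.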

@windy54 I'm not sure if you still care about this project, but if you do and if you want to give this a try it would be very helpful! :)

Also, I've changed how the users table works. Now it fetches only one page at a time, not the full list of users, making the table a lot faster if you have a lot of users. Try that as well; it's a bit hacky, but it should work well enough.

windy54 commented 1 year ago

Hi,

Yes, I am interested. I need to try it out; other projects have taken over at the moment.

Cheers Steve Gale


matjack1 commented 1 year ago

Excellent! No rush :)

I'm trying to release the V2 by mid-September after which I'm going to only focus on bug-fixing for a while.

This was a pretty significant effort, on which I plan to only do minor improvements if necessary. Unfortunately I think I cannot do much better with what we have at the moment :(

Renstec commented 1 year ago

Hello,

I wanted to make a suggestion about the challenges we are facing with WebSockets stability. Instead of constantly trying to work around the WebSocket server errors, have you thought about switching to Server-Sent Events (SSE) as well as to the fetch API?

Using SSE and fetch could potentially provide a more reliable solution to completely avoid the crashes caused by WebSocket communication.

It might be worth investigating this switch further.
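For reference, an SSE message is plain text: an optional `event:` line, one `data:` line per line of payload, and a blank line to end the message. A small sketch of a formatter follows; the function name `formatSseMessage` and the event name used in the example are hypothetical, and nothing here is tied to a particular web server library.

```cpp
#include <sstream>
#include <string>

// Format one Server-Sent Events message: optional "event:" line, one "data:"
// line per line of payload, and a blank line to terminate the message.
std::string formatSseMessage(const std::string& event, const std::string& data) {
    std::string out;
    if (!event.empty()) {
        out += "event: " + event + "\n";
    }
    std::istringstream lines(data);
    std::string line;
    bool wroteData = false;
    while (std::getline(lines, line)) {
        out += "data: " + line + "\n";
        wroteData = true;
    }
    if (!wroteData) {
        out += "data:\n";  // empty payload still needs a data field
    }
    out += "\n";
    return out;
}
```

For example, `formatSseMessage("userfile", "{\"result\":true}")` produces an `event: userfile` line, a `data: {"result":true}` line and a terminating blank line. The browser would read these with `EventSource` and send commands back with `fetch`.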

Thanks Renstec

matjack1 commented 1 year ago

Hey @Renstec if I had to build this from scratch, for sure I would not use websockets for everything. Moreover I would not use EspAsyncWebServer in general, as the problem is with this library simply breaking under moderate usage. The less free memory you have the easier it breaks, so that's why it started becoming a bigger problem recently after having added more functionality.

If my latest PR: https://github.com/esprfid/esp-rfid/pull/577 works well enough, I'm going to stop there and issue only bugfixes for this project after having released V2.

If the release goes well and there is some interest, I'm willing to spend some time re-implementing everything for ESP32, since this project is not worth porting. Too much work and too many existing issues that are difficult to solve.

If you can test the new PR it would be great! Thank you :)

matjack1 commented 10 months ago

Hey @windy54 I have merged into dev my work to improve stability for the websockets.

It's a bit better, not massively, but I think it's better than before.

In any case, use MQTT to import/export users, it's far more stable and I think you end up wasting less time.

I'm closing this for now as I think there's not much else that I can do with the current set of libraries and with ESP8266 :)

windy54 commented 10 months ago

Thanks, I have still got to test the latest releases.

Our system has been stable for a while, I do not know why, until last week when I had trouble adding new users.

I will get my MQTT system set up and test it.

Cheers Steve Gale
