SignalK / signalk-server

An implementation of a Signal K central server for boats.
http://signalk.org
Apache License 2.0

CAN0 failure with RPI4b and Pican-M hat caused by signalk-rpi-monitor at low intervals #1626

Open naugehyde opened 1 year ago

naugehyde commented 1 year ago

I previously reported that CAN0 randomly stops streaming data (via canboatjs) from a Pican-M hat on an RPI4b connected to an N2K network (#1622).

Okay, so after a LOT of investigation I've been able to isolate the problem, and the bug is (mostly) reproducible.

The problem stems from the signalk-rpi-monitor reporting interval. When the monitor plugin is set to report at an interval of, say, 5s (the default is 60s), CAN0 is likely to fail after 20 minutes or so. The lower the interval, the more likely the failure will occur in short order.

I haven't isolated the root cause (my guess is that it has something to do with Node.js's intermittent garbage-collection cycle), but it's a fairly serious problem, as the only way to get one's instruments back is to restart the Signal K server process.

I'd bet that even at the default 60s interval the CAN bus will, in time, fail (silently). At sea, this could be problematic to say the least.

To recreate the problem you'll need a CAN0 connection of some kind on an RPI 3/4b (only tested on a 4b with a Pican-M) and signalk-rpi-monitor installed with its interval set to a sufficiently low value.

For the time being, boaters with signalk-rpi-monitor installed on a RPI with a CAN0 bus should, at the very least, take warning and definitely not set their rpi-monitor interval to a value less than the default.

As for a longer term solution, further investigation into canboatjs is warranted as the failure is not occurring at the system level (candumpjs/candump continue to function normally after signalk's CAN0 data connection fails).

tkurki commented 1 year ago

Whoa. Thanks for investigating and sharing. Created https://github.com/sberl/signalk-rpi-monitor/issues/9 with link here and posted link to this on Slack.

tkurki commented 1 year ago

Maybe we close the other issue and edit the title of this one to mention CAN and signalk-rpi-monitor?

naugehyde commented 1 year ago

I'll be interested to see if others can recreate the problem on different hardware and different canboat connections. I have a waveshare CAN hat rig that I'll try later today.

In the meantime, I'm setting the @canboat/canboatjs option noDataReceivedTimeout to an absurdly large value (60s) to see if the problem is mitigated.

I suspect the problem is only triggered by signalk-rpi-monitor (which of necessity calls spawn() repeatedly), but it may be the inevitable result of a garbage-collected runtime.

naugehyde commented 1 year ago

One stopgap against any plugin (not just rpi-monitor) that does whatever rpi-monitor is doing to bring CAN0 down is to set the "noDataReceivedTimeout" subOption for the connection in your settings.json.

e.g.: "pipedProviders": [ { "enabled": true, "id": "can0", "pipeElements": [ { "options": { "logging": true, "subOptions": { "noDataReceivedTimeout":2, "interface": "can0", "type": "canbus-canboatjs", "uniqueNumber": 1159732, "useCanName": true, "filtersEnabled": false, "filters": [], "overrideTimestamp": true }, "type": "NMEA2000" }, "type": "providers/simple" } ] },

mairas commented 1 year ago

@naugehyde how is the can0 interface configured? In /etc/network/interfaces or /etc/network/interfaces.d? Could you share the configuration?

mairas commented 1 year ago

My theory: the PICAN-M uses the MCP2515 as its CAN controller (just like most other Pi CAN devices out there). That IC has an RX queue of only 2 frames. Signalk-rpi-monitor spawns external processes to read its information. Even though this should be an async operation, I dunno, maybe it still causes some blocking in the Node.js main loop. If so, it could make the RX buffers overflow sometimes. I believe overflows cause the RX error counter to increase, and once REC goes above 127, the CAN controller enters the Error-Passive state, causing any transmissions to cease.

naugehyde commented 1 year ago

Thanks, that's interesting.

FYI rpi-monitor spawns system shells and executes system calls (mpstat etc.) to get its data. It's possible one or more of the commands is blocking the CAN IP device, but I doubt it, as CAN0 continues to function normally after canboatjs has hiccupped (a candump open in another terminal will stream continuously). It stands to reason that candump would reflect any blockage and timeout/retry in its stream.

(As I noted earlier, setting a noDataReceivedTimeout value in the CAN0 subOptions causes canboatjs to retry after hiccupping, and the problem is minimized as far as SK is concerned.)

But yeah, it remains a mystery why canboatjs is hiccupping at all. The Node.js event loop should not be blocked by spawned system calls. If the CAN device buffer is in fact overflowing, then canboatjs needs to handle that more gracefully. But that still doesn't answer the question of what rpi-monitor is doing to cause the problem in the first place!


mairas commented 1 year ago

@naugehyde could you please provide can0 configuration settings I requested above?

mairas commented 1 year ago

Note that IF the issue is due to the RX buffers overflowing, canboatjs has nothing to do with it and it can do nothing about it.

naugehyde commented 1 year ago

The config is the vanilla config prescribed by the manufacturer:

```
auto can0
iface can0 inet manual
    pre-up /sbin/ip link set can0 type can bitrate 250000
    up /sbin/ifconfig can0 up
    down /sbin/ifconfig can0 down
```

NB: I've recreated the issue on a rig with a different MCP2515 CAN hat (https://www.waveshare.com/rs485-can-hat.htm).

I'm not getting reports of buffer overflows from the driver. As I stated earlier, the HW continues to function normally (if candump/candumpjs are to be believed). Also, it's only canboatjs that's timing out and failing; the event loop continues to stream from other sources after canboatjs stops reporting CAN0 data. And FWIW, canboatjs, when noDataReceivedTimeout is set, will report receiving no data, then retry and function normally after a brief delay.


mairas commented 1 year ago

I recommend adding "restart-ms 100" to the pre-up line. That makes the kernel reset the CAN controller 100 ms after the controller enters the bus-off state. That might not apply in this case, because bus-off kills both RX and TX and you said RX still works, but it's a good idea in any case.
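Applied to the vanilla config you posted, the pre-up line would become something like this (a sketch; keep your own bitrate):

```
pre-up /sbin/ip link set can0 type can bitrate 250000 restart-ms 100
```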

OK, and I finally found the command I was looking for:

If you can get TX to stop again (the error state you reported), try the following command (before applying the above change):

```
ip -details -statistics link show can0
```

What does it report as the state? Normally it should be ERROR-ACTIVE. ERROR-PASSIVE means that the controller is in the passive state: TX is disabled but RX still works. And BUS-OFF is completely dead.

With this, we can decisively narrow the issue down to either the CAN controller and kernel driver, or userspace.

naugehyde commented 1 year ago

Excellent! Thank you. Will give this a go ASAP and let you know.


naugehyde commented 1 year ago

FYI the bug is easily recreated by setting rpi-monitor's interval to 1s. I'm not near my boat today and can only connect to it over 2G, so it may take me a day or so to get to testing. If you have access to an N2K network atop an RPI with an MCP2515 controller, I'd love to know what you discover.


naugehyde commented 1 year ago

Before CAN0 failure:

```
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE restart-ms 0
        bitrate 250000 sample-point 0.875
        tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
        mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
        clock 8000000
        re-started bus-errors arbit-lost error-warn error-pass bus-off
        0          0          0          0          0          0
    numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes     packets  errors  dropped  missed  mcast
        52248708  6531116  0       1233     0       0
    TX: bytes     packets  errors  dropped  carrier  collsns
        14640232  1830069  0       0        0        0
```

After:

```
4: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can promiscuity 0 minmtu 0 maxmtu 0
    can state ERROR-ACTIVE restart-ms 0
        bitrate 250000 sample-point 0.875
        tq 250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
        mcp251x: tseg1 3..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
        clock 8000000
        re-started bus-errors arbit-lost error-warn error-pass bus-off
        0          0          0          0          0          0
    numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    RX: bytes     packets  errors  dropped  missed  mcast
        52811163  6601425  0       1233     0       0
    TX: bytes     packets  errors  dropped  carrier  collsns
        14767240  1845950  0       0        0        0
```

No relevant change that I can see.

mairas commented 1 year ago

OK, the interface is fine both before and after. And it's not about dropped packets, because the dropped count didn't change and the error count remains zero. So it is a userspace issue!

naugehyde commented 1 year ago

Yes. Something is stalling the RX at the canboatjs level (socketcan functions normally throughout) and causing a timeout that fails silently. The offending plugin that triggered all of this (rpi-monitor) isn't making any blocking calls that I can see (Node's spawn() for one is documented as non-blocking), so I am mystified as to the root cause.

FWIW other users have reported canboatjs stream failures ( https://github.com/canboat/canboatjs/issues/196 and https://github.com/canboat/canboatjs/issues/193) but I can't speak to the details of their configurations.

My next test is to create a plugin that mimics rpi-monitor's spawned system calls and see if there's an offending system call at the root of this, or if it is in fact spawn() itself somehow messing things up.

In the meantime, SK users are advised to add the noDataReceivedTimeout option to canboatjs as I detailed earlier in this thread ( https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1722595149 ).


tkurki commented 1 year ago

@naugehyde 👏 for investigating this! Let's concentrate communications here for the time being, as we don't know what combination triggers the problem.

I think the best way forward would be to create as low-level a repro for this as we can, starting with just socketcan and spawn. Could you try something like this?
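Something along these lines (a rough sketch, assuming the socketcan npm package and Node's child_process.spawn; interface name, command, and intervals are placeholders to adjust):

```js
// Rough repro sketch: stream raw CAN frames via socketcan while
// repeatedly spawning short-lived shell commands, like rpi-monitor does.
const can = require('socketcan');
const { spawn } = require('child_process');

const channel = can.createRawChannel('can0', true);
let frames = 0;
let lastFrame = Date.now();

channel.addListener('onMessage', () => {
  frames++;
  lastFrame = Date.now();
});
channel.start();

// Mimic rpi-monitor: spawn a cheap system command on a fixed interval.
setInterval(() => {
  spawn('sh', ['-c', 'mpstat 1 1']).on('error', console.error);
}, 1000);

// If the socketcan stream stalls silently, the "ago" figure keeps growing
// here while candump in another terminal keeps streaming.
setInterval(() => {
  console.log(`frames=${frames}, last frame ${Date.now() - lastFrame} ms ago`);
}, 5000);
```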

naugehyde commented 1 year ago

Thanks for the code. Will put it to use soon.

If it's a resource issue owing to the overhead of frequent shell creation, Node.js's garbage-collection cycle could be running long enough for the CAN0 Node.js stream to programmatically time out. If this is true (I will test inside SK first), running the test outside of SK will change the running memory footprint, so I may have to adjust the V8 heap size to invoke the timeout.

mpvader commented 1 year ago

Hey all, possibly off topic, but did want to share here:

that Pican-M is using the MCP2515, which has only two RX buffers.

That puts quite the requirement on speedy processing by the main CPU.

When designing hardware on our end and needing an external CAN controller (as opposed to one inside a microcontroller, which almost always has more than just 2 RX buffers), we quite quickly discarded the MCP2515 as an option.

We worked on/with the MCP2518 or 2519 instead. Many more buffers. And in the case of N2K, the existence of fast packets especially adds to the requirement of having sufficient RX buffers.

Supported in Linux kernel 5.10 and up if I’m not mistaken.

In other words: on my boat, I'd try to keep it free of the MCP2515. Missing packets lead to the weirdest of issues, very hard to diagnose.

mairas commented 1 year ago

I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge. My boat server, based on a Raspberry Pi CM4 and Waveshare's isolated CAN HAT (with MCP2515), at the moment has been running for 8 days straight. CAN RX packet counter is at 82292123 (627.7 MiB) and the dropped packets counter is at 5227. That's one dropped packet every 2 minutes. Not nothing, but the packet loss rate is 5227/82292123, or 0.006%. Given the repetitive nature of N2K traffic, that's not a huge deal. But this is something where YMMV a lot.

naugehyde commented 1 year ago

I agree that the RX buffers on the MCP2515 may be insufficient for some scenarios. In a large harbor like mine (Marina del Rey), the N2K stream is flooded with AIS position updates. It doesn't explain why canboatjs fails silently, as detailed earlier in the thread, but a more robust CAN receiver could only help matters.

I would happily test an MCP2518 with Signalk if I could find a board for less than $20 that easily interfaces with an RPi.

Do you have any suggestions, @mpvader? I found this product but it's not clear from the docs how to interface it with an RPi.

mpvader commented 1 year ago

> I would happily test an MCP2518 with Signalk if I could find a board for less than $20 that easily interfaces with an RPi. Do you have any suggestions, @mpvader?

Sorry, I wouldn't know. I checked a bit and can't find any for that price. There are some under EUR 100. It looks like the low-cost makers have not switched over (yet).

> I don't argue about MCP2515 being less than stellar, but the impact isn't necessarily that huge.

Yes, I agree completely on the YMMV, and I can see that it will work in most cases. It will depend on the busyness of the CAN bus vs. CPU idle time. Note that not all CAN controllers have a proper implementation of error and drop counters; I don't remember the details around that, though. And while we're at it: the adapter/protocol you really don't want for a complex protocol like NMEA 2000 is SLCAN.

Btw, we're also looking at this from the view of having the network 100% reliable for changing configuration and so forth, not just repetitive data.

mpvader commented 1 year ago

To add: an implementation that we also tested extensively, and found to be very reliable, is the candleLight project, which uses the gs_usb driver between the CAN CPU and the Linux CPU.

sailoog commented 1 year ago

Hi, our MacArthur HAT v1.2 is about to be released as stable and it has the MCP2518 tested and working: https://github.com/OpenMarine/MacArthur-HAT/tree/v1.1

All the prototypes are installed and working, and I only have one on hand now, but we will soon go into production. Maybe we can lend you a prototype temporarily. We are in Barcelona and Washington; where are you?

naugehyde commented 1 year ago

I'd happily test a prototype if I had one.

I am in Los Angeles as it happens.


astuder commented 1 year ago

> I'd happily test a prototype if I had one. I am in Los Angeles as it happens.

I can hook you up with a MacArthur HAT with the MCP2518. Drop me an email at adrian at adrianstuder dot com.

tkurki commented 1 year ago

@naugehyde any news on this? Have you had the chance to test the minimal repro we talked about above?

naugehyde commented 1 year ago

@tkurki Currently testing with an MCP2518-equipped CAN hat, courtesy of @astuder. The MCP2518 has much more buffer space than the MCP2515 that's on the PICAN-M and the Waveshare CAN hat.

Just got the rig up and running today. I was not able to recreate the issue with the MCP2518 hat, but it's just day one and I only had time for half an hour's testing.

I'll see if I can recreate it with the MCP2518, then get back to testing on an MCP2515-equipped hat.

It will be very interesting if I cannot recreate the bug on an MCP2518-equipped hat, as the common denominator in my tests on two different boards (an RPI 4b and a Le Potato) with two different CAN hats was the MCP2515 controller.

naugehyde commented 1 year ago

@tkurki Should I test on 2.3.0 or 2.2.0? I note with interest that 2.3 has an upgraded canboatjs.

tkurki commented 1 year ago

Latest makes most sense I guess.

naugehyde commented 1 year ago

@tkurki Tested the code you supplied (with modifications to increase the spawn rate) on an RPI 4b running 64-bit Bullseye and, alas, after about 12 minutes the SOCKETCAN connection silently failed. It's worth noting that the CAN connection running on the same unit in a separate Node process (with RPI-Monitor running at 30s intervals) at the same time did NOT fail.

I also tested the CAN connection (in SK) using a prototype MacArthur HAT that incorporates the MCP2518 chip (unlike the MCP2515 on the Pican-M hat, it has CAN-FD capability and generous buffer space), and that rig failed much like the Pican-M hat and the Waveshare CAN hat did in earlier tests.

I recommend that new releases provide a default of 2s for the noDataReceivedTimeout subOption for CAN connections (see: this).
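For reference, the relevant fragment of the connection's subOptions (as in the fuller settings.json example earlier in this thread) would be:

```json
"subOptions": {
  "type": "canbus-canboatjs",
  "interface": "can0",
  "noDataReceivedTimeout": 2
}
```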

Still not sure why this is happening. Still not sure what (if anything) besides repeated spawn() calls will cause a CAN0 connection to silently fail.

tkurki commented 1 year ago

@naugehyde just to be crystal clear:

What I am after is a reproducible bug report at https://github.com/sebi2k1/node-can. If I understand your comment correctly, this actually has nothing to do with SK Server or canboatjs (which uses socketcan).

Naturally you are very welcome to file it there yourself and link back here.

tkurki commented 1 year ago

Oh, forgot to add: thanks for doing the legwork on this!

naugehyde commented 1 year ago

@tkurki my bad! Tested with SOCKETCAN, not canboat. Used your code; I just added additional intervals and removed the reconnect in the error handler (never reached, by the way).

OS: Linux piriot 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux

Node version: v18.17.0

NB: Tested overnight and wrote up comment first thing prior to any caffeine.

naugehyde commented 1 year ago

@tkurki Just tested Node 20. SOCKETCAN failed silently after about half an hour on Node 20.6.1.

NB: I am testing with AIS on in a busy harbor.

naugehyde commented 1 year ago

@tkurki Check out: https://github.com/nodejs/node/issues/14917

I wonder if the issue above can be reproduced in more recent Node versions and, if so, whether it is related to the CAN failure we're seeing.

naugehyde commented 1 year ago

@tkurki any luck recreating the issue on your end?

humppafreak commented 1 year ago

Looks like I'm running into the same issue, or at least a very similar one:

  • rpi4, 8GB, running raspbian 11.7 bullseye 64bit
  • hat-stack consists of waveshare 2ch can hat, waveshare 2ch rs485 hat, waveshare GNSS hat (with the interrupt pin snipped off), and @mairas' sailor-hat v2
  • signalk v. 2.2.0
  • signalk-rpi-monitor plugin is installed, but not enabled

At irregular intervals (anywhere from a few minutes to >24h), the can0 interface simply "goes silent" without any hint of an error in any logfile. RX and TX packet counters in /proc/net/dev stop increasing; it's like someone simply pulled the plug on the interface. In most cases, the can0 interface can be forced to work again with a simple ifconfig can0 down; sleep 5; ifconfig can0 up. Without the sleep 5 in the restart commands, I get SIOCSIFFLAGS: Connection timed out on stdout and mcp251x spi0.0: MCP251x didn't enter in config mode in dmesg.

My current workaround is a simple shell script that monitors /proc/net/dev for changes in the RX packet count and does the ifconfig down/up dance described above.
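Roughly along these lines (a sketch, not the exact script; it reads the counter from sysfs instead of parsing /proc/net/dev, and the poll interval is arbitrary):

```sh
#!/bin/sh
# Bounce can0 whenever the RX packet counter stops moving.
last=-1
while true; do
  rx=$(cat /sys/class/net/can0/statistics/rx_packets)
  if [ "$rx" = "$last" ]; then
    # The sleep avoids the SIOCSIFFLAGS / config-mode errors noted above.
    ifconfig can0 down
    sleep 5
    ifconfig can0 up
  fi
  last=$rx
  sleep 30
done
```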

naugehyde commented 1 year ago

Fascinating, but not entirely parallel to the problem I isolated. Notably, the issue I identified only affected the current Node.js process and not the underlying device. IOW, I could bring down the CAN0 connection in Signalk with repeated spawn() calls while another process's CAN0 connection continued unabated. It sounds like that would not be the case for you. If I misunderstood, please let me know.

Also, your hat stack sounds like it has multiple CAN interfaces (the Waveshare CAN hat and the Sailor Hat). Is that correct? If so, why? Just curious...


humppafreak commented 1 year ago

Ah, okay. In that case, I might have stumbled upon a completely different CAN issue... which, interestingly enough, seems to have disappeared completely since I appended trace_event=spi:* to my /boot/cmdline.txt in an attempt to debug what's going on on the SPI bus.

And to your other question: the Sailor-HAT v2 does not have an N2K interface (the v1 did). The SH v2 is more or less just a "smart power supply" with a wide input voltage range (9-32V), a stabilized 5V output, and a few beefy supercaps to buy enough time for a clean shutdown on power failure / switch-off.

JonyIvy commented 1 year ago

I've had this happen several times last summer: no more N2K data shown in SK / KIP. I didn't investigate further, just rebooted the machine, so I can't tell which of the errors I had.

CM4, Waveshare CM4 IO BASE B, Waveshare isolated dual CAN HAT (MCP2515). signalk-rpi-monitor is running with 5s refresh.

Unfortunately I cannot test anything anytime soon, RPi sitting on my desk, boat far away.

naugehyde commented 1 year ago

So it isn't just me!

Yeah, the most reliable solution is earlier in this thread: just add a no-data-received timeout option. See: https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1722595149

If you ask me, a valid timeout value should be the default for all CAN socket connections.


JonyIvy commented 1 year ago

I already have noDataReceivedTimeout: 4 in my settings.json. So my problem might be the same as humppafreak's.

humppafreak commented 1 year ago

> I already have noDataReceivedTimeout: 4 in my settings.json. So my problem might be the same as humppafreak's.

in that case, you might want to try enabling SPI event tracing as described in my last comment.

tkurki commented 10 months ago

@naugehyde I believe you had some new information about this?

naugehyde commented 10 months ago

@tkurki I do. The testing process is very slow (it requires an NMEA network, and the fault can take up to 8 hours to reproduce), so please forgive any lapses in communication on my part.

What I know now: at least for my NMEA network, the fault occurs when the RPI-Monitor (or equivalent) refresh is set to 5s, 2.5s, 1.25s, 0.625s, etc. (5s halved repeatedly). Remarkably, the fault does NOT occur at values in between those. I ran the test at 0.411s, for example, which equated to maybe 50 deltas/s, and after at least eight hours CAN0 stayed up. I have no idea why this is the case. Very likely a timing issue of some kind, and very likely a problem deep inside Node.js.

The fault occurs in the latest build, with the latest SK and RPI system software, after considerably more elapsed time than on earlier builds and systems, but it remains catastrophic: CAN0 dies without warning.

It's unlikely to occur in most setups and may not occur at all on other NMEA networks. (I have no way of knowing for sure: I have taken devices off the network, lowering the number of deltas, and the fault is still reproducible, but I don't have access to other networks.)

The simple solution is to put a no-data-received timeout into settings.json, as illustrated here: https://github.com/SignalK/signalk-server/issues/1626#issuecomment-1722595149. As you know, a no-data timeout will cause canboatjs to pause for the indicated amount of time when no data is received and then restart the CAN0 connection. From my testing, this will successfully keep the connection up.

It may be an obscure bug, but it is real, and it is catastrophic, as the connection dies silently and an SK restart is required to recover.

I believe it's wise for the next SK version to ship with a noDataReceivedTimeout default value >0 for CAN0 connections in settings.json.