Open helpmyRF24isntworking opened 1 month ago
It could also be that the processes were killed silently or simply not resumed after they yielded; however, this does not happen at any other spot that yields or uses os.pullEvent except for those loops.
CC does not have a concept of processes, so that's not likely what is happening. But it does look like the event queue might be getting overloaded and timer events discarded, leading to rednet.lookup hanging forever? It would have a similar effect on gps.locate if the world had only 4 hosts and one of the responses got discarded?
rednet.host is most likely just an echo effect of rednet.lookup, because it performs a lookup when hosting (to ensure it's not duplicating an existing hosted name).
Though I am highly suspicious of this bluenet you are showcasing here. Have any of those issues happened on plain rednet?
EDIT: I found your bluenet code; at a glance it does not have anything that would cause the event queue to be overloaded by itself. It might still be natural event queue overload, which would have symptoms like you described.
If you are flooding the event queue, getting the turtles to randomly delay when they start might help.
Yep, if you have a massive amount of modem traffic and rednet hosting in the world, queue overflow is possible. If you want to control a swarm of computers, you can try using a websocket, where you can compress multiple messages into a single event.
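For reference, roughly how that could look in CC: Tweaked, assuming some external relay server that forwards frames between computers (the ws://localhost:8080/relay address is just a placeholder):

local ws, err = http.websocket("ws://localhost:8080/relay") -- placeholder relay address
if not ws then
    error("websocket failed: " .. tostring(err))
end

-- pack several outgoing messages into one websocket frame
local batch = {}
for i = 1, 10 do
    batch[#batch + 1] = { target = i, payload = "hello " .. i }
end
ws.send(textutils.serialiseJSON(batch))

-- one incoming frame can likewise carry many messages, so a burst of reports
-- arrives as one websocket_message event instead of many modem_message events
local frame = ws.receive(5) -- wait up to 5 seconds
if frame then
    for _, msg in ipairs(textutils.unserialiseJSON(frame)) do
        print(msg.target, msg.payload)
    end
end
ws.close()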
CC does not have a concept of processes, so that's not likely what is happening. But it does look like the event queue might be getting overloaded and timer events discarded, leading to rednet.lookup hanging forever? It would have a similar effect on gps.locate if the world had only 4 hosts and one of the responses got discarded?
rednet.host is most likely just an echo effect of rednet.lookup, because it performs a lookup when hosting (to ensure it's not duplicating an existing hosted name).
~Though I am highly suspicious of this bluenet you are showcasing here. Have any of those issues happened on plain rednet?~ EDIT: I found your bluenet code; at a glance it does not have anything that would cause the event queue to be overloaded by itself. It might still be natural event queue overload, which would have symptoms like you described. @Wojbie
Yeah, I just dumped the folder from my host turtle, nothing pretty. The main logic for bluenet is in classBluenetNode. bluenet.lua itself is basically just for opening and closing the channels as a global API. Lookup and host are the original rednet implementation. The actual messaging is handled by the nodes to reduce function calls to the global API.
I have 60+ turtles in my test world, but I made sure that startup happens in steps, no delay though. During startup no new events/modem_messages are queued without the host being online in the first place, except for the DNS/GPS messages of course. Those turtles fail before the host even boots.
If you are flooding the event queue, getting the turtles to randomly delay when they start might help. @Lupus590
True, but also quite annoying. I don't know about the limitations of the queue, but I can't imagine it being so low that not even 60 computers at startup are supported. It also only happens during lag, which in my mind only slows down the code and doesn't actually increase the number of events being queued dramatically.
The limit of the queue is 256, however this limit is per computer, which means you should be aware of it if you are using parallel to queue tasks at the same time. If you only have 60 computers at a time, that means you would need about 4 parallel tasks each queueing an event at the same time to hit it. A good practice is to try to optimize multiple sleep(0) calls into a single global tick event.
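A rough sketch of that idea, assuming the tasks run under parallel (the "global_tick" event name is made up for the example): one coroutine owns the single timer and everyone else waits for the shared event instead of each calling sleep(0):

-- one coroutine owns the only timer and turns it into a shared event
local function tickSource()
    while true do
        local timer = os.startTimer(0) -- fires on the next tick
        repeat
            local _, id = os.pullEvent("timer")
        until id == timer
        os.queueEvent("global_tick")   -- made-up event name
    end
end

-- workers yield on the shared event instead of creating their own timers
local function worker()
    while true do
        os.pullEvent("global_tick")
        -- do one tick's worth of work here
    end
end

parallel.waitForAll(tickSource, worker, worker)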
Yep, if you have a massive amount of modem traffic and rednet hosting in the world, queue overflow is possible. If you want to control a swarm of computers, you can try using a websocket, where you can compress multiple messages into a single event. @zyxkad
Its "only" 60 Turtles, so i wouldnt say thats massive but a considerable amount. As mentioned in my previous reply idk enough about the java side of the queue implementation to give any usable reply to this.
Is it possible to use websockets for direct turtle communication without using HTTP? In classBluenetNode I already implemented my own in-game websocket via the "stream", which just does a basic websocket handshake and combines multiple messages into one singular stream message.
The limit of the queue is 256, however this limit is per computer, which means you should be aware of it if you are using parallel to queue tasks at the same time. If you only have 60 computers at a time, that means you would need about 4 parallel tasks each queueing an event at the same time to hit it. A good practice is to try to optimize multiple sleep(0) calls into a single global tick event. @zyxkad
Oh, 256 is lower than expected. That might explain it, but let's say I have 256 turtles, wouldn't that automatically make rednet.lookup fail each time they start? My goal for this project was to support as many turtles as possible, but for my test world I stuck to just 60.
I have some sleep(0) statements in the other code like pathfinding etc., not during startup though. If I remember correctly, sleep also queues a new timer event, which is why you recommend using a global tick? Could I alternatively just do coroutine.yield if I want to avoid the timeout?
coroutine.yield cannot be used to replace sleep(0), because it may not wait for a single tick, or it may wait forever (until an event is queued). sleep(0) queues a timer event that will be fired after exactly one tick.
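To illustrate the difference, sleep(0) is roughly equivalent to this (simplified from the stock implementation):

-- roughly what sleep(0) does internally
local timer = os.startTimer(0)              -- queues a timer event for the next tick
repeat
    local _, id = os.pullEvent("timer")     -- only resumes for timer events
until id == timer

-- a bare coroutine.yield() / os.pullEvent(), by contrast, creates no event
-- itself, so with an empty queue it can block indefinitely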
So if I understand correctly the queue is being flooded with more than 256 events.
According to my current setup:
nTurtles = number of turtles/computers (60)
nGps = number of GPS hosts (4)
nHosts = number of "hosts" (1-2)
For each request:
If those messages are not handled/pulled in time and the queue gets above 256, the initial timer event gets discarded. That's the main issue.
Somewhat unrelated to the problem itself, I decided to get rid of rednet entirely. This is mainly to avoid a duplicate "while true do os.pullEvent()" loop and, in turn, a duplicate check of the modem_messages. Looking up just the host should be done via a separate channel which only the hosts have open at all times. In a worst-case scenario this results in nHosts responses per request. A separate ping/lookup for all computers can be implemented via the host or any other computer acting as a message broker. The GPS issue persists for now but should be mitigated by the reduced amount of events.
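A minimal sketch of that separate lookup channel using the raw modem API (the channel number 1000 and the "who_is_host" message are placeholders, not the actual bluenet protocol):

-- host side: only hosts keep the lookup channel open
local LOOKUP_CHANNEL = 1000                       -- placeholder channel number
local function hostLoop()
    local modem = peripheral.find("modem")
    modem.open(LOOKUP_CHANNEL)
    while true do
        local _, _, channel, replyChannel, message = os.pullEvent("modem_message")
        if channel == LOOKUP_CHANNEL and message == "who_is_host" then
            modem.transmit(replyChannel, LOOKUP_CHANNEL, { id = os.getComputerID() })
        end
    end
end

-- client side: broadcast one request, expect at most nHosts replies
local function lookupHost(timeout)
    local modem = peripheral.find("modem")
    modem.open(os.getComputerID())                -- own ID doubles as the reply channel
    modem.transmit(LOOKUP_CHANNEL, os.getComputerID(), "who_is_host")
    local timer = os.startTimer(timeout or 2)
    while true do
        local event, p1, channel, _, reply = os.pullEvent()
        if event == "modem_message" and channel == os.getComputerID() then
            return reply.id                       -- first host to answer wins
        elseif event == "timer" and p1 == timer then
            return nil                            -- no host answered in time
        end
    end
end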
It took me a while to figure out how to manipulate the BIOS, but here's how to kill rednet.run (perhaps this is also possible via debug if the coroutine ID is known):
local function biosWithoutRednet()
    -- run the shell without rednet
    os.unloadAPI("rednet")
    local ok, err = pcall(
        function()
            local sShell
            if term.isColour() and settings.get("bios.use_multishell") then
                sShell = "rom/programs/advanced/multishell.lua"
            else
                sShell = "rom/programs/shell.lua"
            end
            os.run({}, sShell)
            os.run({}, "rom/programs/shutdown.lua")
        end
    )
    -- [...] error handling, shutdown etc., see bios.lua
end
setfenv(biosWithoutRednet, _G)
-- trigger pullEventRaw in rednet.run to fail
-- ignore the error and go to os.shutdown
local originalError = _G.error
local originalShutdown = _G.os.shutdown
local originalPullEventRaw = _G.os.pullEventRaw
_G.error = function() end
_G.os.pullEventRaw = nil
_G.os.shutdown = function()
    -- intercept shutdown and restore functions
    _G.error = originalError
    _G.os.pullEventRaw = originalPullEventRaw
    _G.os.shutdown = originalShutdown
    -- start the shell again, without rednet
    return biosWithoutRednet()
end
startup.lua:
if rednet then
    shell.run("runtime/killRednet.lua")
    return
end
Another quirk with rednet.receive / using the timer as an exit condition:
If timers are created too fast, the event queue gets flooded. Quite obvious in hindsight. E.g. I'm sending and receiving 10,000 messages per second. With each receive call, another timer is being created. But timer events can only be processed in 50 ms intervals. So once more than 256 timers get created within a 50 ms interval, the queue automatically gets flooded by all those timers in the tick they are due. This doesn't lead to any infinite loops or anything else dealbreaking, but it can result in lost messages.
-- max. processed timers each tick is 255
In my own implementation I now avoid this by keeping track of the last time/tick a timer was created. By cancelling outdated timers, other coroutines are also not unnecessarily resumed after listening for a message. Since rednet is loaded globally, I assume the timers should also be distinguished by protocol/coroutine.
local timerClocks = {}
local timers = {}

function receive(protocol, waitTime)
    local timer = nil
    local eventFilter = nil
    if waitTime then
        local t = os.clock()
        if timerClocks[waitTime] ~= t then
            -- cancel the previous timer and create a new one
            os.cancelTimer((timers[waitTime] or 0))
            timer = os.startTimer(waitTime)
            timerClocks[waitTime] = t
            timers[waitTime] = timer
        else
            timer = timers[waitTime]
        end
        eventFilter = nil
    else
        eventFilter = "modem_message"
    end
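    -- [...] event pulling / message handling continues here (omitted from the snippet)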
Probably not that important, but for high-volume messaging this causes a lot of overhead.
I actually have a coroutine library that allows you to create unlimited timers by iterating ticks, as I mentioned above: https://github.com/zyxkad/cc/blob/master/coroutinex.lua#L488
With each receive call, another timer is being created
Note that this should only be the case when a timeout is passed. If you're receiving 10k messages a second, you probably don't need a timeout in the first place!
Though yes, we should cancel the timer if it expires.
I actually have a coroutine library that allows you to create unlimited timers by iterating ticks, as I mentioned above: https://github.com/zyxkad/cc/blob/master/coroutinex.lua#L488 @zyxkad
Very interesting, I'll look into it in more detail in the next few days. Thanks!
Note that this should only be the case when a timeout is passed. If you're receiving 10k messages a second, you probably don't need a timeout in the first place!
Though yes, we should cancel the timer if it expires. @SquidDev
Not in testing, but I need the timeout to trigger onNoAnswer events in "real" applications. While I send lots of messages, they all come from different clients and require ACKs (like MQTT QoS 1). If one client/broker has an issue or lags behind, this must result in an appropriate reaction for that specific message. If I don't check for missed messages via timers, a client might get "stuck" without being able to resolve the issue. Once the timer runs out, the clients have to republish their accumulated logs and are only allowed to discard them after they have received an ACK.
The 10k messages are quite excessive for testing purposes but still highlight the overhead caused by the timers.
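For context, a stripped-down sketch of that ACK/republish pattern (the names publish, onAck and onNoAnswer are invented for the example, not the actual bluenet code):

local pending = {}                            -- messages that have not been ACKed yet
local nextId = 0

local function publish(send, payload)         -- `send` is whatever transport is in use
    nextId = nextId + 1
    pending[nextId] = payload
    send({ id = nextId, payload = payload })
    return nextId
end

local function onAck(id)
    pending[id] = nil                         -- only now is it safe to discard the log
end

local function onNoAnswer(send)               -- called when the receive timer runs out
    for id, payload in pairs(pending) do      -- republish everything still unACKed
        send({ id = id, payload = payload })
    end
end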
Minecraft Version
1.21.x
Version
1.21.1-fabric-1.113.1
Details
During high-lag situations, especially when loading the world, rednet.lookup does not return / resume. I observed this behaviour in gps.locate and rednet.host as well. Since rednet.host uses rednet.lookup, this is somewhat plausible. Because gps.locate uses a very similar loop to rednet.lookup, I assume the issue lies with the timer (os.startTimer) event being used as an exit condition. I don't know if it is possible for a process to miss a timer event; however, this looks like the only commonality between those three scenarios.
Except for rednet, no other process is using os.pullEvent in the cases where I encountered this issue (except for the BIOS, I guess). This usually happens after repeatedly calling rednet.lookup to wait for the host to come online. (It would be nice to be able to manually set the timeout for rednet.lookup, by the way.)
I only encountered this when loading the world, which results in my turtles being stuck because they call gps.locate, rednet.host and rednet.lookup on startup. The only fix is to manually restart the turtles or to preload the world into memory and join again. Quickly reloading the shaders using "R" helps with reproducing this. No error message is displayed, and even after 30+ minutes the turtles do not recover.
Screenshots:
rednet.lookup using parallel.waitForAll 2 processes
forced parallel lookups all locking at once instead of just one
rednet.lookup singular main process 6th call
rednet.host parallel.waitForAll 2 processes (bluenet is a more lightweight implementation of rednet but uses the default rednet.lookup and rednet.host because of its response mechanism being started in the bios)
gps.locate singular main process, 2nd call
UPDATE: 18.10.2024
After removing rednet entirely, the only bug I still couldn't get rid of was the one seen in the last screenshot about gps.locate. It seems this is actually an issue with turtle.forward. During stress testing I found that the first turtle.forward statement does not return or throw any errors. Since it was right before gps.locate, I initially assumed the issue lay with the GPS.
turtle.forward indefinitely locking up
This only happens when immediately calling turtle.forward after opening three shells. I got rid of the bug by opening the shells after initializing the orientation of my turtle. turtle.forward does get stuck, but the turtle itself did in fact move a block forward and was not blocked.