Open helpmyRF24isntworking opened 1 month ago
It could also be that the processes were killed silently or simply not resumed after they yielded; however, this does not happen at any other spot that yields or uses os.pullEvent except for those loops.
CC does not have a concept of processes, so that's not likely what is happening. But it does look like the event queue might be getting overloaded and timer events discarded, leading to rednet.lookup hanging forever? It would have a similar effect on gps.locate if the world had only 4 hosts and one of the responses got discarded?
rednet.host is most likely just an echo effect of rednet.lookup, because it performs a lookup when hosting (to ensure it's not duplicating an existing hosted name).
Though I am highly suspicious of this bluenet you are showcasing here. Have any of those issues happened on plain rednet?
EDIT: I found your bluenet code; at a glance it does not have anything that would cause the event queue to be overloaded by itself. It might still be natural event queue overload, which would have symptoms like you described.
If you are flooding the event queue, getting the turtles to randomly delay when they start might help.
Yep, if you have a massive amount of modem traffic and rednet hosting in the world, queue overflow is possible. If you want to control a swarm of computers, you can try using a websocket, where you can compress multiple messages into a single event.
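For reference, roughly how that could look in CC: Tweaked, assuming some external relay server that forwards frames between computers (the ws://localhost:8080/relay address is just a placeholder):

local ws, err = http.websocket("ws://localhost:8080/relay") -- placeholder relay address
if not ws then
    error("websocket failed: " .. tostring(err))
end

-- pack several outgoing messages into one websocket frame
local batch = {}
for i = 1, 10 do
    batch[#batch + 1] = { target = i, payload = "hello " .. i }
end
ws.send(textutils.serialiseJSON(batch))

-- one incoming frame can likewise carry many messages, so a burst of reports
-- arrives as one websocket_message event instead of many modem_message events
local frame = ws.receive(5) -- wait up to 5 seconds
if frame then
    for _, msg in ipairs(textutils.unserialiseJSON(frame)) do
        print(msg.target, msg.payload)
    end
end
ws.close()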
CC does not have a concept of processes, so that's not likely what is happening. But it does look like the event queue might be getting overloaded and timer events discarded, leading to rednet.lookup hanging forever? It would have a similar effect on gps.locate if the world had only 4 hosts and one of the responses got discarded?
rednet.host is most likely just an echo effect of rednet.lookup, because it performs a lookup when hosting (to ensure it's not duplicating an existing hosted name).
~Though I am highly suspicious of this bluenet you are showcasing here. Have any of those issues happened on plain rednet?~ EDIT: I found your bluenet code; at a glance it does not have anything that would cause the event queue to be overloaded by itself. It might still be natural event queue overload, which would have symptoms like you described. @Wojbie
Yeah, I just dumped the folder from my host turtle, nothing pretty. The main logic for bluenet is in classBluenetNode. bluenet.lua itself is basically just for opening and closing the channels as a global API. Lookup and host are the original rednet implementation. The actual messaging is handled by the nodes to reduce function calls to the global API.
I have 60+ turtles in my test world, but I made sure that startup happens in steps, no delay though. During startup no new events/modem_messages are queued without the host being online in the first place, except for the DNS/GPS messages of course. Those turtles fail before the host even boots.
If you are flooding the event queue, getting the turtles to randomly delay when they start might help. @Lupus590
True, but also quite annoying. I don't know about the limitations of the queue, but I can't imagine it being so low that not even 60 computers at startup are supported. It also only happens during lag, which in my mind only slows down the code and doesn't actually increase the number of events being queued dramatically.
The limit of the queue is 256, however this limit is per computer, which means you should be aware of it if you are using parallel to queue tasks at the same time. If you only have 60 computers at a time, that means you would need about 4 parallel tasks each queueing an event at the same time to hit it. A good practice is to try to optimize multiple sleep(0) calls into a single global tick event.
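A rough sketch of that idea, assuming the tasks run under parallel (the "global_tick" event name is made up for the example): one coroutine owns the single timer and everyone else waits for the shared event instead of each calling sleep(0):

-- one coroutine owns the only timer and turns it into a shared event
local function tickSource()
    while true do
        local timer = os.startTimer(0) -- fires on the next tick
        repeat
            local _, id = os.pullEvent("timer")
        until id == timer
        os.queueEvent("global_tick")   -- made-up event name
    end
end

-- workers yield on the shared event instead of creating their own timers
local function worker()
    while true do
        os.pullEvent("global_tick")
        -- do one tick's worth of work here
    end
end

parallel.waitForAll(tickSource, worker, worker)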
Yep, if you have a massive amount of modem traffic and rednet hosting in the world, queue overflow is possible. If you want to control a swarm of computers, you can try using a websocket, where you can compress multiple messages into a single event. @zyxkad
Its "only" 60 Turtles, so i wouldnt say thats massive but a considerable amount. As mentioned in my previous reply idk enough about the java side of the queue implementation to give any usable reply to this.
Is it possible to use websockets for direct turtle communication without using HTTP? In classBluenetNode I already implemented my own in-game websocket via the "stream", which just does a basic websocket handshake and combines multiple messages into one singular stream message.
The limit of the queue is 256, however this limit is per computer, which means you should be aware of it if you are using parallel to queue tasks at the same time. If you only have 60 computers at a time, that means you would need about 4 parallel tasks each queueing an event at the same time to hit it. A good practice is to try to optimize multiple sleep(0) calls into a single global tick event. @zyxkad
Oh, 256 is lower than expected. That might explain it, but let's say I have 256 turtles, wouldn't that automatically make rednet.lookup fail each time they start? My goal for this project was to support as many turtles as possible, but for my test world I stuck to just 60.
I have some sleep(0) statements in the other code like pathfinding etc., not during startup though. If I remember correctly, sleep also queues a new timer event, which is why you recommend using a global tick? Could I alternatively just do coroutine.yield if I want to avoid the timeout?
coroutine.yield cannot be used to replace sleep(0), because it may not wait for a single tick, or it may wait forever (until an event is queued). sleep(0) queues a timer event that will be fired after exactly one tick.
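To illustrate the difference, sleep(0) is roughly equivalent to this (simplified from the stock implementation):

-- roughly what sleep(0) does internally
local timer = os.startTimer(0)              -- queues a timer event for the next tick
repeat
    local _, id = os.pullEvent("timer")     -- only resumes for timer events
until id == timer

-- a bare coroutine.yield() / os.pullEvent(), by contrast, creates no event
-- itself, so with an empty queue it can block indefinitely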
So if I understand correctly the queue is being flooded with more than 256 events.
According to my current setup:
nTurtles = number of turtles/computers (60)
nGps = number of GPS hosts (4)
nHosts = number of "hosts" (1-2)
For each request:
If those messages are not handled/pulled in time and the queue gets above 256, the initial timer event gets discarded. That's the main issue.
Somewhat unrelated to the problem itself, I decided to get rid of rednet entirely. This is mainly to avoid a duplicate "while true do os.pullEvent()" loop and, in turn, a duplicate check of the modem_messages. Looking up just the host should be done via a separate channel which only the hosts have open at all times. In a worst-case scenario this results in nHosts responses per request. A separate ping/lookup for all computers can be implemented via the host or any other computer acting as a message broker. The GPS issue persists for now but should be mitigated by the reduced amount of events.
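A minimal sketch of that separate lookup channel using the raw modem API (the channel number 1000 and the "who_is_host" message are placeholders, not the actual bluenet protocol):

-- host side: only hosts keep the lookup channel open
local LOOKUP_CHANNEL = 1000                       -- placeholder channel number
local function hostLoop()
    local modem = peripheral.find("modem")
    modem.open(LOOKUP_CHANNEL)
    while true do
        local _, _, channel, replyChannel, message = os.pullEvent("modem_message")
        if channel == LOOKUP_CHANNEL and message == "who_is_host" then
            modem.transmit(replyChannel, LOOKUP_CHANNEL, { id = os.getComputerID() })
        end
    end
end

-- client side: broadcast one request, expect at most nHosts replies
local function lookupHost(timeout)
    local modem = peripheral.find("modem")
    modem.open(os.getComputerID())                -- own ID doubles as the reply channel
    modem.transmit(LOOKUP_CHANNEL, os.getComputerID(), "who_is_host")
    local timer = os.startTimer(timeout or 2)
    while true do
        local event, p1, channel, _, reply = os.pullEvent()
        if event == "modem_message" and channel == os.getComputerID() then
            return reply.id                       -- first host to answer wins
        elseif event == "timer" and p1 == timer then
            return nil                            -- no host answered in time
        end
    end
end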
It took me a while to figure out how to manipulate the BIOS, but here's how to kill rednet.run (perhaps this is also possible via debug if the coroutine ID is known):
local function biosWithoutRednet()
    -- run the shell without rednet
    os.unloadAPI("rednet")
    local ok, err = pcall(
        function()
            local sShell
            if term.isColour() and settings.get("bios.use_multishell") then
                sShell = "rom/programs/advanced/multishell.lua"
            else
                sShell = "rom/programs/shell.lua"
            end
            os.run({}, sShell)
            os.run({}, "rom/programs/shutdown.lua")
        end
    )
    -- [...] error handling, shutdown etc., see bios.lua
end
setfenv(biosWithoutRednet, _G)
-- trigger pullEventRaw in rednet.run to fail
-- ignore the error and go to os.shutdown
local originalError = _G.error
local originalShutdown = _G.os.shutdown
local originalPullEventRaw = _G.os.pullEventRaw
_G.error = function() end
_G.os.pullEventRaw = nil
_G.os.shutdown = function()
    -- intercept shutdown and restore functions
    _G.error = originalError
    _G.os.pullEventRaw = originalPullEventRaw
    _G.os.shutdown = originalShutdown
    -- start the shell again, without rednet
    return biosWithoutRednet()
end
startup.lua:
if rednet then
    shell.run("runtime/killRednet.lua")
    return
end
Another quirk with rednet.receive / using the timer as an exit condition:
If timers are created too fast, the event queue gets flooded. Quite obvious in hindsight. E.g. I'm sending and receiving 10,000 messages per second. With each receive call, another timer is being created. But timer events can only be processed in 50 ms intervals. So once more than 256 timers get created within a 50 ms interval, the queue automatically gets flooded by all those timers in the tick they are due. This doesn't lead to any infinite loops or anything else dealbreaking, but it can result in lost messages.
-- max. processed timers each tick is 255
In my own implementation I now avoid this by keeping track of the last time/tick a timer was created. By cancelling outdated timers, other coroutines are also not unnecessarily resumed after listening for a message. Since rednet is loaded globally, I assume the timers should also be distinguished by protocol/coroutine.
local timerClocks = {}
local timers = {}

function receive(protocol, waitTime)
    local timer = nil
    local eventFilter = nil
    if waitTime then
        local t = os.clock()
        if timerClocks[waitTime] ~= t then
            -- cancel the previous timer and create a new one
            os.cancelTimer((timers[waitTime] or 0))
            timer = os.startTimer(waitTime)
            timerClocks[waitTime] = t
            timers[waitTime] = timer
        else
            timer = timers[waitTime]
        end
        eventFilter = nil
    else
        eventFilter = "modem_message"
    end
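    -- [...] event pulling / message handling continues here (omitted from the snippet)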
Probably not that important, but for high-volume messaging this causes a lot of overhead.
I actually have a coroutine library that allows you to create unlimited timers by iterating ticks, as I mentioned above: https://github.com/zyxkad/cc/blob/master/coroutinex.lua#L488
With each receive call, another timer is being created
Note that this should only be the case when a timeout is passed. If you're receiving 10k messages a second, you probably don't need a timeout in the first place!
Though yes, we should cancel the timer if it expires.
I actually have a coroutine library that allows you to create unlimited timers by iterating ticks, as I mentioned above: https://github.com/zyxkad/cc/blob/master/coroutinex.lua#L488 @zyxkad
Very interesting, I'll look into it in more detail in the next few days. Thanks!
Note that this should only be the case when a timeout is passed. If you're receiving 10k messages a second, you probably don't need a timeout in the first place!
Though yes, we should cancel the timer if it expires. @SquidDev
Not in testing, but I need the timeout to trigger onNoAnswer events in "real" applications. While I send lots of messages, they all come from different clients and require ACKs (like MQTT QoS 1). If one client/broker has an issue or lags behind, this must result in an appropriate reaction for that specific message. If I don't check for missed messages via timers, a client might get "stuck" without being able to resolve the issue. Once the timer runs out, the clients have to republish their accumulated logs and are only allowed to discard them after they have received an ACK.
The 10k messages are quite excessive for testing purposes but still highlight the overhead caused by the timers.
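For context, a stripped-down sketch of that ACK/republish pattern (the names publish, onAck and onNoAnswer are invented for the example, not the actual bluenet code):

local pending = {}                            -- messages that have not been ACKed yet
local nextId = 0

local function publish(send, payload)         -- `send` is whatever transport is in use
    nextId = nextId + 1
    pending[nextId] = payload
    send({ id = nextId, payload = payload })
    return nextId
end

local function onAck(id)
    pending[id] = nil                         -- only now is it safe to discard the log
end

local function onNoAnswer(send)               -- called when the receive timer runs out
    for id, payload in pairs(pending) do      -- republish everything still unACKed
        send({ id = id, payload = payload })
    end
end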
Minecraft Version
1.21.x
Version
1.21.1-fabric-1.113.1
Details
During high-lag situations, especially when loading the world, rednet.lookup does not return / resume. I observed this behaviour in gps.locate and rednet.host as well. Since rednet.host uses rednet.lookup, this is somewhat plausible. Because gps.locate uses a very similar loop to rednet.lookup, I assume the issue lies with the timer (os.startTimer) event being used as an exit condition. I don't know if it is possible for a process to miss a timer event; however, this looks like the only commonality between those three scenarios.
Except for rednet, no other process is using os.pullEvent in the cases where I encountered this issue (except for the BIOS, I guess). This usually happens after repeatedly calling rednet.lookup to wait for the host to come online. (It would be nice to be able to manually set the timeout for rednet.lookup, by the way.)
I only encountered this when loading the world, which results in my turtles being stuck because they call gps.locate, rednet.host and rednet.lookup on startup. The only fix is to manually restart the turtles or to preload the world into memory and join again. Quickly reloading the shaders using "R" helps with reproducing this. No error message is displayed, and even after 30+ minutes the turtles do not recover.
Screenshots:
rednet.lookup using parallel.waitForAll 2 processes
forced parallel lookups all locking at once instead of just one
rednet.lookup singular main process 6th call
rednet.host parallel.waitForAll 2 processes (bluenet is a more lightweight implementation of rednet but uses the default rednet.lookup and rednet.host because of its response mechanism being started in the bios)
gps.locate singular main process, 2nd call
UPDATE: 18.10.2024
After removing rednet entirely, the only bug I still couldn't get rid of was the one seen in the last screenshot about gps.locate. It seems this is actually an issue with turtle.forward. During stress testing I found that the first turtle.forward statement does not return or throw any errors. Since it was right before gps.locate, I initially assumed the issue lay with the GPS.
turtle.forward indefinitely locking up
This only happens when immediately calling turtle.forward after opening three shells. I got rid of the bug by opening the shells after initializing the orientation of my turtle. turtle.forward does get stuck, but the turtle itself did in fact move a block forward and was not blocked.