celestialorb / factorio-prometheus-exporter

A mod for Factorio to produce a variety of Prometheus metrics.
MIT License
9 stars 2 forks source link

Feature Suggestion: Improving Performance by Optionally Disabling Client-side Execution #16

Open ZCK12 opened 5 days ago

ZCK12 commented 5 days ago

Now that the metrics interval can be set with a mod option, I was able to diagnose some minor hitching happening for my friends connected to the server. They are small freezes that happen exactly as the mod is gathering the internal statistics and last for maybe 50ms.

People with slower computers seem to experience longer freezes, so I think this is caused by the mod gathering statistics locally on client machines rather than from the server itself.

I thought this might be from file I/O, so I changed the file permissions to block all writes into metrics json file. This didn't seem to affect the freezes at all, so I feel like the issue purely stems from enumerating the internal Factorio stats rather than disk I/O.

Would it be possible to create a mod option which disables gathering stats on client machines?

celestialorb commented 4 days ago

Yes! It would certainly be possible and I was actually thinking of adding this in soon. Unfortunately my workstation died this week and so I'm out of commission until my new one arrives (or I replace the borked motherboard).

I have a backup laptop that I could create the feature on, but testing it would be very slow as it's nearly seven years old and Factorio takes a while to load...though it actuallly might be the perfect candidate for testing for a difference in performance now that I think about it.

celestialorb commented 4 days ago

One issue that I can think of is, what value should be used as the default for enabling the exporting of metrics? I think defaulting to enabled is the best way since it (a) preserves current behavior and (b) makes installing the mod all that's necessary (no need to go into the settings to enable it).

However, I'd kind of want the mod to automatically be disabled client-side but there's not a way to distinguish if we're running "client-side" or "server-side" in Factorio, meaning everyone who installs the mod in order to join the server will then need to disable it if it affects their performance. I'm pretty sure that's the best we can do here.

ZCK12 commented 4 days ago

I was also thinking about how to differentiate between client/server. I know the write_file helper function has an optional for_player? argument that disables writing to client folders. But I can't think of how to prevent executing the data collection part of the script using this...

celestialorb commented 4 days ago

I've been looking into this today and I'm really struggling to find a way to disable it client-side. By design, Factorio mods are intended to run in exactly the same manner between server and client in order to preserve determinism so that makes this difficult; a client can't simply choose to have different mod settings than the server. As a result of that design, all mod settings between clients and the server must match.

I was thinking I could leverage the settings-per-user option but unfortunately I don't think that would work either as I'd need to know the player associated with the Factorio instance running and that doesn't seem to be available in any context.

I don't think using the write_file helper would work either as AFAIK there isn't a way to read any external files so a setting can't be passed that way.

This might be possible by somehow abusing RCON, but I'm not confident and I'm not sure I'd want to really dig into that option to solve this.

Until we can figure out a way to either be able to associate the player of the current Factorio instance or otherwise have differing behavior between Factorio instances I don't think we can disable it client-side. I'd recommend setting a higher value for the metrics interval, as the default of 60 -- being once every 60 ticks -- is once per second, which is likely higher than you'd need anyway for a standard Prometheus + Grafana setup. If your Prometheus is scraping every 15 seconds you can set a metrics interval of 900 to match it, or 3600 if Prometheus is scraping every minute.

ZCK12 commented 4 days ago

Fair enough mate. This would be bending rules of Factorio's deterministic system, so it makes sense there's no straightforward way to do this.

It's a shame since obviously this mod wouldn't desynchronise game state, but that doesn't mean we're above the rules. Lock-step it is 😄

ChrislyBear-GH commented 3 days ago

How about trying to write a file only on the host machine - this could be a player hosted server or a dedicated server - and test for it?? Not sure if this works, but maybe that's a way to go.

-- control.lua

-- Function to determine if the current instance is the server or the host player
local function is_server()
    local test_filename = "__IsServerTest__.txt"
    local test_content = "server_check"

    -- Last parameter "for_player" = 0 only writes on the server. See LuaHelpers.html#write_file
    local success, _ = pcall(function()
        game.write_file(test_filename, test_content, false, 0)
    end)

    return success
end

-- Function to collect and write metrics
local function collect_metrics()
-- .... logic that runs in under 1/60th of a second!! if not, split the workload!....
    -- Serialize data to JSON and write to file 
    local json = game.table_to_json(data)
    game.write_file("metrics_log.txt", json .. "\n", true)
end

-- Event handler to collect metrics every 600 ticks (10 seconds)
script.on_nth_tick(600, function(event)
    if is_server() then
        collect_metrics()
    end
end)
celestialorb commented 3 days ago
-- Function to determine if the current instance is the server or the host player
local function is_server()
    local test_filename = "__IsServerTest__.txt"
    local test_content = "server_check"

    -- Last parameter "for_player" = 0 only writes on the server. See LuaHelpers.html#write_file
    local success, _ = pcall(function()
        game.write_file(test_filename, test_content, false, 0)
    end)

    return success
end

Hrm, that might work. I missed the part of providing 0 will only write to the server's output if present.. The question here is what is "the server" in this context. I have a feeling that providing 0 would still write the file out on all Factorio instances. I think I could test this by copying my local mod to a server I have setup so that the checksums match and I can join it.

Aside from this, I've been pondering on other solutions to the main issue:

Option (1) would be to identify the slow parts of metric collection and reduce the frequency of more resource-intensive collections (or add settings to configure how often certain metrics are collected).

Option (2) would be to potentially skip metric collection all together and write out larger blocks of data and let the Prometheus exporter be the component responsible for filtering through it for the metrics we're interested in. Though I'm fairly certain the additional time taken to write out the extra data to disk would be greater than the time saved by not looping through and filtering out for the metrics we're interested in; could try and see if I can test it though.

ChrislyBear-GH commented 3 days ago

Yes, that's a point of uncertainty, that I'm currently testing myself with a copy of your mod. How is a dedicated server and a player hosted game behaving differently when supplying a 0 for write file.

If it's the same, then we might have won the jackpot. If not, further investigaion is necessary.

Edit:

Tried it... didn't work. "success" in "is_server()" will always evaluate to "false"...

Edit2:

It was partly successful. The file will only be written on the server with the "0" as parameter for write_file. Yeah! But you won't get any indication if the file was written or not.

But I don't think, that the issue is the writing of a 100kb once every few seconds. I think writing 100kb in 1/60 of a sec (which is equal to about 6MB/sec write speed) should be no problem. The problem is probably the collection of the metrics. I think the problem lies here:

-- ...

local function update_item_metrics()
    metrics["forces"]["player"]["items"] = {}
    for surface, _ in pairs(game.surfaces) do
        metrics["forces"]["player"]["items"][surface] = {}
        for name, _ in pairs(prototypes.item) do
            metrics["forces"]["player"]["items"][surface][name] = {}
            metrics["forces"]["player"]["items"][surface][name]["consumption"] = game.forces["player"]
                .get_item_production_statistics(surface)
                .get_output_count(
                    name)
            metrics["forces"]["player"]["items"][surface][name]["production"] = game.forces["player"]
                .get_item_production_statistics(surface)
                .get_input_count(
                    name)
        end
    end
end

-- ...

First we iterate over EVERY item prototype. Then we call get_item_production_statistics() for EVERY item prototype AND filter it. Maybe it's faster to put the statistics into a variable and then loop over it?

-- ...

local function update_item_metrics()
    metrics["forces"]["player"]["items"] = {}
    for surface, _ in pairs(game.surfaces) do
        metrics["forces"]["player"]["items"][surface] = {}

        -- get the stats object once
        local stats = game.forces["player"].get_item_production_statistics(surface)

        for name, _ in pairs(prototypes.item) do
            metrics["forces"]["player"]["items"][surface][name] = {}
            metrics["forces"]["player"]["items"][surface][name]["consumption"] = stats .get_output_count(name)
            metrics["forces"]["player"]["items"][surface][name]["production"] = stats .get_input_count(name)
        end
    end
end

-- ...

Again, this hasn't been tested... just an idea :)

Chris

Edit3:

I had an idea and it works quite well. If we schedule each of the metric update functions for the coming ticks, we can space out the load. Therefore I'm scheduling the 8 update metrics functions every two ticks from the "on_nth_tick".

-- ...

local function update_surface_metrics()
    -- ...
end

local metric_functions = {
    update_time_metrics,
    update_item_metrics,
    update_fluid_metrics,
    update_pollution_metrics,
    update_surface_metrics,
    update_research_metrics,
    update_rocket_launch_metrics,
    write_metrics
}

local scheduled_functions = {}
local on_tick_handler_registered = false

-- on_tick handler
local function on_tick_handler(event)
    local current_tick = event.tick
    -- print("Handling tick: " .. current_tick)
    if scheduled_functions[current_tick] then
        for _, func in pairs(scheduled_functions[current_tick]) do
            -- print("Executing function on tick " .. current_tick)
            func()
        end
        scheduled_functions[current_tick] = nil
    end

    -- Unregister the on_tick handler if no functions are scheduled
    if next(scheduled_functions) == nil then
        script.on_event(defines.events.on_tick, nil)
        on_tick_handler_registered = false
        -- print("Scheduler empty, unregistering handler.")
    end
end

-- Function to schedule a function to be called in a certain number of ticks
local function schedule_function_in_ticks(ticks_in_future, func)
    local target_tick = game.tick + ticks_in_future
    if not scheduled_functions[target_tick] then
        scheduled_functions[target_tick] = {}
    end
    table.insert(scheduled_functions[target_tick], func)
    -- print("Current tick: " .. game.tick .. " - Scheduling function for tick: " .. target_tick) 

    -- Register the on_tick handler if it's not already registered
    if not on_tick_handler_registered then
        script.on_event(defines.events.on_tick, function(event) on_tick_handler(event) end)
        on_tick_handler_registered = true
        -- print("Registered handler")
    end
end

log("prometheus exporter mod setup starting")

script.on_event(defines.events.on_player_joined_game, update_player_metrics)
script.on_event(defines.events.on_player_joined_game, update_time_metrics)
script.on_event(defines.events.on_player_left_game, update_player_metrics)
script.on_event(defines.events.on_player_left_game, update_time_metrics)

-- Schedule all update functions.
local function schedule_stuff()
    for i, func in ipairs(metric_functions) do
        schedule_function_in_ticks(i*2, func)
    end
end

script.on_init(function()
    -- Schedule the collector functions over the next ticks
    -- print("Inital scheduling...")
    schedule_stuff()
end)

local update_rate = settings.startup["prometheus-exporter-metrics-update-every-nth-tick"].value
script.on_nth_tick(update_rate, function(event)
    -- print("Rescheduling every " .. update_rate .. " ticks...")
    schedule_stuff()
end)

log("prometheus exporter mod setup completed")
ZCK12 commented 3 days ago

It feels like we're going down the rabbit hole a bit here 😄

I like the idea of spacing out the metrics collection. While it adds complexity to an otherwise very simple mod, it would definitely improve performance.

On a side note... while I don't have any experience with Factorio mods, I feel like the scheduling logic there might be overkill. Could it be reduced to something along the lines of this:

local index = (current_tick % #metric_functions) + 1
-- extra code for delays between gathering, or between batches.
local func = metric_functions[index]
if func then
    func()
end

Then register it to run every tick? It would save the complexity of trying to schedule events in the future as well as making the code stateless.

ChrislyBear-GH commented 3 days ago

I think the "extra code to delay stuff" is exactly what I wrote above 😂

Only taking the mod and hoping for the best won't be enough, I think. In my opinion it should be more event based than tick based. E.g. update build statistic only if something has been built or destroyed. This saves valuable compute time.

I don't think going completely stateless would be beneficial, because the metrics themselves are already carried along and are part of the mod's state. Completely stateless would mean having to dump all the stats every time, but that's too expensive.

I had a look at other stats mods like graftorio2 etc.: They are very event based and run only every 5 minutes or so.

Where to go from here exactly... I'm not sure. It seems like dumping stats is rather costly, which is a pity.

Chris

celestialorb commented 2 days ago

Hrm, my initial thought when authoring this mod was that updating on events would likely end up being more process-intensive than updating periodically (e.g. updating every time an item is produced could cause tons of updates per second versus updating the number of items produced periodically once a second).

I think the next step for me is to start profiling some of these sections of code. For example the get_item_production_statistics() may be more expensive than I had assumed. Looks like Factorio comes with a profiler so I'll get started on collecting some data.

Also, thanks for the input! Really hoping to get this mod in state where it can be used without performance issues.

@ChrislyBear-GH @ZCK12 By the way, what's the state of your saves when you experience the performance issues? I haven't encountered them personally but all my saves are in early-game states and I haven't touched any of the Space Age DLC yet. Do you experience it in early-game?

ChrislyBear-GH commented 2 days ago

I'm currently at a rather early state in the game and I'm running it on a quite capable server. Not sure why the stat collection takes so long.

Since this really felt like a rabbit hole, I have changed my approach: I have taken the apporach of https://github.com/janten/factorio-stats and I'm now running his little python container, that serves stats on demand directly to Prometheus! The stats are gathered via RCON and aren't impacting the game performance anymore! Not sure why this is different though. The advantage is that this approach doesn't requre any mod at all!

I have revised the LUA script, that is executed on a scrape. This is my version:

-- gather-stats.lua

rcon.print('factorio_tick ' .. game.tick)
rcon.print('factorio_ticks_played ' .. game.ticks_played)
rcon.print('factorio_connected_players ' .. #game.connected_players)

if #game.connected_players > 0 then 

    for k, force in pairs(game.forces) do
        rcon.print('factorio_rockets_launched{force="' .. force.name .. '"} ' .. force.rockets_launched)
        rcon.print('factorio_research_progress{force="' .. force.name .. '"} ' .. force.research_progress)
        for _, item in ipairs(force.items_launched) do
            rcon.print('factorio_items_launched{force="' .. force.name .. '", name="' .. item.name or "nil" .. '", count="' .. item.count or 0 .. '", quality="' .. item.quality .. '"} ')
        end
        for surface, _ in pairs(game.surfaces) do
            for entity, amount in pairs(force.get_kill_count_statistics(surface).input_counts) do
                rcon.print('factorio_killed{surface="' .. surface .. '", force="' .. force.name .. '", name="' .. entity .. '"} ' .. amount)
            end
            for entity, amount in pairs(force.get_kill_count_statistics(surface).output_counts) do
                rcon.print('factorio_killed_by{surface="' .. surface .. '", force="' .. force.name .. '", name="' .. entity .. '"} ' .. amount)
            end
            for item, amount in pairs(force.get_item_production_statistics(surface).input_counts) do
                rcon.print('factorio_production{surface="' .. surface .. '", type="item", force="' .. force.name .. '", name="' .. item .. '"} ' .. amount)
            end
            for item, amount in pairs(force.get_item_production_statistics(surface).output_counts) do
                rcon.print('factorio_consumption{surface="' .. surface .. '", type="item", force="' .. force.name .. '", name="' .. item .. '"} ' .. amount)
            end
            for fluid, amount in pairs(force.get_fluid_production_statistics(surface).input_counts) do
                rcon.print('factorio_production{surface="' .. surface .. '", type="fluid", force="' .. force.name .. '", name="' .. fluid .. '"} ' .. amount)
            end
            for fluid, amount in pairs(force.get_fluid_production_statistics(surface).output_counts) do
                rcon.print('factorio_consumption{surface="' .. surface .. '", type="fluid", force="' .. force.name .. '", name="' .. fluid .. '"} ' .. amount)
            end
            for entity, amount in pairs(force.get_entity_build_count_statistics(surface).input_counts) do
                rcon.print('factorio_production{surface="' .. surface .. '", type="entity", force="' .. force.name .. '", name="' .. entity .. '"} ' .. amount)
            end
            for entity, amount in pairs(force.get_entity_build_count_statistics(surface).output_counts) do
                rcon.print('factorio_consumption{surface="' .. surface .. '", type="entity", force="' .. force.name .. '", name="' .. entity .. '"} ' .. amount)
            end
        end
    end
    for surface, _ in pairs(game.surfaces) do
        for pol_ent, amount in pairs(game.get_pollution_statistics(surface).input_counts) do
            rcon.print('factorio_production{surface="' .. surface .. '", type="pollution", force="player", name="' .. pol_ent .. '"} ' .. amount)
        end
        for pol_ent, amount in pairs(game.get_pollution_statistics(surface).output_counts) do
            rcon.print('factorio_consumption{surface="' .. surface .. '", type="pollution", force="player", name="' .. pol_ent .. '"} ' .. -amount)
        end
    end

end

I'm currently gathering the stats every 10 seconds and I don't notice anything, no stutter at all. The only thing missing for me are energy statistics. Maybe I'll extend the script soon.

Chris

celestialorb commented 2 days ago

Hrm, interesting approach. Use of RCON might mean that people will need to additionally expose the RCON port, but maybe not if they run the exporter as a sidecar in Kubernetes. Would certainly love to remove the need to install a mod as it increases the development turnaround time and installation complexity!

I'll experiment with this approach and see if it works for me as well, thanks! Would also mean that clients aren't exporting metrics!

celestialorb commented 2 days ago

Hitting some minor downsides to using RCON for this. One of them is that it seems each RCON command issued advances the game by one tick, meaning that even if no one is online the game will continue to run, though the rate at which will depend on how often scraping is occurring.

We can control whether or not the game is paused from the RCON commands, so we can pause it if no one is online, but this means when a player joins we'd have to wait until the next scrape for the RCON command to unpause the game. We could have another thread performing the pause/unpause at a higher rate to avoid this though, but this might interfere with other mods that have autopause features.

From my investigations today I think this is primarily stemming from the collection of the entity count. I think it would be best to disable / remove the entity count metric until I can break up the collection of that particular metric.

celestialorb commented 2 days ago

Been working on an RCON-only implementation of it this evening and believe I have it in a working and performant state, though with one fairly major change in design. I've split up the workload by separating out the metrics by group into different Prometheus endpoints (e.g. /metrics/time, /metrics/production, /metrics/pollution, etc.) which allows the user to independently scrape the metrics at different intervals, and/or leave out entire groups of metrics all together if collecting them is too process-intensive.

In order to provide some sort of backwards compatability of the previous endpoint (even though this would still be a breaking change) the /metrics endpoint will change to /metrics/all and will collect and return all metrics.

I'll clean up the implementation tonight and hopefully have a release of a new container image later today or tomorrow.

ZCK12 commented 2 days ago

You're doing god's work mate

celestialorb commented 2 days ago

@ZCK12 @ChrislyBear-GH I've gone ahead and released an early version of the RCON-implementation of this exporter. You can find the v0.3.0-alpha container image here.

Feel free to give it a try, and remember that you won't need the Factorio mod anymore so feel free to remove that from your servers if you're giving this early version a try!

If you need some help with configuring the exporter the README.md file has been updated in #18 to reflect the new exporter.

Once I've put it under some more testing I'll merge it in and release a production version of it.