Aiven-Open / pghoard

PostgreSQL® backup and restore service
http://aiven-open.github.io/pghoard/
Apache License 2.0

refresh site creds on file fetcher processes #613

Closed: kathia-barahona closed this 1 month ago

kathia-barahona commented 5 months ago

About this change - What it does

pghoard dispatches processes that are in charge of fetching files from sites. When starting such a process, pghoard passes its config as an argument. This means that if pghoard gets restarted with a different config (e.g. the object storage got new credentials), the running file fetcher processes won't notice and will keep using the old config.

So, in order to change this behavior, it's better to provide the current config directly with each task instead of with the process itself. This way the process can update the transfer to the object storage (in case the site's config changed).
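For illustration only, here is a minimal sketch of the idea; the names below (enqueue_fetch, fetch_files, task_queue) are hypothetical and do not match pghoard's actual code. Instead of handing the config to the fetcher process once at creation time, each enqueued task carries the config the fetch should use.

import multiprocessing

task_queue: multiprocessing.Queue = multiprocessing.Queue()

# Before: the fetcher process gets the config once as a start-up argument and
# keeps using it even after pghoard is restarted with new credentials.
# process = multiprocessing.Process(target=fetch_files, args=(config, task_queue))

# After: every task carries the config current at enqueue time, so a fetch
# picked up after a config reload uses the refreshed object storage credentials.
def enqueue_fetch(site: str, filename: str, current_config: dict) -> None:
    task_queue.put({"site": site, "filename": filename, "config": current_config})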

Resolves: #BF-2385

rikonen commented 4 months ago

Since those Queues are based on pipes, it means we will pipe a lot of redundant data on each IPC, which may hurt performance as it will need to be serialized / deserialized.

Is this really a problem? I thought of the same myself, but the full config should be in the range of a few kilobytes and the cost of just sending it all over does not seem especially high, unless I'm missing some big inefficiency here.

rdunklau commented 4 months ago

Looks like a default config on a fresh service, once pickled, is in the 5 kB range. Compared to the other members of that tuple, that's huge, so the performance impact should at least be evaluated. But regardless of that point, the other one still stands: what happens to all previously enqueued tasks, which will be fetched using the wrong config?
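For reference, since multiprocessing queues pickle their payloads, the per-task overhead can be checked with a small snippet along these lines (pghoard.json standing in here as a placeholder path for the service config):

import json
import pickle

# Load a service config and see how many bytes it would add to each pickled task.
with open("pghoard.json") as f:
    config = json.load(f)
print(f"pickled config size: {len(pickle.dumps(config))} bytes")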

For minimal changes, we could use a separate queue for config update events: checking it without blocking means we could update the config at any time and all subsequent tasks would use the new config.
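As a rough sketch of that idea (the names worker, config_queue and handle_task are made up for illustration, not pghoard's actual classes): the worker drains the dedicated config queue with a non-blocking get before handling each task, so a pushed config update applies to all subsequent tasks.

import multiprocessing
import queue

def handle_task(task: dict, config: dict) -> None:
    # Placeholder for the actual file fetch performed with the given config.
    print(task, sorted(config))

def worker(task_queue: multiprocessing.Queue, config_queue: multiprocessing.Queue, config: dict) -> None:
    while (task := task_queue.get()) is not None:
        # Pick up any pending config updates without blocking; the last one wins.
        try:
            while True:
                config = config_queue.get_nowait()
        except queue.Empty:
            pass
        handle_task(task, config)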

rikonen commented 4 months ago

The acute problem being fixed here is that pghoard does not refresh keys at all, so key rotation cannot be completed without restarting the application. For that particular problem it is irrelevant if the old key is used a bit longer, as there is anyway a long grace period of key inactivity after which the old key is disabled. For cases where the storage location actually gets changed this could be more relevant, though in that case too the exact point at which data starts getting read from / written to the other location is arbitrary, and touching the queued events does not feel like it would have a big impact one way or the other.

A separate queue would work, but I'd check the actual performance impact first. My guess is the overhead is in the low single-digit milliseconds, which would probably be acceptable.

rdunklau commented 4 months ago

I'm fine with both of those points, as long as they are considered :-)

rikonen commented 2 months ago

Is there a plan to work on this? The problem is still very valid.

kathia-barahona commented 2 months ago

Is there a plan to work on this? The problem is still very valid.

Hi! Sorry, I had to pause this task for a bit. I agreed with @rdunklau that I'll measure how much this affects restoration; I will give it some priority.

kathia-barahona commented 2 months ago

@rdunklau I did multiple test runs and measured restoration times. I saw no major difference after including these changes. Considered DB sizes (MB): 100, 300, 600, 1000. The dataset was not super big, but I don't think a bigger one would show a larger impact.

rikonen commented 2 months ago

It was unlikely the performance would be so much worse that it would be visible unless you used a very heavy stress test, but in this case a synthetic test should be quite sufficient, because you can easily simulate the config either being passed as part of the tasks or not being passed there. For example, the following test app should work:

import json
import multiprocessing
import time

def task_handler(task_queue: multiprocessing.Queue, result_queue: multiprocessing.Queue) -> None:
    # Echo back the task id until the None sentinel is received.
    while (task := task_queue.get()) is not None:
        id_ = task["id"]
        result_queue.put(id_)

def main() -> None:
    manager = multiprocessing.Manager()
    result_queue = manager.Queue()
    task_queue = manager.Queue()
    process = multiprocessing.Process(target=task_handler, args=(task_queue, result_queue))
    process.start()
    # Wait for one message to be processed to ensure target process is running normally
    task_queue.put({"id": -1})
    result_queue.get()
    with open("pghoard.json") as f:
        full_config = json.load(f)
    start_time = time.monotonic()
    task_count = 10_000
    for id_ in range(task_count):
        # Drop the "config" key here to measure the baseline without the full config.
        task_queue.put({"id": id_, "config": full_config})
    task_queue.put(None)  # sentinel telling the handler to exit
    for _ in range(task_count):
        result_queue.get()
    duration_ms = (time.monotonic() - start_time) * 1000
    time_per_task_ms = duration_ms / task_count
    print(f"Processed {task_count} tasks in {duration_ms:.1f} milliseconds; {time_per_task_ms:.2f} milliseconds per task")
    process.join()

if __name__ == "__main__":
    main()

This gives me consistently 0.11ms of processing time per task when not passing the full config and 0.13ms when passing it. So the overhead isn't even a full millisecond but rather 0.02 milliseconds, which is completely negligible given that the actual task processing is way heavier than the 0.11ms no-op time.

@rdunklau does this validation seem sufficient to you?

rdunklau commented 1 month ago

Sorry, I missed the previous comments.