EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

Performance Issues writing WAL files to S3 #1018

Open RickVenema opened 2 days ago

RickVenema commented 2 days ago

As discussed with @mnencia last Friday:

We use CloudNativePG and send both base backups and archived WAL files to a local S3 appliance.

We noticed that writing WAL files performed poorly, causing the WAL location to fill up and PostgreSQL to run into issues. While investigating options to optimize the connection to our local S3 appliance, we tried to replicate the situation with a Python script that generates a 16MB file and uploads it to the appliance using the Boto3 library. Our script turned out to be surprisingly faster than the implementation in barman-cloud-wal-archive used for archiving WAL to S3. Note that full backups do reach the same performance level as our Python script.

We then changed our script to use the same upload method as built into barman-cloud-wal-archive, and reproduced the same low performance we experienced with barman-cloud.

We suspect the difference lies in the way barman-cloud uses the Boto3 API: with the low-level API (put_object) performance is high, while with the high-level API (upload_fileobj) we experience low performance.

Note that the low-level multipart upload (create_multipart_upload), as used by full backups, also delivers better performance.
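
For reference, this is roughly what we mean by the low-level multipart path; a minimal sketch with a generic boto3 client and placeholder bucket/key names, not Barman's actual implementation:

import boto3

def multipart_put(client, bucket, key, data, part_size=8 * 1024 * 1024):
    # Low-level multipart upload: create the upload, send the parts, complete it.
    mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    for number, offset in enumerate(range(0, len(data), part_size), start=1):
        resp = client.upload_part(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            PartNumber=number, Body=data[offset:offset + part_size],
        )
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})
    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )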

We don't know the exact reason; it might be that upload_fileobj streams from a file object while put_object sends data that is already loaded into memory. Also note that upload_fileobj is the preferred method for uploading files >= 5GB, but WAL segment files are only 16MB.
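
As an aside (this is an assumption on our side about why the high-level path is slower): upload_fileobj goes through boto3's managed transfer layer, whose TransferConfig uses an 8 MB multipart threshold by default, so a 16 MB WAL segment would be split into parts unless a custom config is passed. A minimal sketch of the two call shapes, with placeholder bucket and endpoint names:

import boto3
from boto3.s3.transfer import TransferConfig

client = boto3.client("s3", endpoint_url="https://s3.example.local")  # placeholder endpoint

# Low-level API: a single PUT with the whole 16 MB body in memory.
with open("test.wal", "rb") as f:
    client.put_object(Bucket="my-bucket", Key="00_test_wal", Body=f.read())

# High-level API: managed transfer. Raising multipart_threshold above 16 MB
# keeps the upload as a single PUT, which may also be worth benchmarking.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024)
with open("test.wal", "rb") as f:
    client.upload_fileobj(f, "my-bucket", "00_test_wal", Config=config)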

We are also submitting our test script below for your information. Could you please consider changing the code from upload_fileobj to put_object? If you want, we could provide a pull request.
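
To make the request concrete, here is a hypothetical illustration of the kind of change we mean (the function name and arguments are made up for the example; this is not the actual barman-cloud code):

def archive_wal_segment(client, bucket, key, wal_path):
    # current approach (high-level managed transfer, streaming the file object):
    #     with open(wal_path, "rb") as f:
    #         client.upload_fileobj(f, bucket, key)
    # proposed approach: one PUT of the 16 MB segment read into memory
    with open(wal_path, "rb") as f:
        client.put_object(Bucket=bucket, Key=key, Body=f.read())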

RickVenema commented 2 days ago

@sebasmannem Please follow this issue

RickVenema commented 2 days ago
import json
import os
import time

import boto3
from botocore.config import Config

WAL_SIZE = 16000000  # roughly one 16 MB WAL segment
ITERATIONS = 100  # number of uploads per test
CONFIG = Config()  # botocore client configuration used when creating the S3 resource

def read_creds(cred_file):
    """
    Function to read credentials configuration for the S3 connection.
    Must contain the following fields in JSON format:
    - bucket_name
    - endpoint_url
    - access_key
    - secret_key

    :param cred_file: The file location of the credentials file
    :return: credentials dictionary
    """
    with open(cred_file, "r") as f:
        cred_data = json.load(f)
    return cred_data

def write_wal_file():
    """Write a 16 MB file of zero bytes to disk, simulating a WAL segment."""
    with open("test.wal", "wb") as f:
        f.write(bytes(WAL_SIZE))

def create_session(creds):
    """
    Create an S3 connection from a boto3 Session and resource, timing how
    long the connection setup takes.

    :return: S3 service resource
    """
    start = time.perf_counter_ns()
    s3 = boto3.Session(
        aws_access_key_id=creds['access_key'],
        aws_secret_access_key=creds['secret_key']
    )
    s3_client = s3.resource("s3", endpoint_url=creds['endpoint_url'],
                            config=CONFIG)
    end = time.perf_counter_ns() - start
    print(f"Connection made in {end / 1000000}ms ")
    return s3_client

def run_session_file_in_memory(creds):
    """Benchmark the low-level API: read the whole file into memory, then put_object."""
    print("Session with put_object")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(0, ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as f:
            data = f.read()
        session.meta.client.put_object(Body=data, Bucket=creds['bucket_name'], Key='00_test_wal')
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s

    # Calculate Latency
    latency = sum([_ / 1000000 for _ in latency]) / ITERATIONS
    print(f"Average Latency: {latency}ms")

    # Calculate Throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")

def run_session_file_reading(creds):
    """Benchmark the high-level API: stream the open file object through upload_fileobj."""
    print("Session with upload_fileobj")
    session = create_session(creds)
    write_wal_file()
    latency = [0 for _ in range(0, ITERATIONS)]  # predefined list
    throughput_s = time.perf_counter_ns()
    for i in range(ITERATIONS):
        l_s = time.perf_counter_ns()
        with open("test.wal", "rb") as wal_file:
            session.meta.client.upload_fileobj(
                Fileobj=wal_file, Bucket=creds['bucket_name'], Key='00_test_wal'
            )
        latency[i] = time.perf_counter_ns() - l_s
    throughput_e = time.perf_counter_ns() - throughput_s

    # Calculate Latency
    latency = sum([_ / 1000000 for _ in latency]) / ITERATIONS
    print(f"Average Latency: {latency}ms")

    # Calculate Throughput
    throughput_result = (ITERATIONS * (WAL_SIZE / 1000000)) / (throughput_e / 1000000000) * 8
    print(f"Throughput: {throughput_result}MBit/s")

def run_tests():
    creds = read_creds("creds.json")
    os.environ['REQUESTS_CA_BUNDLE'] = creds['REQUESTS_CA_BUNDLE']  # CA bundle for the appliance's TLS certificate

    run_session_file_reading(creds)
    run_session_file_in_memory(creds)

if __name__ == '__main__':
    run_tests()

"""
creds.json looks like this:
{
  "bucket_name": "",
  "endpoint_url": "",
  "access_key": "",
  "secret_key": "",
  "REQUESTS_CA_BUNDLE": ""
}
"""