EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

barman-cloud-backup process is stuck because the disk of Minio is full. #788

Open yanboer opened 1 year ago

yanboer commented 1 year ago

ISSUE:

When I run "/barman-cloud-backup -h 127.0.0.1 -p 5433 --immediate-checkpoint --cloud-provider aws-s3 -P default --read-timeout 600 --endpoint-url endpoint s3://path/backup pg12" to back up a cluster, the backup process hangs for a long time.


Some information:

the "barman-cloud-backup" process started at 11:01 CST, and at 11:05 CST, Minio reported a "disk path full" error. After that, the barman process has been blocked. The Minio log shows that the disk space was full during the backup process, and there were temporary files remaining in Minio from the corresponding time. (Due to a timezone setting issue, Minio and the Postgresql pod have an 8-hour time difference.)


Using strace -p <pid> to inspect the processes, I found two barman-cloud-backup processes in the following state:

futex(0x7ff938000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY


Environment:

PostgreSQL on Kubernetes (about 47 GB data volume)

postgres=# select version();
                                              version
----------------------------------------------------------------------------------------------------
 PostgreSQL 12.12 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

# du -sh /var/lib/postgresql/data/pg_data/
47G     /var/lib/postgresql/data/pg_data

minio:

/ # minio version
Version: 2019-08-07T01:59:21Z
Release-Tag: RELEASE.2019-08-07T01-59-21Z
Commit-ID: 930943f058f01f37cfbc2265d5f80ea7026ec55d

barman-cloud-backup:

/# barman-cloud-backup -V
barman-cloud-backup 3.3.0

Can someone take a look and give me some guidance?

Thanks.

mikewallace1979 commented 1 year ago

Hi @yanboer - thanks for reporting the issue.

The fact that barman is hanging instead of exiting when minio is unable to accept further writes looks like a bug in barman - we will do some further investigation and see if we can improve the behaviour in this scenario.

yanboer commented 1 year ago

> Hi @yanboer - thanks for reporting the issue.
>
> The fact that barman is hanging instead of exiting when minio is unable to accept further writes looks like a bug in barman - we will do some further investigation and see if we can improve the behaviour in this scenario.

Please let me know if there is anything I can do to assist in the investigation.

Thank you again, and have a great day!

mikewallace1979 commented 1 year ago

I have reproduced a disk full scenario using minio version 2022-09-17T00:09:45Z by creating a 2GB partition on my local disk in /Volumes/Minio and running the container as follows:

docker run -d \
   -p 9000:9000 \
   -p 9090:9090 \
   --name minio-small \
   -v /Volumes/Minio:/data \
   minio/minio:RELEASE.2022-09-17T00-09-45Z server /data --console-address ":9090"

I get the following behaviour from the barman-cloud-backup command when backing up a PostgreSQL instance with over 2GB of data:

$ barman-cloud-backup s3://test-bucket main --endpoint-url=http://minio-small:9000
2023-04-26 15:27:46,045 [7255] ERROR: Upload error: An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 4): We encountered an internal error, please try again.: cause(write /data/.minio.sys/tmp/3bf0b482-f063-4b82-9acd-1d037f843e00/part.2: input/output error) (worker 1)
2023-04-26 15:27:46,064 [7248] ERROR: Error received from upload worker: An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 4): We encountered an internal error, please try again.: cause(write /data/.minio.sys/tmp/3bf0b482-f063-4b82-9acd-1d037f843e00/part.2: input/output error)
2023-04-26 15:27:46,077 [7248] ERROR: Backup failed uploading data (An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 4): We encountered an internal error, please try again.: cause(write /data/.minio.sys/tmp/3bf0b482-f063-4b82-9acd-1d037f843e00/part.2: input/output error))
$ echo $?
1

In the minio logs I can see it returns HTTP 500 internal server errors and logs an input/output error:

API: PutObjectPart
Time: Wed Apr 26 2023 15:27:46 GMT+0100 (British Summer Time)
DeploymentID: e0da9104-ca18-440c-9722-cbd22e7c950b
RequestID: 175985A1D2C65247
RemoteHost: 172.27.0.2
UserAgent: Boto3/1.9.253 Python/3.8.10 Linux/5.15.49-linuxkit Botocore/1.29.77 Resource
Error: write /data/.minio.sys/tmp/3bf0b482-f063-4b82-9acd-1d037f843e00/part.2: input/output error (*fs.PathError)

Backtrace: 
0: internal/logger/logger.go:259:logger.LogIf()
1: cmd/api-errors.go:2164:cmd.toAPIErrorCode()
2: cmd/api-errors.go:2189:cmd.toAPIError()
3: cmd/object-multipart-handlers.go:486:cmd.objectAPIHandlers.PutObjectPartHandler()
4: net/http/server.go:2084:http.HandlerFunc.ServeHTTP()

I tried again using the exact same minio version as you:

$ docker run -d \
   -p 9000:9000 \
   -p 9090:9090 \
   --name minio-small \
   -v /Volumes/Minio:/data \
   -e MINIO_ACCESS_KEY=minioadmin \
   -e MINIO_SECRET_KEY=minioadmin \
   minio/minio:RELEASE.2019-08-07T01-59-21Z server /data

The process still terminates (though with a different error this time, due to the older minio version):

$ barman-cloud-backup s3://test-bucket main --endpoint-url=http://minio-small:9000
2023-04-26 15:46:12,741 [7375] ERROR: Upload error: Connection was closed before we received a valid response from endpoint URL: "http://minio-small:9000/test-bucket/main/base/20230426T154601/data.tar?uploadId=342ffcef-11a2-4e5f-92fe-7265776a7bc0&partNumber=2". (worker 0)
2023-04-26 15:46:12,766 [7369] ERROR: Error received from upload worker: Connection was closed before we received a valid response from endpoint URL: "http://minio-small:9000/test-bucket/main/base/20230426T154601/data.tar?uploadId=342ffcef-11a2-4e5f-92fe-7265776a7bc0&partNumber=2".
2023-04-26 15:46:12,779 [7369] ERROR: Backup failed uploading data (Connection was closed before we received a valid response from endpoint URL: "http://minio-small:9000/test-bucket/main/base/20230426T154601/data.tar?uploadId=342ffcef-11a2-4e5f-92fe-7265776a7bc0&partNumber=2".)
2023-04-26 15:46:13,791 [7369] ERROR: Backup failed uploading backup.info file (An error occurred (XMinioStorageFull) when calling the PutObject operation: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.)

If I fill up the /data partition on minio so that it is already beyond its minimum free disk threshold, then barman-cloud-backup will fail more catastrophically:

$ barman-cloud-backup s3://test-bucket main --endpoint-url=http://minio-small:9000
2023-04-26 15:57:46,495 [7473] ERROR: Backup failed uploading data (An error occurred (InternalError) when calling the CreateMultipartUpload operation (reached max retries: 4): We encountered an internal error, please try again.)
2023-04-26 15:57:56,741 [7473] ERROR: Backup failed uploading backup.info file (An error occurred (InternalError) when calling the PutObject operation (reached max retries: 4): We encountered an internal error, please try again.)
Exception ignored in: <function _Stream.__del__ at 0x7f8b7d8a73a0>
Traceback (most recent call last):
  File "/usr/lib/python3.8/tarfile.py", line 410, in __del__
    self.close()
  File "/usr/lib/python3.8/tarfile.py", line 460, in close
    self.fileobj.write(self.buf)
  File "/barman/barman/cloud.py", line 208, in write
    self.flush()
  File "/barman/barman/cloud.py", line 225, in flush
    self.upload_metadata = self.cloud_interface.create_multipart_upload(
  File "/barman/barman/cloud_providers/aws_s3.py", line 335, in create_multipart_upload
    return self.s3.meta.client.create_multipart_upload(
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 943, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 966, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/usr/local/lib/python3.8/dist-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.8/dist-packages/botocore/signers.py", line 149, in sign
    signature_version = self._choose_signer(
  File "/usr/local/lib/python3.8/dist-packages/botocore/signers.py", line 219, in _choose_signer
    handler, response = self._event_emitter.emit_until_response(
ReferenceError: weakly-referenced object no longer exists

However it still exits and returns a nonzero exit status - I have not been able to reproduce the issue you are seeing where barman-cloud-backup hangs indefinitely.

I wonder if perhaps the boto3 version you are using is a factor - could you run pip freeze | grep boto3 and share the output?

yanboer commented 1 year ago

Thank you for your reply. The Python 3 and boto3 version information for my environment is as follows:

# python3
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import boto3
>>>
>>> print(boto3.__version__)
1.26.97
>>>

To reduce the size of the built image, I removed pip via apt-get purge -y python3-pip after installing the necessary software, so I could not run pip freeze directly.

yanboer commented 1 year ago

Steps to reproduce

1) minio

Mount a 50 GB disk at /data:

mkfs -t ext4 /dev/vdc
mkdir /data
mount /dev/vdc /data

Run minio:

docker run -d \
   -p 9000:9000 \
   -p 9090:9090 \
   --name minio \
   -v /data:/data \
   -e MINIO_ACCESS_KEY=minioaccesskey \
   -e MINIO_SECRET_KEY=miniosecretkey \
   minio/minio:RELEASE.2019-08-07T01-59-21Z server /data

2) postgresql & barman

# 1. start postgresql
docker run -itd -h monitor --name monitor -e PG_MODE=monitor -e AUTOCTL_NODE_PASSWORD=password -e EXTERNAL_HOSTNAME=monitor radondb/radondb-postgresql:12.12-v1.2.0 auto_failover

# 2. enter container and create database
docker exec -it monitor bash
psql -U postgres -p 55555 -c "create database test;"

# 3. generate data
pgbench -i -s 2000 -p 55555 -U postgres test

# du -sh /var/lib/postgresql/data/pg_data/
31G     /var/lib/postgresql/data/pg_data/

# 4. generate s3 profile
echo -e "minioaccesskey\nminiosecretkey\nNone\njson\n" | aws configure --profile default > /dev/null

# 5. backup to minio, 172.17.0.7 is minio container ip
barman-cloud-backup -h 127.0.0.1 -p 55555 --immediate-checkpoint --cloud-provider aws-s3 -P default --read-timeout 600 --endpoint-url http://172.17.0.7:9000 s3://pg-backup pg-12

3) Result

minio: (screenshot attached)

barman: (screenshots attached)

yanboer commented 1 year ago

> I have reproduced a disk full scenario using minio version 2022-09-17T00:09:45Z by creating a 2GB partition on my local disk in /Volumes/Minio [...] I have not been able to reproduce the issue you are seeing where barman-cloud-backup hangs indefinitely. I wonder if perhaps the boto3 version you are using is a factor - could you run pip freeze | grep boto3 and share the output?

I think you should consider increasing the minio disk size as well as the postgresql data volume.

mikewallace1979 commented 1 year ago

> I think you should consider increasing the minio disk size as well as the postgresql data volume.

Thanks for the advice - this worked perfectly and I can now reproduce the issue.

Manually terminating the parent process shows exactly where the processes are stuck:

...
2023-04-27 08:29:55,349 [8104] ERROR: Upload error: Connection was closed before we received a valid response from endpoint URL: "http://minio-medium:9000/test-bucket/main/base/20230427T082428/data.tar?uploadId=3eab6ac7-f3eb-4ed0-a81d-5ab98dff6ec4&partNumber=1142". (worker 1)
2023-04-27 08:29:55,351 [8104] INFO: Uploading 'main/base/20230427T082428/data.tar', part '1144' (worker 1)
2023-04-27 08:29:55,395 [8092] ERROR: Error received from upload worker: Connection was closed before we received a valid response from endpoint URL: "http://minio-medium:9000/test-bucket/main/base/20230427T082428/data.tar?uploadId=3eab6ac7-f3eb-4ed0-a81d-5ab98dff6ec4&partNumber=1142".
2023-04-27 08:29:55,396 [8103] INFO: Got abort request: upload cancelled (worker 0)
2023-04-27 08:29:55,396 [8103] INFO: Skipping 'main/base/20230427T082428/data.tar', part '1145' (worker 0)
2023-04-27 08:29:55,397 [8104] INFO: Got abort request: upload cancelled (worker 1)
2023-04-27 08:29:55,397 [8104] INFO: Skipping 'main/base/20230427T082428/data.tar', part '1146' (worker 1)
2023-04-27 08:29:55,402 [8103] INFO: Upload process stopped (worker 0)
2023-04-27 08:29:55,403 [8104] INFO: Upload process stopped (worker 1)

^CProcess Process-1:
Process Process-2:
2023-04-27 08:12:50,116 [7918] ERROR: Backup failed uploading data (KeyboardInterrupt)
2023-04-27 08:12:50,117 [7918] INFO: Uploading 'main/base/20230427T080304/backup.info'
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 360, in _exit_function
    _run_finalizers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 195, in _finalize_join
    thread.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 360, in _exit_function
    _run_finalizers()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.8/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 195, in _finalize_join
    thread.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
2023-04-27 08:12:51,132 [7918] ERROR: Backup failed uploading backup.info file (An error occurred (XMinioStorageFull) when calling the PutObject operation: Storage backend has reached its minimum free disk threshold. Please delete a few objects to proceed.)

The Python docs for multiprocessing state that deadlocks are possible when joining a process if items it has put on a Queue have not all been consumed, specifically:

> This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.

In this case, we know the worker function has completed for both workers because logging the message Upload process stopped is the last thing they do before terminating. It's therefore possible that deadlock occurs when the parent joins the child processes during the termination process.
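
As an illustration (a minimal sketch of the pattern described in the Python docs, not barman's actual code), a parent that joins a child before consuming a sufficiently large item the child has put on a multiprocessing.Queue hangs in exactly this way:

from multiprocessing import Process, Queue

def worker(queue):
    # A payload large enough that the child's feeder thread cannot flush it
    # into the underlying pipe until someone reads from the queue.
    queue.put("X" * 1_000_000)

if __name__ == "__main__":
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    child.join()        # deadlocks: the queued item has not been consumed
    print(queue.get())  # never reached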

Adding some extra logging at L705 shows this is indeed the case:

2023-04-27 10:37:16,141 [8421] ERROR: Error received from upload worker: Connection was closed before we received a valid response from endpoint URL: "http://minio-medium:9000/test-bucket/main/base/20230427T103130/data.tar?uploadId=f7c25cc7-80c9-44fe-be9e-993c4c412046&partNumber=1142".
2023-04-27 10:37:16,141 [8421] INFO: Closing...
2023-04-27 10:37:16,141 [8427] INFO: Got abort request: upload cancelled (worker 0)
2023-04-27 10:37:16,141 [8427] INFO: Skipping 'main/base/20230427T103130/data.tar', part '1144' (worker 0)
2023-04-27 10:37:16,143 [8428] INFO: Got abort request: upload cancelled (worker 1)
2023-04-27 10:37:16,144 [8428] INFO: Skipping 'main/base/20230427T103130/data.tar', part '1145' (worker 1)
2023-04-27 10:37:16,144 [8421] INFO: Joining... <Process name='Process-1' pid=8427 parent=8421 started>
2023-04-27 10:37:16,147 [8427] INFO: Upload process stopped (worker 0)
2023-04-27 10:37:16,151 [8428] INFO: Upload process stopped (worker 1)
...

We can see the parent process joins the worker with pid 8427 and then hangs indefinitely.

Given the worker processes can add items to one of three queues, and given the error handling mechanism causes the parent to join the workers without draining those queues, this now seems quite unsurprising behaviour.

There are a couple of possible solutions here:

  1. Following the advice in the Python docs and using a multiprocessing.Manager to create the queues. Local experiments using this patch show it does indeed resolve the issue.
  2. Ensuring that all queues are drained before joining the child processes so that there are no unconsumed items left on the queues when terminating. I've tested locally with this patch and it also resolves the problem.

We may in fact want to do both of these things.
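
For illustration only (a rough sketch of the two ideas, not the actual patches referenced above; the worker and drain helpers are hypothetical), the difference between using a manager-backed queue and draining a plain queue before joining looks roughly like this:

from multiprocessing import Manager, Process, Queue
import queue as stdlib_queue

def worker(q):
    q.put("X" * 1_000_000)  # large unconsumed item, as in the deadlock sketch above

def drain(q):
    # Option 2: consume whatever is left on a plain multiprocessing.Queue
    # so the child's feeder thread can flush and join() cannot hang.
    try:
        while True:
            q.get(timeout=1)
    except stdlib_queue.Empty:
        pass

if __name__ == "__main__":
    # Option 1: a manager-backed queue lives in a separate server process,
    # so the child has no buffered feeder thread blocking its exit.
    with Manager() as manager:
        managed_q = manager.Queue()
        child = Process(target=worker, args=(managed_q,))
        child.start()
        child.join()              # no deadlock with the managed queue

    # Option 2: with a plain Queue, drain it before joining.
    plain_q = Queue()
    child = Process(target=worker, args=(plain_q,))
    child.start()
    drain(plain_q)                # empty the queue first...
    child.join()                  # ...then the join cannot hang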

yanboer commented 1 year ago

Hello, I am not very familiar with Barman's code, so I cannot offer any specific suggestions.

May I ask whether the Barman team has a rough timeline for fixing this? I understand that a fix takes time and effort; I am just curious and would appreciate a rough time reference when convenient. If no estimate can be given at present, I fully understand.

Thank you to the Barman team for their dedication to this project. I look forward to a more stable and powerful Barman in the future.

mikewallace1979 commented 1 year ago

I expect we can get this fix into the next release which is currently planned for June 2023.

yanboer commented 1 year ago

Thank you for your response. I hope you have a wonderful day!