elemental-lf / benji

Benji Backup: A block based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
https://benji-backup.me
Other
136 stars 43 forks source link

NBD disconnects with 4 TiB volume #75

Closed bbs2web closed 4 years ago

bbs2web commented 4 years ago

Hi,

I'm super happy with Benji, it's really fulfilled every aspect of our requirements but I'm trying to track down whether or not the following issue lies with Benji itself or NBD/kernel upstream.

We appear to be able to NBD mount every image, except for a 4 TiB file server volume.

My reference notes:

Benji backup host (NBD Server):

. /usr/local/benji/bin/activate;
benji ls | grep -e \+ -e volume -e customer-host-disk1;
benji nbd -r -a 0.0.0.0;

Client:

modprobe nbd;
nbd-client 192.168.1.61 -p 10809 -b 512 -t 10 -N lair-unix01-disk1-j5peds /dev/nbd4;
parted /dev/nbd4;

Server accepts connection but the connection very shortly thereafter starts timing out: image

Client connects but doesn't work: image

NBD client logs the following in /var/log/syslog:

Apr  9 17:21:19 kvm1b kernel: [240782.759434] block nbd4: Connection timed out
Apr  9 17:21:19 kvm1b kernel: [240782.760485] block nbd4: shutting down sockets
Apr  9 17:21:19 kvm1b kernel: [240782.760498] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.761390] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.762309] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.763230] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.764190] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.765163] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.766079] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.766970] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.767884] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.768795] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.769699] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.770621] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.771557] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.772714] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.773708] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:19 kvm1b kernel: [240782.774663] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.775586] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.776504] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.777432] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.778353] blk_update_request: I/O error, dev nbd4, sector 24 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr  9 17:21:19 kvm1b kernel: [240782.779268] Buffer I/O error on dev nbd4, logical block 3, async page read
Apr  9 17:21:19 kvm1b kernel: [240782.780215]  nbd4: unable to read partition table
Apr  9 17:21:19 kvm1b kernel: [240782.780616] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:19 kvm1b kernel: [240782.781733]  nbd4: unable to read partition table
Apr  9 17:21:19 kvm1b kernel: [240782.784036] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:19 kvm1b kernel: [240782.785030]  nbd4: unable to read partition table
Apr  9 17:21:19 kvm1b kernel: [240782.785263] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:19 kvm1b kernel: [240782.786221]  nbd4: unable to read partition table
Apr  9 17:21:19 kvm1b kernel: [240783.151622] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:19 kvm1b kernel: [240783.152863]  nbd4: unable to read partition table
Apr  9 17:21:22 kvm1b kernel: [240785.472479] Dev nbd4: unable to read RDB block 0
Apr  9 17:21:22 kvm1b kernel: [240785.473640]  nbd4: unable to read partition table
Apr  9 17:21:24 kvm1b kernel: [240787.797318] blk_update_request: I/O error, dev nbd4, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Apr  9 17:21:37 kvm1b kernel: [240800.399318] block nbd4: NBD_DISCONNECT
Apr  9 17:21:37 kvm1b kernel: [240800.399324] block nbd4: Send disconnect failed -32

Host is Proxmox, essentially Debian 10 (buster) with Ceph Nautilus 14.2.8 and Python 3.7.3. I had this on Benji 0.7.1 and now 0.8.0, after upgrading today.

After upgrading I purposefully interrupted an incremental backup, so that it would subsequently rescan the entire source drive. Once that backup was complete I scrubbed 100% of the resulting volume and ran a further incremental. The backup volume has a valid state.

pip list:

(benji) [admin@kvm1b ~]# pip list
Package            Version
------------------ ----------
alembic            1.4.2
argcomplete        1.11.1
benji              0.8.0
blinker            1.4
boto3              1.12.38
botocore           1.15.38
cachetools         4.1.0
Cerberus           1.3.2
certifi            2020.4.5.1
cffi               1.14.0
chardet            3.0.4
colorama           0.4.3
dateparser         0.7.4
diskcache          4.1.0
docutils           0.15.2
google-auth        1.13.1
idna               2.9
importlib-metadata 1.6.0
jmespath           0.9.5
kubernetes         10.1.0
Mako               1.1.2
MarkupSafe         1.1.1
oauthlib           3.1.0
pip                18.1
pkg-resources      0.0.0
prettytable        0.7.2
prometheus-client  0.7.1
psycopg2-binary    2.8.5
pyasn1             0.4.8
pyasn1-modules     0.2.8
pycparser          2.20
pycryptodome       3.9.7
pyparsing          2.4.7
python-dateutil    2.8.1
python-editor      1.0.4
pytz               2019.3
PyYAML             3.13
regex              2020.4.4
requests           2.23.0
requests-oauthlib  1.3.0
rsa                4.0
ruamel.yaml        0.15.100
s3transfer         0.3.3
semantic-version   2.8.4
setproctitle       1.1.10
setuptools         40.8.0
six                1.14.0
sortedcontainers   2.1.0
sparsebitfield     0.2.3
SQLAlchemy         1.3.16
structlog          20.1.0
tzlocal            2.0.0
urllib3            1.25.8
websocket-client   0.57.0
zipp               3.1.0
zstandard          0.13.0
bbs2web commented 4 years ago

Forgot to mention that the database schema change completed without warning and that we successfully ran an incremental after upgrading.

timeline:

elemental-lf commented 4 years ago

First of all, thanks for the kind feedback! Regarding your problem my gut-feeling would be that Benji is at fault but it's hard to say without more information. Unfortunately you've side stepped the issue template a little bit so your report is missing the debugging output from Benji. See https://github.com/elemental-lf/benji/blob/master/.github/ISSUE_TEMPLATE/bug_report.md for details on how to enable debugging output.

BTW you can directly filter the output of benji ls without needing to resort to grep. For example benji ls 'volume == "customer-host-disk1"' but more complex expressions using and, or and not boolean operators are also supported.

bbs2web commented 4 years ago

PS: Apologies, I never pressed enter in Github to post the following...

Looks like the read request does finally get processed by benji, almost a minute after the client disconnected...

Herewith debug output when running 'benji --log-level DEBUG nbd -r':

(benji) [admin@kvm1b ~]# benji --log-level DEBUG nbd -r
    INFO: $ /usr/local/benji/bin/benji --log-level DEBUG nbd -r
   DEBUG: commands.nbd(**{'bind_address': '127.0.0.1', 'bind_port': 10809, 'read_only': True})
   DEBUG: Using block hash BLAKE2b with kwargs {'digest_bits': 256}.
   DEBUG: Resolved schema for benji.io.file-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}}}}.
   DEBUG: Resolved schema for benji.io.rbd-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'cephConfigFile': {'type': 'string', 'empty': False, 'default': '/etc/ceph/ceph.conf'}, 'clientIdentifier': {'type': 'string', 'empty': False, 'default': 'admin'}, 'newImageFeatures': {'type': 'list', 'empty': False, 'default': ['RBD_FEATURE_LAYERING'], 'schema': {'type': 'string', 'regex': '^RBD_FEATURE_.*'}}}}}.
   DEBUG: Resolved schema for benji.io.rbdaio-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'cephConfigFile': {'type': 'string', 'empty': False, 'default': '/etc/ceph/ceph.conf'}, 'clientIdentifier': {'type': 'string', 'empty': False, 'default': 'admin'}, 'newImageFeatures': {'type': 'list', 'empty': False, 'default': ['RBD_FEATURE_LAYERING'], 'schema': {'type': 'string', 'regex': '^RBD_FEATURE_.*'}}}}}.
   DEBUG: Current database schema revision: 3d014d45493f.
   DEBUG: Expected database schema revision: 3d014d45493f.
   DEBUG: Resolved schema for benji.transform.zstd-v1: {'configuration': {'schema': {'level': {'type': 'integer', 'empty': False, 'default': 3, 'min': 1}, 'dictDataFile': {'type': 'string', 'empty': False}}}}.
   DEBUG: Resolved schema for benji.transform.aes_256_gcm-v1: {'configuration': {'required': True, 'schema': {'masterKey': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['kdfSalt', 'kdfIterations', 'password']}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['masterKey'], 'dependencies': ['kdfIterations', 'password']}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000, 'excludes': ['masterKey'], 'dependencies': ['kdfSalt', 'password']}, 'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8, 'excludes': ['masterKey'], 'dependencies': ['kdfSalt', 'kdfIterations']}}}}.
   DEBUG: Resolved schema for benji.storage.base-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}}}}.
   DEBUG: Resolved schema for benji.storage.base.ReadCache-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}, 'readCache': {'type': 'dict', 'empty': False, 'schema': {'directory': {'type': 'string', 'required': True, 'empty': False, 'dependencies': ['maximumSize', 'shards']}, 'maximumSize': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'shards']}, 'shards': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'maximumSize']}}}}, 'required': True}}.
   DEBUG: Resolved schema for benji.storage.s3-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}, 'readCache': {'type': 'dict', 'empty': False, 'schema': {'directory': {'type': 'string', 'required': True, 'empty': False, 'dependencies': ['maximumSize', 'shards']}, 'maximumSize': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'shards']}, 'shards': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'maximumSize']}}}, 'awsAccessKeyId': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsAccessKeyIdFile']}, 'awsAccessKeyIdFile': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsAccessKeyId']}, 'awsSecretAccessKey': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsSecretAccessKeyFile']}, 'awsSecretAccessKeyFile': {'type': 'string', 'required': True, 'empty': False, 'excludes': ['awsSecretAccessKey']}, 'regionName': {'type': 'string', 'empty': False}, 'endpointUrl': {'type': 'string', 'empty': False}, 'useSsl': {'type': 'boolean', 'empty': False, 'default': True}, 'addressingStyle': {'type': 'string', 'empty': False}, 'signatureVersion': {'type': 'string', 'empty': False}, 'bucketName': {'type': 'string', 'empty': False, 'required': True}, 'disableEncodingType': {'type': 'boolean', 'empty': False, 'default': False}}, 'required': True}}.
   DEBUG: Found existing storage radosgw with id 1.
   DEBUG: Using selector: EpollSelector
    INFO: Starting to serve NBD on 127.0.0.1:10809
    INFO: Incoming connection from 127.0.0.1:51374.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [127.0.0.1:51374]: opt=7, length=30, data=b'\x00\x00\x00\x18lair-unix01-disk1-hu5405\x00\x00'
   DEBUG: [127.0.0.1:51374]: opt=1, length=24, data=b'lair-unix01-disk1-hu5405'
    INFO: [127.0.0.1:51374] Negotiated export: lair-unix01-disk1-hu5405.
    INFO: [127.0.0.1:51374] Version lair-unix01-disk1-hu5405 has been opened.
    INFO: [127.0.0.1:51374] Export is read only.
   DEBUG: [127.0.0.1:51374]: cmd=0, cmd_flags=0, handle=6917529027657859072, offset=0, len=4096

going to run partprobe next

^C^C^C^C^C^C^C   DEBUG: Reading block lair-unix01-disk1-hu5405/0 0:4096.
   DEBUG: Initializing S3 session and resource for MainThread
    INFO: Active transforms for storage radosgw: zstd (zstd), aes_256_gcm (aes_256_gcm).
   DEBUG: Starting new HTTPS connection (1): backup1.lair.co.za:7480
   DEBUG: https://backup1.lair.co.za:7480 "GET /benji-kvm1/blocks/9d/71/0000000000000a5b-0000000000000001 HTTP/1.1" 200 527
   DEBUG: https://backup1.lair.co.za:7480 "GET /benji-kvm1/blocks/9d/71/0000000000000a5b-0000000000000001.meta HTTP/1.1" 200 469
   DEBUG: MainThread read data of uid a5b-1 in 0.411s
   ERROR: [127.0.0.1:51374] Connection lost
^C   DEBUG: KeyboardInterrupt exception occurred.

Client:

[admin@kvm1b ~]# nbd-client 127.0.0.1 -p 10809 -b 512 -t 10 -N lair-unix01-disk1-hu5405 /dev/nbd4;
Negotiation: ..size = 4194304MB
Connected /dev/nbd4
[admin@kvm1b ~]# partprobe;
Warning: Error fsyncing/closing /dev/nbd4: Input/output error
Warning: Error fsyncing/closing /dev/nbd4: Input/output error
bbs2web commented 4 years ago

Hrm... I don't actually have to do anything beyond running the client. A couple of seconds after connecting the client times out. Presume this is due to benji working with 'len=4096' when it should be 512?

Server:

(benji) [admin@kvm1b ~]# benji --log-level DEBUG nbd -r
    INFO: $ /usr/local/benji/bin/benji --log-level DEBUG nbd -r
   DEBUG: commands.nbd(**{'bind_address': '127.0.0.1', 'bind_port': 10809, 'read_only': True})
   DEBUG: Using block hash BLAKE2b with kwargs {'digest_bits': 256}.
   DEBUG: Resolved schema for benji.io.file-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}}}}.
   DEBUG: Resolved schema for benji.io.rbd-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'cephConfigFile': {'type': 'string', 'empty': False, 'default': '/etc/ceph/ceph.conf'}, 'clientIdentifier': {'type': 'string', 'empty': False, 'default': 'admin'}, 'newImageFeatures': {'type': 'list', 'empty': False, 'default': ['RBD_FEATURE_LAYERING'], 'schema': {'type': 'string', 'regex': '^RBD_FEATURE_.*'}}}}}.
   DEBUG: Resolved schema for benji.io.rbdaio-v1: {'configuration': {'type': 'dict', 'required': True, 'empty': False, 'schema': {'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'cephConfigFile': {'type': 'string', 'empty': False, 'default': '/etc/ceph/ceph.conf'}, 'clientIdentifier': {'type': 'string', 'empty': False, 'default': 'admin'}, 'newImageFeatures': {'type': 'list', 'empty': False, 'default': ['RBD_FEATURE_LAYERING'], 'schema': {'type': 'string', 'regex': '^RBD_FEATURE_.*'}}}}}.
   DEBUG: Current database schema revision: 3d014d45493f.
   DEBUG: Expected database schema revision: 3d014d45493f.
   DEBUG: Resolved schema for benji.transform.zstd-v1: {'configuration': {'schema': {'level': {'type': 'integer', 'empty': False, 'default': 3, 'min': 1}, 'dictDataFile': {'type': 'string', 'empty': False}}}}.
   DEBUG: Resolved schema for benji.transform.aes_256_gcm-v1: {'configuration': {'required': True, 'schema': {'masterKey': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['kdfSalt', 'kdfIterations', 'password']}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['masterKey'], 'dependencies': ['kdfIterations', 'password']}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000, 'excludes': ['masterKey'], 'dependencies': ['kdfSalt', 'password']}, 'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8, 'excludes': ['masterKey'], 'dependencies': ['kdfSalt', 'kdfIterations']}}}}.
   DEBUG: Resolved schema for benji.storage.base-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}}}}.
   DEBUG: Resolved schema for benji.storage.base.ReadCache-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}, 'readCache': {'type': 'dict', 'empty': False, 'schema': {'directory': {'type': 'string', 'required': True, 'empty': False, 'dependencies': ['maximumSize', 'shards']}, 'maximumSize': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'shards']}, 'shards': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'maximumSize']}}}}, 'required': True}}.
   DEBUG: Resolved schema for benji.storage.s3-v1: {'configuration': {'type': 'dict', 'empty': False, 'schema': {'activeTransforms': {'type': 'list', 'empty': False, 'schema': {'type': 'string', 'empty': False}}, 'simultaneousWrites': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousReads': {'type': 'integer', 'empty': False, 'min': 1, 'default': 3}, 'simultaneousRemovals': {'type': 'integer', 'empty': False, 'min': 1, 'default': 5}, 'bandwidthRead': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'bandwidthWrite': {'type': 'integer', 'empty': False, 'min': 0, 'default': 0}, 'consistencyCheckWrites': {'type': 'boolean', 'empty': False, 'default': False}, 'hmac': {'type': 'dict', 'empty': False, 'schema': {'password': {'type': 'string', 'empty': False, 'required': True, 'minlength': 8}, 'kdfSalt': {'type': 'string', 'empty': False, 'required': True}, 'kdfIterations': {'type': 'integer', 'empty': False, 'required': True, 'min': 1000}}}, 'readCache': {'type': 'dict', 'empty': False, 'schema': {'directory': {'type': 'string', 'required': True, 'empty': False, 'dependencies': ['maximumSize', 'shards']}, 'maximumSize': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'shards']}, 'shards': {'type': 'integer', 'required': True, 'min': 1, 'dependencies': ['directory', 'maximumSize']}}}, 'awsAccessKeyId': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsAccessKeyIdFile']}, 'awsAccessKeyIdFile': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsAccessKeyId']}, 'awsSecretAccessKey': {'type': 'string', 'empty': False, 'required': True, 'excludes': ['awsSecretAccessKeyFile']}, 'awsSecretAccessKeyFile': {'type': 'string', 'required': True, 'empty': False, 'excludes': ['awsSecretAccessKey']}, 'regionName': {'type': 'string', 'empty': False}, 'endpointUrl': {'type': 'string', 'empty': False}, 'useSsl': {'type': 'boolean', 'empty': False, 'default': True}, 'addressingStyle': {'type': 'string', 'empty': False}, 'signatureVersion': {'type': 'string', 'empty': False}, 'bucketName': {'type': 'string', 'empty': False, 'required': True}, 'disableEncodingType': {'type': 'boolean', 'empty': False, 'default': False}}, 'required': True}}.
   DEBUG: Found existing storage radosgw with id 1.
   DEBUG: Using selector: EpollSelector
    INFO: Starting to serve NBD on 127.0.0.1:10809
    INFO: Incoming connection from 127.0.0.1:54074.
   DEBUG: Client requested NBD_FLAG_NO_ZEROES.
   DEBUG: [127.0.0.1:54074]: opt=7, length=30, data=b'\x00\x00\x00\x18lair-unix01-disk1-hu5405\x00\x00'
   DEBUG: [127.0.0.1:54074]: opt=1, length=24, data=b'lair-unix01-disk1-hu5405'
    INFO: [127.0.0.1:54074] Negotiated export: lair-unix01-disk1-hu5405.
    INFO: [127.0.0.1:54074] Version lair-unix01-disk1-hu5405 has been opened.
    INFO: [127.0.0.1:54074] Export is read only.
   DEBUG: [127.0.0.1:54074]: cmd=0, cmd_flags=0, handle=2305843009230471168, offset=0, len=4096
<client times out here, about 1-2 minutes later the server appear to read sectors 0-4096:>
   DEBUG: Reading block lair-unix01-disk1-hu5405/0 0:4096.
   DEBUG: Initializing S3 session and resource for MainThread
    INFO: Active transforms for storage radosgw: zstd (zstd), aes_256_gcm (aes_256_gcm).
   DEBUG: Starting new HTTPS connection (1): backup1.lair.co.za:7480
   DEBUG: https://backup1.lair.co.za:7480 "GET /benji-kvm1/blocks/9d/71/0000000000000a5b-0000000000000001 HTTP/1.1" 200 527
   DEBUG: https://backup1.lair.co.za:7480 "GET /benji-kvm1/blocks/9d/71/0000000000000a5b-0000000000000001.meta HTTP/1.1" 200 469
   DEBUG: MainThread read data of uid a5b-1 in 0.101s
   ERROR: [127.0.0.1:54074] Connection lost
^C^C^C^C^C^C^C^C^C^C   DEBUG: KeyboardInterrupt exception occurred.
Traceback (most recent call last):
  File "/usr/local/benji/lib/python3.7/site-packages/benji/scripts/benji.py", line 340, in main
    func(**func_args)
  File "/usr/local/benji/lib/python3.7/site-packages/benji/commands.py", line 439, in nbd
    benji_obj.close()
  File "/usr/local/benji/lib/python3.7/site-packages/benji/benji.py", line 1070, in close
    self._database_backend.close()
  File "/usr/local/benji/lib/python3.7/site-packages/benji/database.py", line 1241, in close
    self._session.commit()
  File "/usr/local/benji/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1036, in commit
    self.transaction.commit()
  File "/usr/local/benji/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 513, in commit
    self._remove_snapshot()
  File "/usr/local/benji/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 394, in _remove_snapshot
    for s in self.session.identity_map.all_states():
  File "/usr/local/benji/lib/python3.7/site-packages/sqlalchemy/orm/identity.py", line 214, in all_states
    return list(self._dict.values())
KeyboardInterrupt
   ERROR: KeyboardInterrupt exception occurred.

Client:

[admin@kvm1b ~]# modprobe nbd
[admin@kvm1b ~]# nbd-client 127.0.0.1 -p 10809 -b 512 -t 10 -N lair-unix01-disk1-hu5405 /dev/nbd4;
Negotiation: ..size = 4194304MB
Connected /dev/nbd4

Client /var/log/syslog:

Apr 14 15:13:13 kvm1b kernel: [665092.473920] block nbd4: Connection timed out
Apr 14 15:13:13 kvm1b kernel: [665092.475019] block nbd4: shutting down sockets
Apr 14 15:13:13 kvm1b kernel: [665092.475030] print_req_error: 42 callbacks suppressed
Apr 14 15:13:13 kvm1b kernel: [665092.475032] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.475922] buffer_io_error: 40 callbacks suppressed
Apr 14 15:13:13 kvm1b kernel: [665092.475923] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.476832] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.477734] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.478674] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.479572] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.480454] ldm_validate_partition_table(): Disk read failed.
Apr 14 15:13:13 kvm1b kernel: [665092.480467] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.481359] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.482284] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.483193] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.484087] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.484983] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.485901] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.486810] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.487702] Dev nbd4: unable to read RDB block 0
Apr 14 15:13:13 kvm1b kernel: [665092.488635] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.489520] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.490430] blk_update_request: I/O error, dev nbd4, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.491330] Buffer I/O error on dev nbd4, logical block 0, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.492291] blk_update_request: I/O error, dev nbd4, sector 24 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 14 15:13:13 kvm1b kernel: [665092.493212] Buffer I/O error on dev nbd4, logical block 3, async page read
Apr 14 15:13:13 kvm1b kernel: [665092.494274]  nbd4: unable to read partition table
Apr 14 15:13:13 kvm1b kernel: [665092.494712] ldm_validate_partition_table(): Disk read failed.
Apr 14 15:13:13 kvm1b kernel: [665092.494887] Dev nbd4: unable to read RDB block 0
Apr 14 15:13:13 kvm1b kernel: [665092.496062]  nbd4: unable to read partition table
Apr 14 15:13:13 kvm1b kernel: [665092.498706] ldm_validate_partition_table(): Disk read failed.
Apr 14 15:13:13 kvm1b kernel: [665092.498750] Dev nbd4: unable to read RDB block 0
Apr 14 15:13:13 kvm1b kernel: [665092.499753]  nbd4: unable to read partition table
Apr 14 15:13:13 kvm1b kernel: [665092.500065] ldm_validate_partition_table(): Disk read failed.
Apr 14 15:13:13 kvm1b kernel: [665092.500104] Dev nbd4: unable to read RDB block 0
Apr 14 15:13:13 kvm1b kernel: [665092.501101]  nbd4: unable to read partition table
bbs2web commented 4 years ago

Ahh.... So I presume benji simply needs time to build the block assignment map for the image. If I increase '-t 10' to '-t 600' the client stays connected long enough for benji to feed it the requested data.

Herewith the entries from the client after I logged in, mapped the volume using nbd-client and it then read the partition table and created /dev/nbd4p1:

Apr 14 15:24:45 kvm1b systemd[1]: Started Session 6998 of user admin.
Apr 14 15:24:57 kvm1b systemd-udevd[784]: nbd4: Worker [1146049] processing SEQNUM=511748 is taking a long time
Apr 14 15:25:23 kvm1b kernel: [665822.722196]  nbd4: p1
elemental-lf commented 4 years ago

Nice find! I hadn't thought about that but it makes a lot of sense. I'll add a note to the documentation.

But I'm still curious why this takes so long. Looking at the source code there should be a single indexed database read between DEBUG: [127.0.0.1:54074]: cmd=0, cmd_flags=0, handle=2305843009230471168, offset=0, len=4096 and DEBUG: Reading block lair-unix01-disk1-hu5405/0 0:4096. and nothing much else.

elemental-lf commented 4 years ago

I've pushed https://github.com/elemental-lf/benji/commit/095fc919dd337b37617c5c432879b38d057cb6e1 which includes a small documentation update. I'm still not sure how the large delay occurs.