ceph / ceph-container

Docker files and images to run Ceph in containers
Apache License 2.0

Centos 7 Docker OSD issue #875

Closed nsweetman closed 4 years ago

nsweetman commented 6 years ago

This is a new dev deployment. The ceph/daemon mon went up perfectly fine.

However, I am having issues with ceph/daemon osd:

docker run -d --pid=host --net=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ --privileged=true -e OSD_FORCE_ZAP=1 -e OSD_DEVICE=/dev/sda ceph/daemon osd

I have tried

dd if=/dev/zero of=/dev/sda bs=4096k count=100

followed by

docker run -d --privileged=true -v /dev/:/dev/ -e OSD_DEVICE=/dev/sda ceph/daemon zap_device

It creates the partitions sda1 thru sda4 and then the container stops.

ceph_osd_log.txt
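
For anyone reproducing this, one way to capture the output of the stopped container is the following (a sketch; the container ID is a placeholder):

# Find the stopped OSD container and dump its output, including stderr
docker ps -a --filter ancestor=ceph/daemon
docker logs <container-id> > ceph_osd_log.txt 2>&1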

leseb commented 6 years ago

2018-01-07 08:28:35.135660 7f159a544e00 -1 bluestore(/var/lib/ceph/tmp/mnt.6dDVRw/block) _check_or_set_bdev_label bdev /var/lib/ceph/tmp/mnt.6dDVRw/block fsid 55448ead-b9c8-4c67-94cd-a4b04c1f07cd does not match our fsid 9d5f7e61-5580-425d-b2fc-0895deee92c8

This looks like a leftover to me. Make sure to wipe the device properly before running the command; try increasing the size of the dd.
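
A more thorough wipe along those lines would look something like this (a sketch, assuming the target is /dev/sda and that sgdisk and wipefs are available on the host; adjust the device name):

# Zero the first 100 MB to clear leftover BlueStore labels and partition metadata
dd if=/dev/zero of=/dev/sda bs=1M count=100 oflag=direct
# Remove any remaining partition table and filesystem signatures
sgdisk --zap-all /dev/sda
wipefs --all /dev/sda
# Then zap through the container as before
docker run -d --privileged=true -v /dev/:/dev/ -e OSD_DEVICE=/dev/sda ceph/daemon zap_device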

nsweetman commented 6 years ago

I thought so too. However, I did a fresh install using Ubuntu 16.04 LTS instead of CentOS 7 and used /dev/sdb and /dev/sdc, which were fresh out-of-the-box drives.

I was finally able to get an OSD container running attached to /dev/sdb after extending the wipe to dd if=/dev/zero of=/dev/sdb bs=1M count=1000 and then zapping it.

However, the second OSD container runs into the same issue with /dev/sdc.

This makes me wonder if the wipe really fixed it the first time.

Should I be removing a failed container startup or just restart it after a fresh wipe?

leseb commented 6 years ago

Yes, you should.

nsweetman commented 6 years ago

I should what? Wipe and rebuild a failed container setup or simply restart?

leseb commented 6 years ago

Sorry, remove the failed container, wipe the disk, and try again.
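
Spelled out, that sequence would look roughly like this (a sketch; the container ID is a placeholder and the device name should match your setup):

# Remove the container from the failed startup
docker rm -f <failed-osd-container-id>
# Wipe the start of the disk, then zap it
dd if=/dev/zero of=/dev/sda bs=1M count=1000
docker run -d --privileged=true -v /dev/:/dev/ -e OSD_DEVICE=/dev/sda ceph/daemon zap_device
# Start the OSD container again
docker run -d --pid=host --net=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ --privileged=true -e OSD_FORCE_ZAP=1 -e OSD_DEVICE=/dev/sda ceph/daemon osd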

nsweetman commented 6 years ago

ceph_osd_log_2.txt

Removed the container, ran dd, ran zap. New error attached.

r0ss3 commented 6 years ago

I have the exact same error using ceph/daemon:latest (so Ubuntu 16.04).

mount: Mounting /dev/sdd1 on /var/lib/ceph/tmp/mnt.dvU2s5 with options noatime,inode64
command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sdd1 /var/lib/ceph/tmp/mnt.dvU2s5
activate: Cluster uuid is e6615470-b2e3-4d21-b4a3-cb94097f0d59
command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
activate: Cluster name is ceph
activate: OSD uuid is 73e1f914-8c5e-4f37-8962-13dfa5adce7f
allocate_osd_id: Allocating OSD id...
command: Running command: /usr/bin/ceph-authtool --gen-print-key
__init__: stderr
command_with_stdin: Running command with stdin: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 73e1f914-8c5e-4f37-8962-13dfa5adce7f
command_with_stdin: 1

command_check_call: Running command: /usr/bin/ceph-authtool /var/lib/ceph/tmp/mnt.dvU2s5/keyring --create-keyring --name osd.1 --add-key AQAtdVRaOhkEERAApP3rHrXorFWflG4srAN4lg==
creating /var/lib/ceph/tmp/mnt.dvU2s5/keyring
added entity osd.1 auth auth(auid = 18446744073709551615 key=AQAtdVRaOhkEERAApP3rHrXorFWflG4srAN4lg== with 0 caps)
command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.dvU2s5/keyring
command: Running command: /bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.dvU2s5/whoami.271.tmp
activate: OSD id is 1
activate: Initializing OSD...
command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/tmp/mnt.dvU2s5/activate.monmap
got monmap epoch 3
command_check_call: Running command: /usr/bin/ceph-osd --cluster ceph --mkfs -i 1 --monmap /var/lib/ceph/tmp/mnt.dvU2s5/activate.monmap --osd-data /var/lib/ceph/tmp/mnt.dvU2s5 --osd-uuid 73e1f914-8c5e-4f37-8962-13dfa5adce7f --setuser ceph --setgroup disk
2018-01-09 07:54:21.821739 7f838612ee00 -1 bluestore(/var/lib/ceph/tmp/mnt.dvU2s5/block) _check_or_set_bdev_label bdev /var/lib/ceph/tmp/mnt.dvU2s5/block fsid 24e6c6b7-4b92-4452-abb7-bbd253a901be does not match our fsid 73e1f914-8c5e-4f37-8962-13dfa5adce7f
2018-01-09 07:54:22.075363 7f838612ee00 -1 bluestore(/var/lib/ceph/tmp/mnt.dvU2s5) mkfs fsck found fatal error: (5) Input/output error
2018-01-09 07:54:22.075414 7f838612ee00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
2018-01-09 07:54:22.075512 7f838612ee00 -1  ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.dvU2s5: (5) Input/output error
mount_activate: Failed to activate
unmount: Unmounting /var/lib/ceph/tmp/mnt.dvU2s5
command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.dvU2s5
/usr/lib/python2.7/dist-packages/ceph_disk/main.py:5677: UserWarning:
*******************************************************************************
This tool is now deprecated in favor of ceph-volume.
It is recommended to use ceph-volume for OSD deployments. For details see:

    http://docs.ceph.com/docs/master/ceph-volume/#migrating

*******************************************************************************

  warnings.warn(DEPRECATION_WARNING)
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in <module>
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
    args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3761, in main_activate
    reactivate=args.reactivate,
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3524, in mount_activate
    (osd_id, cluster) = activate(path, activate_key_template, init)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3701, in activate
    keyring=keyring,
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3153, in mkfs
    '--setgroup', get_ceph_group(),
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 570, in command_check_call
    return subprocess.check_call(arguments)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '-i', u'1', '--monmap', '/var/lib/ceph/tmp/mnt.dvU2s5/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.dvU2s5', '--osd-uuid', u'73e1f914-8c5e-4f37-8962-13dfa5adce7f', '--setuser', 'ceph', '--setgroup', 'disk']' returned non-zero exit status 1
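
One way to confirm the stale label the fsck is tripping over is to read it directly (a sketch; it assumes ceph-bluestore-tool is present in the image and that /dev/sdd2 is the block partition behind the mount above):

# Print the BlueStore label on the block partition; an fsid that does not match the new OSD uuid confirms a leftover
docker run --rm --privileged=true -v /dev/:/dev/ --entrypoint ceph-bluestore-tool ceph/daemon show-label --dev /dev/sdd2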
r0ss3 commented 6 years ago

A full-disk dd if=/dev/zero of=/dev/sda bs=4096k did fix the problem. It seems zapping a drive is not enough anymore?

leseb commented 6 years ago

Normally, the new zap in ceph-disk also does a 10MB dd; perhaps we should do 100MB to be sure.
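
A 100MB wipe of both ends of the disk would look something like this (a sketch; the 100 MB figure and the device name are assumptions, and the end-of-disk pass clears the backup GPT and any trailing metadata):

DEV=/dev/sda
# Wipe the first 100 MB
dd if=/dev/zero of=$DEV bs=1M count=100 oflag=direct
# Wipe the last 100 MB (blockdev --getsz reports 512-byte sectors, so divide by 2048 for MB)
SIZE_MB=$(( $(blockdev --getsz $DEV) / 2048 ))
dd if=/dev/zero of=$DEV bs=1M seek=$(( SIZE_MB - 100 )) count=100 oflag=direct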

nsweetman commented 6 years ago

It is worth a try.

I have tried every combination of drive, with the exception of a virtual drive, as my intent was to hyperconverge, with the Ceph Docker containers providing the back-end storage.

SAS, SATA, SSD, single drive, HBA, RAID, repurposed drives, and new out-of-the-box drives.

I keep bouncing between the two previously posted errors and have had only one success. That was with a new out-of-the-box SSD, but even it failed once before it took.

I have also tried different hardware (HP DL360 G7 and HP DL380 G8) and both CentOS 7 and Ubuntu 16.04 LTS; the results have always been the same.

The one thing I am trying to do is run two ceph/daemon osd Docker instances, one per drive.
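
For reference, that would just be the same docker run repeated with a different OSD_DEVICE per drive (a sketch based on the command earlier in this thread):

docker run -d --pid=host --net=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ --privileged=true -e OSD_FORCE_ZAP=1 -e OSD_DEVICE=/dev/sdb ceph/daemon osd
docker run -d --pid=host --net=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ --privileged=true -e OSD_FORCE_ZAP=1 -e OSD_DEVICE=/dev/sdc ceph/daemon osd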

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.