ceph / ceph-cookbook

Chef cookbooks for Ceph
Apache License 2.0

health HEALTH_WARN 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean #187

Open abhishek6590 opened 9 years ago

abhishek6590 commented 9 years ago

Hi,

I am having an issue with ceph health: health HEALTH_WARN 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean. Please suggest what I should check.

Thanks, Abhishek

hufman commented 9 years ago

That sounds like there aren't any OSD processes running and connected to the cluster. If you check the output of ceph osd tree, does it show that the cluster expects to have an OSD? If not, this means that the ceph-disk-prepare script didn't run, which comes from the ceph::osd recipe. If so, this means that the ceph::osd script ran and initialized an OSD, but for some reason that OSD didn't connect to the cluster. Check the OSD server to make sure the process is running, and then look at the logs in /var/log/ceph/ceph-osd* to see why the OSD isn't connecting.
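
For anyone following along, a rough sketch of those checks as shell commands (osd.0 and the log path are assumptions based on a default single-OSD setup):

ceph osd tree                                  # does the cluster expect any OSDs at all?
ps aux | grep [c]eph-osd                       # is an OSD daemon actually running on the OSD host?
sudo tail -n 50 /var/log/ceph/ceph-osd.0.log   # if it is running, why isn't it connecting?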

abhishek6590 commented 9 years ago

Hi, ceph osd tree is showing this output:

ceph osd tree

id weight type name up/down reweight

-1  0.09  root default
-2  0.09      host server3
 0  0.09          osd.0  up  1

and the logs (tail -f ceph-osd.0.log) are showing:

2015-02-03 12:50:44.115354 7f0d0d1b7900 0 cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-03 12:50:44.157671 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400, adjusting msgr requires for clients
2015-02-03 12:50:44.157682 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2015-02-03 12:50:44.157687 7f0d0d1b7900 0 osd.0 4 crush map has features 1107558400, adjusting msgr requires for osds
2015-02-03 12:50:44.157703 7f0d0d1b7900 0 osd.0 4 load_pgs
2015-02-03 12:50:44.201885 7f0d0d1b7900 0 osd.0 4 load_pgs opened 64 pgs
2015-02-03 12:50:44.212991 7f0d0d1b7900 -1 osd.0 4 set_disk_tp_priority(22) Invalid argument: osd_disk_thread_ioprio_class is but only the following values are allowed: idle, be or rt
2015-02-03 12:50:44.290354 7f0cfb587700 0 osd.0 4 ignoring osdmap until we have initialized
2015-02-03 12:50:44.290416 7f0cfb587700 0 osd.0 4 ignoring osdmap until we have initialized
2015-02-03 12:50:44.371616 7f0d0d1b7900 0 osd.0 4 done with init, starting boot process

Please advise.

Thanks,

hufman commented 9 years ago

Ah yes, you'll need at least 3 OSDs for Ceph to be happy and healthy. Depending on how your CRUSH map is configured (I forget the defaults), these OSDs may also have to be on separate hosts.
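
For a small test cluster, a hedged sketch of the usual workaround: put both of these in the [global] section of ceph.conf before creating the cluster. The second line relaxes the default CRUSH rule so replicas may land on OSDs of the same host, which only makes sense for single-host test setups.

[global]
osd pool default size = 2
osd crush chooseleaf type = 0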

zdubery commented 7 years ago

Hi

I am a bit confused by this statement: "you'll need at least 3 OSDs to be happy and healthy". I followed the instructions (here: http://docs.ceph.com/docs/hammer/start/quick-ceph-deploy/) and once I get to the command ceph health, the response is: health HEALTH_ERR 64 pgs incomplete; 64 pgs stuck inactive; 64 pgs stuck unclean. That is right after I install it...

The Ceph documentation clearly states: "Change the default number of replicas in the Ceph configuration file from 3 to 2 so that Ceph can achieve an active + clean state with just two Ceph OSDs. Add the following line under the [global] section: osd pool default size = 2"
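
Note that osd pool default size only applies to pools created after the setting is in place; for pools that already exist (the default rbd pool, for example, which is an assumption here), the size has to be lowered by hand, roughly:

ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
ceph health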

I have attempted this install at least 3 times now and the response is the same every time. I am running 1 admin node, 1 monitor and 2 OSDs on 4 VirtualBox Ubuntu 14.04 LTS VMs within Ubuntu 16 (the previous attempt was within Ubuntu 14).

The debug information is not very helpful at all. Ceph is also not writing to /var/log/ceph/ at all, even after I set ownership with sudo chown ceph:root /var/log/ceph.

ceph-deploy osd activate tells me that the OSDs are active, but ceph osd tree shows otherwise (down).

The config is read from /etc/ceph/ceph.conf all the time (even though I run everything from the my-cluster directory), which is incorrect. When I ran the install, the config was created in /home/user/my-cluster/ceph.conf, yet it is read from /etc/ceph/ceph.conf.

So I will attempt 3 OSD's now even though the site states otherwise...

Any suggestions would be very helpful.

Thanks,

zd

sweetie233 commented 7 years ago

Hi, I have the same problem as you, and I have reinstalled Ceph more than 3 times. I'm really upset. Have you figured it out? I look forward to your suggestions.

zdubery commented 7 years ago

Hi

If you are using the ext4 file system, you need to place this in the [global] section of the config:

filestore xattr use omap = true

Restart and see if HEALTH_OK is achieved.
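
Concretely, that line goes into ceph.conf on the OSD nodes, then the OSD daemons are restarted; a minimal sketch, assuming a pre-Jewel filestore on ext4:

[global]
filestore xattr use omap = true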

Cheers


sweetie233 commented 7 years ago

Hi

First, thank you so much for your suggestion! My file system is ext4, and I did exactly what you suggested, but it seems to make no difference.

I reviewed the OSD's log thoroughly and found the following:

osd.0 0 backend (filestore) is unable to support max object name[space] len
osd.0 0    osd max object name len = 2048
osd.0 0    osd max object namespace len = 256
osd.0 0 (36) File name too long
journal close /var/lib/ceph/osd/ceph-0/journal
** ERROR: osd init failed: (36) File name too long

Then I found this page: http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

I just reinstalled Ceph again, and placed the following in the [global] section of the config:

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64
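
For anyone hitting the same "File name too long" error on ext4, in context the fix looks roughly like this (restart the OSD daemons afterwards and re-check with ceph -s):

[global]
osd_max_object_name_len = 256
osd_max_object_namespace_len = 64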

It works!!! I'm so happy, and I appreciate your reply very much!!!

Thanks again! Best wishes~

zdubery commented 7 years ago

Hi

You are welcome.

I am glad you solved it.

Best Wishes

Zayne


subhashchand commented 7 years ago

If you are using the ext4 file system, you need to place this in the [global] section of the config:

vim /etc/ceph/ceph.conf

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/

ceph status

getarz4u15ster commented 7 years ago

I'm having the same problem; however, I am using the preferred XFS filesystem. Any suggestions?

[From the monitor node I get the following]
HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; no osds

[From the OSD node]
2017-01-27 07:55:28.000882 7fde7846d700 0 -- :/429908835 >> ipaddress:6789/0 pipe(0x7fde74063f30 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fde7405c5a0).fault

[From the monitor node, out of /var/log/ceph/ceph.log]
2017-01-27 06:47:11.121804 mon.0 ipaddress:6789/0 1 : cluster [INF] mon.oso-node1@0 won leader election with quorum 0
2017-01-27 06:47:11.121931 mon.0 ipaddress:6789/0 2 : cluster [INF] monmap e1: 1 mons at {oso-node1=ipaddress:6789/0}
2017-01-27 06:47:11.122008 mon.0 ipaddress:6789/0 3 : cluster [INF] pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2017-01-27 06:47:11.122090 mon.0 ipaddress:6789/0 4 : cluster [INF] fsmap e1:
2017-01-27 06:47:11.122203 mon.0 ipaddress:6789/0 5 : cluster [INF] osdmap e1: 0 osds: 0 up, 0 in
2017-01-27 06:54:50.687322 mon.0 ipaddress:6789/0 1 : cluster [INF] mon.oso-node1@0 won leader election with quorum 0
2017-01-27 06:54:50.687415 mon.0 ipaddress:6789/0 2 : cluster [INF] monmap e1: 1 mons at {oso-node1=ipaddress:6789/0}
2017-01-27 06:54:50.687497 mon.0 ipaddress:6789/0 3 : cluster [INF] pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2017-01-27 06:54:50.687577 mon.0 ipaddress:6789/0 4 : cluster [INF] fsmap e1:
2017-01-27 06:54:50.687716 mon.0 ipaddress:6789/0 5 : cluster [INF] osdmap e1: 0 osds: 0 up, 0 in
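
The pipe ... .fault on the OSD node together with 0 osds: 0 up, 0 in on the monitor usually means the OSD daemons never manage to reach the monitor. A hedged sketch of basic connectivity checks (the monitor IP is a placeholder):

nc -zv <mon-ip> 6789          # can the OSD node reach the monitor port?
sudo iptables -L -n           # is a firewall blocking 6789 (mons) or 6800-7300 (OSDs)?
cat /etc/ceph/ceph.conf       # does mon_host / public network on the OSD node point at the right address?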

swq499809608 commented 7 years ago

f_redirected e754) currently waiting for peered
2017-03-02 10:58:39.952422 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 324.251003 secs
2017-03-02 10:58:39.952444 osd.25 [WRN] slow request 240.250943 seconds old, received at 2017-03-02 10:54:39.701431: osd_op(client.512724.0:135407 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:40.091373 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 324.389960 secs
2017-03-02 10:58:40.091378 osd.27 [WRN] slow request 240.389941 seconds old, received at 2017-03-02 10:54:39.701397: osd_op(client.512724.0:135408 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:40.952740 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 325.251301 secs
2017-03-02 10:58:40.952791 osd.25 [WRN] slow request 240.243998 seconds old, received at 2017-03-02 10:54:40.708674: osd_op(client.36294.0:8895939 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:41.091613 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 325.390198 secs
2017-03-02 10:58:41.091619 osd.27 [WRN] slow request 240.382847 seconds old, received at 2017-03-02 10:54:40.708729: osd_op(client.36294.0:8895940 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:43.953496 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 328.252086 secs
2017-03-02 10:58:43.953517 osd.25 [WRN] slow request 240.022847 seconds old, received at 2017-03-02 10:54:43.930609: osd_op(client.36291.0:8893352 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:44.092310 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 328.390885 secs
2017-03-02 10:58:44.092315 osd.27 [WRN] slow request 240.161657 seconds old, received at 2017-03-02 10:54:43.930605: osd_op(client.36291.0:8893353 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:44.953818 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 329.252386 secs
2017-03-02 10:58:44.953827 osd.25 [WRN] slow request 240.251734 seconds old, received at 2017-03-02 10:54:44.702023: osd_op(client.512724.0:135415 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:45.092587 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 329.391155 secs
2017-03-02 10:58:45.092597 osd.27 [WRN] slow request 240.390484 seconds old, received at 2017-03-02 10:54:44.702049: osd_op(client.512724.0:135416 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:45.954085 osd.25 [WRN] 100 slow requests, 1 included below; oldest blocked for > 330.252673 secs
2017-03-02 10:58:45.954103 osd.25 [WRN] slow request 240.244915 seconds old, received at 2017-03-02 10:54:45.709129: osd_op(client.36294.0:8895947 97.84ada7c9 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
2017-03-02 10:58:46.092838 osd.27 [WRN] 100 slow requests, 1 included below; oldest blocked for > 330.391422 secs
2017-03-02 10:58:46.092850 osd.27 [WRN] slow request 240.383640 seconds old, received at 2017-03-02 10:54:45.709160: osd_op(client.36294.0:8895948 97.31099063 (undecoded) ondisk+write+known_if_redirected e754) currently waiting for peered
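
Requests stuck "currently waiting for peered" mean the target PGs cannot finish peering. A hedged sketch of commands that usually narrow down which PGs and OSDs are involved (the PG id is a placeholder):

ceph health detail            # lists the stuck PGs and blocked OSDs
ceph pg dump_stuck inactive   # shows PGs stuck inactive or peering
ceph pg <pgid> query          # asks one stuck PG why it is blocked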

ghost commented 6 years ago

After adding the following lines to the /etc/ceph/ceph.conf file and rebooting the system, the issue somehow still exists.

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64

ceph status

cluster b3609cba-0b6d-4311-8aa3-6968c0e66f5e
 health HEALTH_WARN
        64 pgs degraded
        64 pgs stuck degraded
        64 pgs stuck unclean
        64 pgs stuck undersized
        64 pgs undersized
 monmap e1: 1 mons at {0=10.11.108.188:6789/0}
        election epoch 3, quorum 0 0
 osdmap e15: 2 osds: 2 up, 2 in
        flags sortbitwise,require_jewel_osds
  pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
        69172 kB used, 3338 GB / 3338 GB avail
              64 active+undersized+degraded

mosyang commented 6 years ago

I ran into the ext4 file system issue before. I tried the settings below in ceph.conf but finally gave up.

osd_max_object_name_len = 256
osd_max_object_namespace_len = 64
osd check max object name len on startup = false

However, I then followed this helpful document to deploy Ceph Jewel 10.2.9 on Ubuntu 16.04: log in to all OSD nodes and format the /dev/sdb partition with the XFS file system. After that, I followed the official documentation to deploy Ceph on my Ubuntu 16.04 servers. Everything works fine now.
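
For reference, a rough sketch of that reformat-and-redeploy with Jewel-era ceph-deploy (the node name and /dev/sdb are assumptions taken from the comment above, and disk zap destroys all data on the disk; newer ceph-deploy releases use the --data syntax instead):

ceph-deploy disk zap osd-node1:/dev/sdb       # wipe the data disk
ceph-deploy osd create osd-node1:/dev/sdb     # create an XFS-backed OSD on it
ceph -s                                       # verify the OSD comes up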

Runomu commented 6 years ago

I have exactly the same problem with 14.04 LTS and ext4. I tried almost everything, including all the suggestions above, but I'm still getting the following from ceph -s, and the output after it from ceph osd tree:

health HEALTH_ERR
        64 pgs are stuck inactive for more than 300 seconds
        64 pgs stuck inactive
        64 pgs stuck unclean

ID WEIGHT TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default
 0      0     osd.0      down           0          1.00000

mattshma commented 6 years ago

After appending those lines to the admin node's ceph.conf:

osd max object name len = 256
osd max object namespace len = 64

I think you should then run ceph-deploy --overwrite-conf admin osd1 osd2 to push the changes to the OSD nodes. You should also make sure the ceph user has read permission on /etc/ceph/ceph.client.admin.keyring on the OSD nodes.
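
Put together, the sequence looks roughly like this (osd1/osd2 come from the comment above; the systemd target name assumes a Jewel-or-later systemd install):

ceph-deploy --overwrite-conf admin osd1 osd2        # push ceph.conf and the admin keyring to the OSD nodes
sudo chmod +r /etc/ceph/ceph.client.admin.keyring   # on each OSD node: make the keyring readable
sudo systemctl restart ceph-osd.target              # on each OSD node: restart the OSD daemons
ceph -s                                             # re-check cluster health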

alamintech commented 3 years ago

When my server reboots, I see errors that OSDs are down and PGs are inactive. Please help me figure out how I can solve this. This storage is used as CloudStack primary storage.

image

Thanks.

alamintech commented 3 years ago

Please help me, anyone. [image: https://user-images.githubusercontent.com/68062764/92364757-5a4fae00-f115-11ea-90ee-a61246a87297.png]

zdover23 commented 3 years ago

Does this look like your error?

https://tracker.ceph.com/issues/17722


alamintech commented 3 years ago

I looked at it, but I can't find a solution for this there.

image

alamintech commented 3 years ago

After the server reboots, the OSD service can't start. Please help me, anyone.