TritonDataCenter / smartos-live

For more information, please see http://smartos.org/. For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

poweroff potentially harmful when using special vdev for separating metadata #842

Open GernotS opened 5 years ago

GernotS commented 5 years ago

Creating a pool with a "special" vdev on SSDs featuring power-loss protection can leave the pool corrupted after the command "poweroff".

Steps to reproduce (a command-level sketch follows below):

1. Boot SmartOS in rescue mode (to speed things up).
2. zpool create zones mirror c2t0d0 c2t2d0 special mirror c2t2d0 c2t3d0
   Here c2t0d0 and c2t2d0 are simple HDDs; c2t2d0 and c2t3d0 are enterprise-class SSDs featuring power-loss protection, and they are treated as such in sd.conf using cache-nonvolatile:true.
3. Without writing any data, do a "reboot". The pool can be imported afterwards with no issues, as expected.
4. Again without writing any data, do a "poweroff". After power on, the pool reports corrupted metadata and cannot be imported, even though all vdevs are online.
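
(As a sketch, with placeholder device names rather than my exact layout:)

```
# Data mirror on two plain HDDs, special mirror on two SSDs that advertise
# power-loss protection (cache-nonvolatile:true in sd.conf). Device names
# here are placeholders.
zpool create zones mirror c2t0d0 c2t1d0 special mirror c2t2d0 c2t3d0

# A clean reboot is harmless: the pool imports again without issues.
reboot
zpool import zones

# A poweroff is not: after powering the machine back on, the import fails
# with corrupted metadata even though all vdevs report online.
poweroff
zpool import zones
```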

Workaround: set cache-nonvolatile:false in sd.conf for the above SSDs, potentially giving up some performance.
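
For reference, such an sd.conf override looks roughly like the following; the vendor/product string is illustrative only and has to match the SCSI inquiry data of the actual drives:

```
# /kernel/drv/sd.conf (built into a custom platform image) -- illustrative.
# The vendor field is 8 characters, space-padded; the product string must
# match the drive's inquiry data.
#
#   cache-nonvolatile:true   sd suppresses cache-flush commands (original setup)
#   cache-nonvolatile:false  cache flushes are sent (the workaround above)
sd-config-list =
    "ATA     OCZ-INTREPID3800", "cache-nonvolatile:false";
```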

GernotS commented 5 years ago

Those SSDs are OCZ INTREPID 3800 100GB

KodyKantor commented 5 years ago

Thanks @GernotS. Which version(s) of SmartOS have you encountered this issue on?

GernotS commented 5 years ago

Latest version, built myself.

KodyKantor commented 5 years ago

@GernotS Thanks for that information.

I've been trying to reproduce this issue locally on an SSD-based system and haven't had any luck. I have a few more questions that might help us debug this on our end.

Does the corruption occur when you're not using a special vdev? Have you experienced this problem on a platform version from before the special vdev type was added?

Have you used these SSDs in a power loss situation previously without leading to a corrupt pool? What sort of controller manages your SSDs? Is it an HBA set to IT mode?

GernotS commented 5 years ago

Hello Kody,

Those SSDs have proven to be very reliable devices for SLOG, but I have never had a power failure, so I can't say much about that.

Without special vdevs I have never had such loss issues, on any version of SmartOS.

I am using a PowerEdge T30 and its internal SATA ports.

Did you try the setting of cache-nonvolatile:true in sd.conf as well?

Regards

Gernot

KodyKantor commented 5 years ago

Hi Gernot. Thanks for the information.

> Did you try the setting of cache-nonvolatile:true in sd.conf as well?

Yes, I built a PI with a modified sd.conf and verified that each of the SSDs in my system had the un_f_suppress_cache_flush bit set to true. I tried a few things: using poweroff, forcibly powering the system off via the IPMI console, and doing both of those things while the system was performing IO.
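
(For readers following along: one rough, untested way to inspect that flag on a live system is with mdb -k. The instance number below is a placeholder and the exact incantation is an assumption, so treat it as a sketch.)

```
# Illustrative only: print the cache-flush-suppression flag for sd instance 2
# (pick the instance of your SSD, e.g. from /etc/path_to_inst or iostat -En).
echo '*sd_state::softstate 2 | ::print -t struct sd_lun un_f_suppress_cache_flush' | mdb -k
```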

For my own notes: poweroff will send a sync() unless the -n flag is used, and that sync() waits for spa_sync to finish before returning. Setting cache-nonvolatile to true means that any SCSI SYNCHRONIZE_CACHE commands will not be sent to the underlying SSDs.

spa_sync will try to send SYNCHRONIZE_CACHE commands during spa_sync_rewrite_vdev_config via vdev_config_sync. A lot of this code was refactored somewhat recently in illumos#10853, but it doesn't appear that the logic changed all that much.
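
(One hedged way to watch whether flush activity actually reaches the disks is to trace the functions mentioned above with DTrace; the probe names below are assumptions based on those function names and on the sd driver, not a verified recipe.)

```
# Illustrative only: print each config-sync and cache-flush event while
# running "sync" in another terminal. Probe names are assumptions.
dtrace -qn '
    fbt::vdev_config_sync:entry,
    fbt::zio_flush:entry,
    fbt::sd_send_scsi_SYNCHRONIZE_CACHE:entry
    { printf("%Y %s\n", walltimestamp, probefunc); }'
```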

Would it be possible for you to send a crash dump our way? I'd like to check a bunch of state in the sd driver, and it could be that ZFS is holding onto some state that might help us get to the bottom of this.

A couple more questions: Does this corruption only occur when using mirrored vdevs? Does the corruption only occur when the pool with 'special' devices is the zones pool?

If you can reproduce the corruption using a pool other than the zones pool then I think you could get a crash dump fairly easily. You could create the zones pool using a disk or something with cache-nonvolatile: false, and then create a second pool, testpool, that has special vdevs. poweroff -d should generate a crash dump that we could then take a look at.
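
(A rough sketch of that two-pool setup, with placeholder device names; the exact layout is an assumption about what such a test rig might look like.)

```
# Placeholder device names -- adjust to your hardware.
# zones lives on a disk that keeps cache-nonvolatile:false, so the system
# should still boot; testpool carries the special vdevs under test.
zpool create zones c1t0d0
zpool create testpool mirror c2t0d0 c2t1d0 special mirror c2t2d0 c2t3d0

# Shut down with -d so a crash dump is written and saved on the next boot.
poweroff -d
```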

If sending us a crash dump isn't something that you would like to do then I can send you DTrace or MDB commands to run and we can try to go from there.

Thanks for all the help, Gernot.

KodyKantor commented 5 years ago

Oh, and one more question. Does your suggested zpool create command contain a typo?

 zpool create zones mirror c2t0d0 c2t2d0 special mirror c2t2d0 c2t3d0

We noticed that c2t2d0 is specified twice here. I tried this locally and ZFS returned an error preventing me from performing this operation. I just wanted to make sure that all four of these are unique devices.

GernotS commented 5 years ago

Hello Kody,

I am sorry, but the system has gone into production now (without cache-nonvolatile), so I can't easily do any more testing.

I did try without a mirror, same issue. Without a special vdev I had no issues.

And that zpool create was a typo; the command as written wouldn't have worked at all.

Regards

Gernot

KodyKantor commented 5 years ago

Hi Gernot.

Thanks for the update. If you do run into any corruption issues, feel free to drop a line here or on the mailing list and we'll see about diving deeper. In the meantime I'll continue trying to reproduce the problem on a system or two I have locally.