NixOS / nixops-hetzner

GNU Lesser General Public License v3.0

md_node_from_name failed #14

Open nh2 opened 7 years ago

nh2 commented 7 years ago

I was trying to deploy this Hetzner partition config for some glusterfs experiment of mine:

      deployment.hetzner.partitions = ''
        clearpart --all --initlabel --drives=sda,sdb

        part raid.brick1.a --ondisk=sda --size=3400000
        part raid.brick1.b --ondisk=sdb --size=3400000

        part raid.swap.a   --ondisk=sda --size=32000
        part raid.swap.b   --ondisk=sdb --size=32000

        part raid.root.a   --ondisk=sda --grow
        part raid.root.b   --ondisk=sdb --grow

        raid /              --level=1 --device=root           --fstype=ext4 --label=root           raid.root.a   raid.root.b
        raid swap           --level=1 --device=swap           --fstype=swap --label=swap           raid.swap.a   raid.swap.b
        raid ${brickFsPath} --level=1 --device=gluster-brick1 --fstype=ext4 --label=gluster-brick1 raid.brick1.a raid.brick1.b
      '';

and got this on 1 of the 3 machines:

.[up]
building Nix bootstrap installer... 
done. (/nix/store/4v16dw4gvm9ih3ki55gh8j1d6q6g7iaw-hetzner-nixops-installer/bin/hetzner-bootstrap)
creating nixbld group in rescue system... 
done.
checking if tmpfs in rescue system is large enough... 
yes: 15957 MB
copying bootstrap files to rescue system... 
done.
disabling potentially active LVM arrays... 
  No volume groups found
partitioning disks... 
Traceback (most recent call last):
  File "/nix/store/ni2js4lwp1w6l14azfcjlgwn2im38m1b-nixpart-0.4.1/bin/.nixpart-wrapped", line 166, in <module>
    main()
  File "/nix/store/ni2js4lwp1w6l14azfcjlgwn2im38m1b-nixpart-0.4.1/bin/.nixpart-wrapped", line 152, in main
    storage = ks.run(init=False)
  File "/nix/store/ni2js4lwp1w6l14azfcjlgwn2im38m1b-nixpart-0.4.1/lib/python2.7/site-packages/nixkickstart.py", line 1040, in run
    self.partition()
  File "/nix/store/ni2js4lwp1w6l14azfcjlgwn2im38m1b-nixpart-0.4.1/lib/python2.7/site-packages/nixkickstart.py", line 993, in partition
    self.storage.doIt()
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/__init__.py", line 310, in doIt
    self.devicetree.processActions()
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/devicetree.py", line 237, in processActions
    action.execute()
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/deviceaction.py", line 272, in execute
    self.device.create()
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/devices.py", line 791, in create
    self._postCreate()
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/devices.py", line 3219, in _postCreate
    md_node = mdraid.md_node_from_name(self.name)
  File "/nix/store/fl2pafsa1c4y6z4hnqv8235jpbj6wja1-blivet-0.17-1/lib/python2.7/site-packages/blivet/devicelibs/mdraid.py", line 261, in md_node_from_name
    raise MDRaidError("md_node_from_name failed: %s" % e)
blivet.errors.MDRaidError: md_node_from_name failed: [Errno 2] No such file or directory: '/dev/md/gluster-brick1'

Deploying again from scratch made the error go away.

SSH'ing into that machine showed that /dev/md/gluster-brick1 does indeed exist.

I wonder if it's a race condition where the device file takes some time to be created after the array is assembled.

But I'm not even sure in which component this race would sit.
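
If the race hypothesis holds, one way to test it would be to retry the lookup of the `/dev/md/<name>` link for a short while instead of failing on the first `ENOENT`. The following is only a minimal sketch under that assumption: `wait_for_symlink` is a hypothetical helper, not part of blivet or nixpart, and the timeout values are arbitrary. It assumes the device name is a symlink that shows up a moment after the array is created, which this issue has not confirmed.

```python
import errno
import os
import time


def wait_for_symlink(path, timeout=10.0, interval=0.1):
    """Return the target of the symlink at `path`.

    Hypothetical helper: retries while `path` does not exist yet and
    gives up by re-raising the OSError once `timeout` seconds have passed.
    """
    deadline = time.time() + timeout
    while True:
        try:
            return os.readlink(path)
        except OSError as e:
            # Retry only on "No such file or directory"; any other error
            # (or running out of time) is passed on to the caller.
            if e.errno != errno.ENOENT or time.time() >= deadline:
                raise
        time.sleep(interval)
```

Called as `wait_for_symlink("/dev/md/gluster-brick1")` in the rescue system, this would either return the link target once it appears or raise the same `[Errno 2]` error after the timeout, which would help tell a slow device node apart from one that is never created.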

domenkozar commented 7 years ago

cc @aszlig

nh2 commented 7 years ago

Ah, looks like it isn't too rare, just got it again.