coreos / torus

Torus Distributed Storage
https://coreos.com/blog/torus-distributed-storage-by-coreos.html
Apache License 2.0

torusblk flexprepvol: mkfs failed (race?) #447

Open frozenice opened 7 years ago

frozenice commented 7 years ago

Env: Running a pod with a volume via flex volume plugin in Kubernetes 1.5.1.

For some reason flexprepvol is unable to format the device, which fails the whole mount process and the pod. I tried v0.1.2 and b783b16 (latest master at this time) on Ubuntu Server 16.04 and 16.10.

This was in the logs:

torus[12325]: mke2fs 1.43.3 (04-Sep-2016)
torus[12325]: mkfs.ext4: Device size reported to be zero.  Invalid partition specified, or
torus[12325]:         partition table wasn't reread after running fdisk, due to
torus[12325]:         a modified partition being busy and in use.  You may need to reboot
torus[12325]:         to re-read your partition table.

This is the exact message I get when trying to format an unattached /dev/nbd* device.

Manually attaching the device via torusblk and running mkfs myself worked, so I added a 5-second delay (time.Sleep) just before sysd := connectSystemd() in flex.go#mountAction and used the newly compiled binary as my flex plugin. This worked!

So I'm guessing that there is a race condition between attach and mount / flexprepvol, where the device needs a little time to be fully initialized. Kubernetes seems to try to mount the volume too quickly after the attach.

frozenice commented 7 years ago

I tried it with 2 seconds instead and it failed; I don't know whether that was related or a different error. Back to 5 seconds.

Also I was getting these for my previously working volume (now unusable):

W | distributor: remote asking for non-existent block: br c : 18 : 1

Deleting the volume and creating a new one worked. Will file another issue if it happens again.

I'm also getting some of these every few minutes. I don't know if they're related; they've always been there, IIRC:

W | torus: couldn't register heartbeat: rpc error: code = 4 desc = context deadline exceeded
W | torus: couldn't update peerlist: rpc error: code = 4 desc = context deadline exceeded