Open lukemarsden opened 8 years ago
Hey @lukemarsden, we hit a couple of bumps in the road here due to some limitations of the Azure Disk API, and we weren't expecting those fixes for a while. Let me circle back with some of my colleagues in Redmond to see what's up.
Hi @sedouard, any update on this? We're getting more and more requests for an Azure driver (I work at ClusterHQ). Thanks!
Hey @ferrantim @lukemarsden, I'm starting work on this again today, but I anticipate it will only be a preview/demo, as there are some problems with our Disk API. I'm hoping the implementation will highlight these problems for Azure engineering. Some improvements to these APIs are anticipated, but not for a while, unfortunately. Keep an eye on the repo for updates.
Hey @ferrantim, @lukemarsden, sorry for the delay here. I've been pushing code to the feat_arm branch just to test the Disk APIs through Azure Resource Manager, since we now have some semi-decent ARM support in the Python SDK.
The main reason for the holdup is reliability problems with the Azure Disk API, which I've been able to demonstrate to Azure engineering with code from this repo. Currently there is an issue where a virtual machine becomes unusable after repeated attaches and detaches of disks, which is no good. It's being tracked internally and I'll keep you updated as the status changes.
@sedouard thanks for the update, any news since Jan?
Hey @wallnerryan, they fixed the 'VM is broken after attach' issue pretty quickly, since that was a bigger deal.
I built out a simple test suite using the ARM disk APIs, and simply attaching and then detaching a disk can leave the VM disagreeing with the Azure API about what is attached and in which slot. Azure engineering hasn't provided us any developers to investigate our instances that currently reproduce the issue, and no one has tried the reproduction steps we've provided them.
If we can get more +1's on the issue, it would help push engineering to fix the disk bugs and make attaching/detaching disks reliable. Flocker isn't the only platform hit by this, however; it's pretty much anyone that needs this functionality.
@madhana is the product manager who might have more insight as to when this will get more attention.
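To make the attach/detach discrepancy concrete, here is a rough sketch of the kind of consistency check such a test suite might do. The real suite uses the ARM disk APIs; everything here (`StubVmClient`, `attach_detach_cycle`, the LUN bookkeeping) is a hypothetical stand-in so the shape of the check is runnable anywhere:

```python
# Hypothetical sketch only: StubVmClient stands in for an ARM client.
# The bug described above would show up as the API's reported LUN map
# drifting out of sync with what the test expects.

class StubVmClient:
    """Stand-in for an ARM client; tracks disks attached to a VM by LUN."""

    def __init__(self):
        self.luns = {}  # lun -> disk name, as reported by the API

    def attach(self, disk, lun):
        self.luns[lun] = disk

    def detach(self, lun):
        self.luns.pop(lun, None)

    def reported_disks(self):
        # In the real API this would be a fresh GET of the VM model.
        return dict(self.luns)


def attach_detach_cycle(client, disk, lun, cycles):
    """Attach and detach a disk repeatedly, verifying the API's view
    matches our expectation after every step. Returns the number of
    mismatches observed."""
    mismatches = 0
    for _ in range(cycles):
        client.attach(disk, lun)
        if client.reported_disks().get(lun) != disk:
            mismatches += 1
        client.detach(lun)
        if lun in client.reported_disks():
            mismatches += 1
    return mismatches
```

With a well-behaved stub the mismatch count is zero; against the real API at the time, the VM and the Azure API could disagree after only a few cycles.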
Thanks @sedouard :+1:, we've had a few users asking for it recently; I'll see if I can get them to +1.
+1
+1
+1
+1
@sedouard @madhana getting more +1's. Let us know how we can help / feel free to reach out via email
Thanks for the +1's all.
Engineering actually started investigating this issue this past Monday. They're still sifting through logs trying to find the root cause of the attach/detach discrepancy. Will keep this thread updated.
Thanks! Good to hear !
+1
Any progress with the issue?
+1
Sorry you had to deal with Azure support. It can get rough.
Hey guys! Unfortunately this issue has outlasted my employment at Microsoft!
I've handed this off to @jmspring. He's driving the support ticket internally with Azure engineering. Since I no longer have access to view the issue, maybe @madhana can fetch more details on the ticket status.
@sedouard just have someone share their active Microsoft/Azure credentials on here so the guys can contribute, I'm sure nothing bad will happen.
There is some traction on this issue.
I'm on vacation until Sunday and will respond more Monday, but it is being investigated.
-j
@sedouard thanks for helping and good luck in your next opportunity! Looking forward to more info from @jmspring, thanks.
The product team has suggested some modifications to the driver (and a couple of other fixes) which I should be getting to later this week/early next week.
Ok, I got delayed by another project. I'm starting work on this tomorrow.
+1
+1
Current status:
Work is progressing on the Azure issues, setup/install and docs are next on my list, then updating the unit tests. Not ready for prime time, but a few steps closer.
Update -
To install:

```
git clone https://github.com/CatalystCode/azure-flocker-driver.git
cd azure-flocker-driver
sudo /opt/flocker/bin/pip install .
```
Configuration: look at azure-flocker-driver/example.azure_agent.yml for agent.yml contents.
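For orientation, an agent.yml for a Flocker backend generally has this shape. The top-level keys (`version`, `control-service`, `dataset`) are standard Flocker agent configuration; the backend name and the commented backend-specific keys below are placeholders I'm guessing at, so check example.azure_agent.yml in the repo for the real field names:

```yaml
# Illustrative shape only -- consult example.azure_agent.yml for the
# actual backend name and required settings.
version: 1
control-service:
  hostname: "control-service.example.com"   # your Flocker control node
dataset:
  backend: "azure_flocker_driver"           # placeholder backend name
  # Backend-specific settings (names hypothetical):
  # subscription_id: "..."
  # storage_account_name: "..."
```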
@jmspring thanks for the update! This is good news. In terms of basic usage it sounds like it's working. Could you give some more details on the issue where ARM doesn't like multiple updates to the same VM happening in parallel, and what handling/hardening is needed?
@wallnerryan - say you have a detach in progress (not yet done) and then shortly thereafter do an attach; the VM representation fetched for the second operation will likely still contain the data disk being detached. A conflict results: the operation may fail, or the drive being detached may end up still attached.
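A minimal sketch of that race, using a stub in place of the real ARM client (all names here are illustrative, not the driver's actual code): each ARM update is effectively a read-modify-write of the whole VM model, so a write built from a stale read can resurrect a disk that a concurrent detach removed. One mitigation is serializing attach/detach calls and re-reading the model immediately before each write:

```python
import threading

class StubVmApi:
    """Stand-in for ARM's VM resource: writes replace the whole model."""

    def __init__(self):
        self._disks = []

    def get_model(self):
        return list(self._disks)   # fresh read of the VM's disk list

    def put_model(self, disks):
        self._disks = list(disks)  # whole-model write (PUT-like semantics)


_vm_lock = threading.Lock()

def attach_disk(api, disk):
    # Serialize with other attach/detach calls and re-read the model
    # right before writing, so we never PUT a stale disk list.
    with _vm_lock:
        disks = api.get_model()
        disks.append(disk)
        api.put_model(disks)

def detach_disk(api, disk):
    with _vm_lock:
        disks = api.get_model()
        if disk in disks:
            disks.remove(disk)
        api.put_model(disks)
```

Without the lock, an attach that read the model before a detach finished could write back a list still containing the detached disk, which is exactly the "drive ends up still attached" failure mode. A process-local lock only covers one writer, of course; the driver also has to tolerate conflicts from other clients updating the same VM.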
Note - work will be pulled into master now.
@jmspring thanks! Will look forward to testing this out soon. cc @pcgeek86
Current state - the driver test takes about 25-30 minutes to run, but running it overnight (about 30 tries), none failed.
When running under Flocker, there is the occasional disk attach timeout -- think ping-ponging a volume between two VMs. I haven't looked closely at raising the Docker/Flocker timeout config.
I believe most of the issues originally encountered are ironed out, but more testing and real use are needed to be sure.
Performance of attach/detach is usually between 30 and 60 seconds per operation. Looking into this is on the todo list.
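Given 30-60s per operation plus occasional timeouts, any caller polling for attach completion needs a deadline with generous slack. A generic helper along these lines (my sketch, not the driver's code; the injectable `clock`/`sleep` parameters are just for testability):

```python
import time

def wait_until(predicate, timeout, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.
    Returns True on success, False if the deadline passes -- letting the
    caller pick a timeout that covers a 30-60s attach plus slack."""
    deadline = clock() + timeout
    while clock() < deadline:
        if predicate():
            return True
        sleep(interval)
    return predicate()  # one last check at the deadline
```

Usage would look like `wait_until(lambda: disk_is_attached(vm, lun), timeout=120)`, where `disk_is_attached` is whatever freshness check the driver exposes.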
+1
What is the current status? Are you planning to work on it further? Is there anything I can do to help?