Azure / acs-engine

upgrade k8s process is broken #2022

Closed jalberto closed 6 years ago

jalberto commented 6 years ago

Is this a request for help?: yes

Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE

What version of acs-engine?: 0.11

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes 1.7.6 -> 1.8.4

What happened:

Running:

acs-engine upgrade \
  --subscription-id xxx \
  --deployment-dir ./_output/xxx \
  --location westeurope \
  --resource-group xxx \
  --upgrade-version 1.8.4 \
  --auth-method client_secret \
  --client-id xxx \
  --client-secret xxx

got:

[...]
INFO[0138] Master pool set count to: 1 temporarily during upgrade...

INFO[0138] Starting ARM Deployment (master-18-01-10T13.40.08-1740684113). This will take some time... 
INFO[0166] Finished ARM Deployment (master-18-01-10T13.40.08-1740684113). 
INFO[0166] Error creating upgraded master VM: k8s-master-11577755-0 
FATA[0166] Error upgrading cluster: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." 
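
The error points at the deployment operations; a minimal way to list them with the Azure CLI of that era (the resource group is the placeholder used above, the deployment name is taken from the log; a sketch, not verified against this cluster):

az group deployment operation list \
  --resource-group xxx \
  --name master-18-01-10T13.40.08-1740684113 \
  --query "[].properties.statusMessage"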

In azure portal:

screenshot from 2018-01-10 13-44-01

What you expected to happen: Upgrade cluster without errors

How to reproduce it (as minimally and precisely as possible):

Follow upgrade docs: https://github.com/Azure/acs-engine/tree/master/examples/k8s-upgrade

Anything else we need to know:

I am using VNet peering configured in the Azure portal, not custom VNets.

This actually left my cluster in bad shape, so it is quite urgent.

jalberto commented 6 years ago

Is this a technical limitation or a non-implemented feature?

The 'upgrade' command should probably check for these common pitfalls and throw a warning before beginning the process.

jalberto commented 6 years ago

Testing whether deleting the peering makes it work: it does, but it takes quite long to upgrade a single master, and even longer for the upgrade command to notice the deployment has finished, as it stays stuck at:

INFO[0210] Starting ARM Deployment (master-18-01-11T12.56.22-617391183). This will take some time...

screenshot from 2018-01-11 14-08-57

Meanwhile, because of the previous error, there is now no leader, so the whole cluster is unusable. The documentation says this process is idempotent, but the 2nd time I ran the command it didn't try to continue with the already-disabled master; it took a new one, creating a mess.

jalberto commented 6 years ago

As expected, now there is this:

INFO[4992] Master VM: k8s-master-11577755-1 status error: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

So now what?

I already opened a critical issue with Azure support but no one seems to be taking care of this.

jalberto commented 6 years ago

So after several retries every master got updated, but any kubectl command gets an i/o timeout.

Any clue will be welcome.

jalberto commented 6 years ago

Dear diary, 3 days since the incident and I have already lost any hope for help. Seems like this place is more isolated every day. Now it seems the upgrade broke CNI (Calico) in some way, and neither etcd nor the k8s services will start:

Unable to update cni config: No networks found in /etc/cni/net.d
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized  

jalberto commented 6 years ago

More fun stuff:

-- Unit etcd.service has begun starting up.                                                                                                                                     
Jan 11 18:31:17 k8s-master-11577755-0 systemd[22069]: etcd.service: Failed at step EXEC spawning /usr/bin/etcd: No such file or directory                                       
-- Subject: Process /usr/bin/etcd could not be executed

brendandburns commented 6 years ago

Can you do an ls /usr/bin/etcd? Also, can you look at /var/log/cloud-init.log and see if cloud-init is correctly installing/running?

jalberto commented 6 years ago

@brendanburns I was investigating etcd when I found it is not installed, so I tried to install it the acs-engine way:

root@k8s-master-11577755-0:~# /opt/azure/containers/setup-etcd.sh                                                                                                               
+ ETCD_VER=v2.5.2                           
+ DOWNLOAD_URL=https://acs-mirror.azureedge.net/github-coreos                           
+ mkdir -p /tmp/etcd-download               
+ curl -L https://acs-mirror.azureedge.net/github-coreos/etcd-v2.5.2-linux-amd64.tar.gz -o /tmp/etcd-v2.5.2-linux-amd64.tar.gz                                                  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current         
                                 Dload  Upload   Total   Spent    Left  Speed           
100   215  100   215    0     0     74      0  0:00:02  0:00:02 --:--:--    74          
+ tar xzvf /tmp/etcd-v2.5.2-linux-amd64.tar.gz -C /usr/bin/ --strip-components=1        

gzip: stdin: not in gzip format             
tar: Child returned status 1                
tar: Error is not recoverable: exiting now  
+ useradd -U etcd                           
useradd: user 'etcd' already exists         
++ head -c 32 /dev/urandom                  
++ base64                                   
+ usermod -p 5ICuQ0258U/OBqze1SjjZi8g+9+/pSL0LtkG7nquQbU= etcd                          
sent invalidate(passwd) request, exiting    
sent invalidate(group) request, exiting     
sent invalidate(passwd) request, exiting    
sent invalidate(group) request, exiting     
+ passwd -u ''                              
passwd: user '' does not exist              
+ systemctl daemon-reload                   
+ systemctl enable etcd.service             
Failed to execute operation: Invalid argument
root@k8s-master-11577755-0:~# cat /tmp/etcd-v2.5.2-linux-amd64.tar.gz
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:02f5e54f-001e-008b-390e-8b5fd6000000
Time:2018-01-11T18:59:05.6635280Z</Message></Error>
~# tail /var/log/cloud-init.log                              
2018-01-11 18:09:44,187 - stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_power_state_change.py'>) with frequency once-per-instance                      
2018-01-11 18:09:44,188 - handlers.py[DEBUG]: start: modules-final/config-power-state-change: running config-power-state-change with frequency once-per-instance                
2018-01-11 18:09:44,188 - helpers.py[DEBUG]: config-power-state-change already ran (freq=once-per-instance)
2018-01-11 18:09:44,188 - handlers.py[DEBUG]: finish: modules-final/config-power-state-change: SUCCESS: config-power-state-change previously ran
2018-01-11 18:09:44,188 - main.py[DEBUG]: Ran 20 modules with 0 failures                
2018-01-11 18:09:44,189 - util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'                                   
2018-01-11 18:09:44,190 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)       
2018-01-11 18:09:44,190 - util.py[DEBUG]: Read 12 bytes from /proc/uptime               
2018-01-11 18:09:44,190 - util.py[DEBUG]: cloud-init mode 'modules' took 0.378 seconds (0.38)
2018-01-11 18:09:44,190 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final

lots of fun

brendandburns commented 6 years ago

What version of acs-engine are you using?

Thanks

jalberto commented 6 years ago

@brendanburns 0.11

Thanks for taking a look!

jalberto commented 6 years ago

I actually cannot find an official 2.5.2 release:

screenshot from 2018-01-11 20-38-51

So maybe 2.5.2 is just a dummy version used as default for some reason?

jackfrancis commented 6 years ago

Hi @jalberto could you try to modify the /opt/azure/containers/setup-etcd.sh script to change the assignment of ETCD_VER to v2.3.7?

jalberto commented 6 years ago

I have doubts. Why not the latest 3.x version? (I don't want to mess it up more.)

jackfrancis commented 6 years ago

I assume your cluster was running a v2 of etcd before; things are likelier to come back online if you use that version. (It's the latest v2 released; v2.5.2 was a typo that should have been v2.2.5, and this was fixed in master a few weeks ago.)
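
A minimal sketch of that edit on the affected master, assuming the ETCD_VER assignment seen in the script output earlier (the sed pattern and the explicit start are assumptions; the script itself handles the download and systemd enable):

sudo sed -i 's/^ETCD_VER=.*/ETCD_VER=v2.3.7/' /opt/azure/containers/setup-etcd.sh
sudo /opt/azure/containers/setup-etcd.sh
sudo systemctl daemon-reload
sudo systemctl start etcd.service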

jalberto commented 6 years ago

that worked! thanks!

I am trying to resume the upgrade now

jackfrancis commented 6 years ago

Thanks for being resilient here. You are contributing to acs-engine-driven upgrades becoming more of a 1st class feature, we appreciate it!

jalberto commented 6 years ago

@CecileRobertMichon I think this is not closed yet; I am still evaluating whether everything is right after the upgrade.

I can see problems mounting azure-disk volumes.

CecileRobertMichon commented 6 years ago

@jalberto sorry about that, I thought the issue had been resolved. Re-opening it now.

jalberto commented 6 years ago

no prob @CecileRobertMichon

So after the upgrade, my PVs using azure-file (shared volumes using CIFS) are not automatically created anymore (I have a StorageClass), and even when created by hand they cannot be mounted because of a permission problem.

More details on how the StorageClass was created: https://github.com/Azure/acs-engine/issues/1506
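
For reference, deleting and recreating an azure-file StorageClass (which, per the next comment, resolved the provisioning part) looks roughly like this; the class name and SKU are assumptions, see #1506 for the original definition:

kubectl delete storageclass azurefile

cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
EOF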

jalberto commented 6 years ago

I fixed azurefile by deleting and recreating the StorageClass.

Now I have a new, quite critical problem:

Multi-Attach error for volume "pvc-3a32c7b6-fd48-11e7-b811-000d3a2da567" Volume is already exclusively attached to one node and can't be attached to another 

This happens each time a pod needs to move from one node to another. I checked in the Azure portal and I can see the disk moving to the correct node, and kubectl says the PVC is bound.

probably related to: #2002

andyzhangx commented 6 years ago

@jalberto You got this error when moving a pod with an azure disk (not azure file) to another node, right? This can happen since detaching an azure disk from a node takes minutes in the current k8s version, but waiting a few minutes should work.

jalberto commented 6 years ago

@andyzhangx even after 4h it is still not working.

andyzhangx commented 6 years ago

@jalberto could you run "kubectl describe po POD-NAME" and paste the output here?

andyzhangx commented 6 years ago

@jalberto And what's your k8s version? Is it a managed disk VM or an unmanaged disk VM? What were your k8s operations: kubectl drain POD-NAME? I will try to repro this issue in my env if possible.

andyzhangx commented 6 years ago

I have checked the k8s upstream code; it's likely due to an incorrect volume status, that is, the volume is actually removed from one node while its status is not updated, so the volume still appears to belong to that node. I tried twice in my env and could not easily repro it, since it's due to the Meltdown patch as you mentioned. So I would suggest following this guide: Volume status out of sync when kubelet restarts. One solution is to restart kubelet on the original node and also on the node with the azure disk (sudo systemctl restart kubelet), and then, after a few minutes, check the pod & volume status.
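
A rough outline of that mitigation (node and pod names are placeholders):

# On the node the disk is stuck on and on the node the pod is scheduled to:
sudo systemctl restart kubelet

# After a few minutes, from a machine with cluster access:
kubectl get pods -o wide
kubectl get pv,pvc
kubectl describe pod <pod-name>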

jalberto commented 6 years ago

Thanks @andyzhangx for your time. I was only able to "fix" it by rebooting the whole node; this doesn't even count as a workaround. Is someone on the Azure side fixing this?

This is critical: if for whatever reason a pod is rescheduled, it will not get the correct state.

andyzhangx commented 6 years ago

@jalberto so your problem (Multi-Attach error) always happens on reschedule? Could you also provide the info below:

  1. k8s version
  2. your pod config

There is a possibility that with the Intel CPU Azure update around 2018.1.5, the VM disk status is inconsistent; the workaround is as I suggested in https://github.com/Azure/acs-engine/issues/2002

az vm update -g <group> -n <name>

If that still does not work, run sudo systemctl restart kubelet in the two agent VMs.

jalberto commented 6 years ago

Hi @andyzhangx, I am on k8s 1.8.4. The pod configuration is not important, as it happens with any pod using a disk.

I already tried that workaround, but it is temporary: it only fixes current inconsistencies, not future ones, so each time a pod moves to a different node it needs manual intervention to finish, causing lots of problems.

andyzhangx commented 6 years ago

@jalberto Are you using a Deployment or a StatefulSet? In my experience, a Deployment with azure disk volumeMounts can have such an issue, while a StatefulSet ensures the disk has been removed before starting a new pod. StatefulSet with azure disk volumeMount example
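
A minimal StatefulSet-with-azure-disk sketch of the kind being suggested (apps/v1 needs k8s 1.9+, use apps/v1beta2 on 1.8; the storageClassName is an assumed azure-disk class, adjust to your cluster):

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: disk-demo
spec:
  serviceName: disk-demo
  replicas: 1
  selector:
    matchLabels:
      app: disk-demo
  template:
    metadata:
      labels:
        app: disk-demo
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: default    # assumed azure-disk StorageClass name
      resources:
        requests:
          storage: 5Gi
EOF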

malonedrakes19 commented 6 years ago

I'm stuck at the setup mode of my ZTE 982. I can't get past the email mode.

jalberto commented 6 years ago

@andyzhangx

❯ az vm update -g k8svl -n k8s-pool01-xxxxx-1
Cannot attach data disk 'k8svl-dynamic-pvc-xxxxxx' to VM 'k8s-pool01-xxxxx-1' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

This happens every time a pod with a persistent volume needs to move.

andyzhangx commented 6 years ago

@jalberto are you using k8s Deployment or StatefulSet?

jalberto commented 6 years ago

This one is a Deployment (the official traefik chart), but the same happens with a StatefulSet, @andyzhangx.

Anyhow, it doesn't matter, as it was working before the latest patches to the VMs.

The main problem is not whether the pod tries to be created before or after; the problem is the inconsistency in the disk volumes.

andyzhangx commented 6 years ago

@jalberto would you try the PowerShell solution: https://blogs.technet.microsoft.com/mckittrick/azure-vm-stuck-in-failed-state-arm/

Get-AzureRmVM -Name "VMname" -ResourceGroupName 'RGName' | Update-AzureRmVM 

andyzhangx commented 6 years ago

And what's your VM status now? This issue is related to VM status inconsistency; you may file an incident with Azure VM support.

jalberto commented 6 years ago

@andyzhangx check my comments in #2002 please

This is not a problem particular to my VMs (also created by acs-engine, BTW), as I am not the only one with this problem; it is also not the first time, as this seems to be a recurrent problem around here.

andyzhangx commented 6 years ago

@jalberto if you are using a Deployment, you are very likely to hit this issue when rescheduling a pod with an azure disk mount from one VM to another. Would you try a StatefulSet again this time?

jalberto commented 6 years ago

@andyzhangx I am using the official helm chart, which uses a Deployment. I know a StatefulSet is the better option, but the root problem is not Deployment vs StatefulSet; it is the attach/detach disk inconsistency on the Azure side.

I have exactly the same problem with a StatefulSet. Asking me to move a prod LB from the official helm chart to a custom StatefulSet cannot be a solution for this, not even a temporary one.

This happens every time any of my pods needs to move, and most of my pods with volumes are StatefulSets. The problem is not there, but in Azure.

The PowerShell command has worked this time, but I am not so sure about next time; this caused my company to be offline for 2h.

andyzhangx commented 6 years ago

One workaround is to let that helm chart use azure file; you may change the default storage class to azurefile if your helm chart is using the default storage class. Moving a pod with an azure disk mount from one node to another is really not a good solution, since attaching/detaching a disk from a VM sometimes takes minutes. Azure file won't have such an attach/detach issue.
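
A sketch of swapping the default class, assuming classes named "default" (azure disk) and "azurefile" already exist in the cluster (on older clusters the annotation may be storageclass.beta.kubernetes.io/is-default-class):

# See which class is currently marked (default)
kubectl get storageclass

# Unmark the azure-disk class and mark azurefile as the default
kubectl patch storageclass default -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl patch storageclass azurefile -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'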

andyzhangx commented 6 years ago

@jalberto BTW, what's your helm chart and other config? I may try to repro on my testing env if possible. Thanks.

jalberto commented 6 years ago

@andyzhangx thanks for your time.

andyzhangx commented 6 years ago

@jalberto Hi, I have tried in my testing env using a Deployment with an azure disk mount; when scheduling a pod from one node to another, it hits the following error for around 3 mins:

Multi-Attach error for volume "pvc-c048abf5-1189-11e8-846f-000d3af71b87" Volume is already exclusively attached to one node and can't be attached to another

But after 3 mins it recovers and shows the volume mounting successfully on the other node.

So for your issue, there is a possible mitigation (see the command sketch after this list):

  1. create new nodes via acs-engine scale up
  2. kubectl drain the old nodes (created before 2018.1.5) that have this attach error issue
  3. delete your old VMs
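
A rough command outline of those three steps (resource names, pool name, and counts are placeholders; acs-engine scale flags vary by version):

# 1. Add new agent nodes
acs-engine scale \
  --subscription-id xxx \
  --resource-group xxx \
  --location westeurope \
  --deployment-dir ./_output/xxx \
  --node-pool <pool-name> \
  --new-node-count <count> \
  --master-FQDN <master-fqdn>

# 2. Drain each old node
kubectl drain <old-node-name> --ignore-daemonsets --delete-local-data

# 3. Delete the old VMs
az vm delete -g xxx -n <old-node-vm-name> --yes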

jalberto commented 6 years ago

Closing even if this is not solved in any way, but there is a follow-up in #2567.

andyzhangx commented 6 years ago

The disk attach error is probably due to a lack of locking before detaching the disk; all the details are below. I would encourage using a fixed version:

1. disk attach error

Issue details:

In some corner cases (detaching multiple disks on a node simultaneously), when scheduling a pod with an azure disk mount from one node to another, there could be lots of disk attach errors (with no recovery) due to the disk not being released in time from the previous node. This issue is due to a lack of locking before the DetachDisk operation; there should actually be a central lock for both the AttachDisk and DetachDisk operations, so that only one AttachDisk or DetachDisk operation is allowed at a time.

The disk attach error could look like the following:

Cannot attach data disk 'cdb-dynamic-pvc-92972088-11b9-11e8-888f-000d3a018174' to VM 'kn-edge-0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

Related issues

Mitigation:

In Azure cloud shell, run:

$vm = Get-AzureRMVM -ResourceGroupName $rg -Name $vmname
Update-AzureRmVM -ResourceGroupName $rg -VM $vm -verbose -debug

In Azure CLI, run:

az vm update -g <group> -n <name>

Fix

k8s version | fixed version
v1.6        | no fix, since v1.6 does not accept any cherry-picks
v1.7        | 1.7.14
v1.8        | 1.8.9
v1.9        | 1.9.5
v1.10       | 1.10.0

jalberto commented 6 years ago

Thanks @andyzhangx, but this problem keeps happening in 1.9.6 deployed with 0.14.5; a simple upgrade can take up to 1h to complete.

andyzhangx commented 6 years ago

@jalberto is it still the disk attach error or some other error?

jalberto commented 6 years ago

Same error. In the old cluster it required manual intervention, as you described, each time it happened. In the new cluster, so far, it works, but it takes from 14 mins to 2h to fix itself; most of the time I don't have the patience and run the commands manually to fix it (14 min of downtime is too much).

andyzhangx commented 6 years ago

@jalberto The new cluster means v1.9.6? Could you paste that error msg again? And do you have a clean config to repro this issue? I would like to repro it in my env if it's after v1.9.5.

jalberto commented 6 years ago

yes 1.9.6 deployed with acs 0.14.5

you can check my apimodel in #2567

I cannot paste it now, but the error is the typical "Multi attach error"

jalberto commented 6 years ago

@andyzhangx, this is the error:

 Multi-Attach error for volume "pvc-3ca68971-35d7-11e8-a4e3-000d3a205895" Volume is already exclusively attached to one node and can't be attached to another
Unable to mount volumes for pod "vpn-openvpn-69bc585689-fz77t_default(d4a11d6b-3978-11e8-aa0a-000d3a205895)": timeout expired waiting for volumes to attach/mount for pod "default"/"vpn-openvpn-69bc585689-fz77t". list of unattached/unmounted volumes=[certs] 

32mins and counting :)