apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Ceph RBD primary storage fails connection and renders node unusable #2611

Closed giorgiomassar8 closed 2 years ago

giorgiomassar8 commented 6 years ago
ISSUE TYPE
COMPONENT NAME

Cloudstack agent

CLOUDSTACK VERSION

4.11

CONFIGURATION

KVM cluster with Ceph-backed RBD primary storage

OS / ENVIRONMENT

Ubuntu 16.04 / 14.04

SUMMARY

On a perfectly working 4.10 node with a KVM hypervisor and Ceph RBD primary storage, after upgrading to 4.11 the CloudStack agent is unable to connect the RBD pool in libvirt, giving just a generic "operation not supported" error in its logs:

2018-04-06 16:27:37,650 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 16:27:37,652 WARN [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Storage pool be80af6a-7201-3410-8da4-9b3b58c4954f was not found running in libvirt. Need to create it.
2018-04-06 16:27:37,653 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Didn't find an existing storage pool be80af6a-7201-3410-8da4-9b3b58c4954f by UUID, checking for pools with duplicate paths
2018-04-06 16:27:37,664 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:91b4e1df) Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported
2018-04-06 16:27:42,762 INFO [cloud.agent.Agent] (Agent-Handler-4:null) (logid:) Lost connection to the server. Dealing with the remaining commands...

Exactly the same pool was working before the upgrade:

2018-04-06 12:53:52,847 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Found existing defined storage pool be80af6a-7201-3410-8da4-9b3b58c4954f, using it.
2018-04-06 12:53:52,850 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:14dace5e) Trying to fetch storage pool be80af6a-7201-3410-8da4-9b3b58c4954f from libvirt
2018-04-06 12:53:53,171 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:14dace5e) Proccess agent ready command, agent id = 46

STEPS TO REPRODUCE

Take an existing, working CloudStack 4.10 cluster with RBD primary storage and Ubuntu 14.04-based agents, and upgrade the agents to version 4.11.

EXPECTED RESULTS

The cluster should keep working: the agents should connect and the RBD pool should be opened correctly in libvirt.

ACTUAL RESULTS

The CloudStack agent fails to start with a generic "Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported" error and loops in a failed state, rendering the machine unusable.

WORKAROUND

To work around the issue I tried to define the pool directly in libvirt using the following XML config (dumped from another node where it is running correctly), and it worked as expected:

<pool type="rbd">
  <name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
  <uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
  <source>
    <name>cephstor1</name>
    <host name='storagepool1' port='6789'/>
    <auth username='admin' type='ceph'>
      <secret uuid='OMITTED'/>
    </auth>
  </source>
</pool>

virsh pool-define test.xml
Pool be80af6a-7201-3410-8da4-9b3b58c4954f defined from test.xml

root@compute6:~# virsh pool-start be80af6a-7201-3410-8da4-9b3b58c4954f
Pool be80af6a-7201-3410-8da4-9b3b58c4954f started

root@compute6:~# virsh pool-info be80af6a-7201-3410-8da4-9b3b58c4954f
Name:           be80af6a-7201-3410-8da4-9b3b58c4954f
UUID:           be80af6a-7201-3410-8da4-9b3b58c4954f
State:          running
Persistent:     yes
Autostart:      no
Capacity:       10.05 TiB
Allocation:     2.22 TiB
Available:      2.71 TiB

And now the CloudStack agent starts correctly:

2018-04-09 10:29:19,989 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-04-09 10:29:19,990 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Found existing defined storage pool be80af6a-7201-3410-8da4-9b3b58c4954f, using it.
2018-04-09 10:29:19,991 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:f0021131) Trying to fetch storage pool be80af6a-7201-3410-8da4-9b3b58c4954f from libvirt
2018-04-09 10:29:20,372 INFO [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:f0021131) Proccess agent ready command, agent id = 56

rohityadavcloud commented 6 years ago

I've marked this as a blocker applicable to Ceph users; we'll try to get this fixed. @giorgiomassar8 can you point to any documentation on how to set up a small Ceph storage pool to test this against KVM?

wido commented 6 years ago

Can you increase the logging to DEBUG on the Agent and show the XML which the Agent generates to define the storage pool?

The error you are seeing comes from libvirt+librados underneath so I'd need to see the XML it generates.

giorgiomassar8 commented 6 years ago

I have set debugging on one of the agents and this is what I get:

2018-05-02 13:41:44,401 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f (RBD) in libvirt
2018-05-02 13:41:44,401 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:d06ebad5) Looking for libvirtd connection at: qemu:///system
2018-05-02 13:41:44,403 WARN  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Storage pool be80af6a-7201-3410-8da4-9b3b58c4954f was not found running in libvirt. Need to create it.
2018-05-02 13:41:44,403 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Didn't find an existing storage pool be80af6a-7201-3410-8da4-9b3b58c4954f by UUID, checking for pools with duplicate paths
2018-05-02 13:41:44,404 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Checking path of existing pool 9fb695bd-a47f-45d3-9fe7-f029dff431d8 against pool we want to create
2018-05-02 13:41:44,408 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Attempting to create storage pool be80af6a-7201-3410-8da4-9b3b58c4954f
2018-05-02 13:41:44,408 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) <pool type='rbd'>
<name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
<uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
<source>
<host name='storagepool1' port='6789'/>
<name>cephstor1</name>
</source>
</pool>

2018-05-02 13:41:44,415 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:d06ebad5) Failed to create RBD storage pool: org.libvirt.LibvirtException: failed to connect to the RADOS monitor on: storagepool1:6789,: Operation not supported
2018-05-02 13:41:44,415 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:d06ebad5) Seq 1-6775102689425489922:  { Ans: , MgmtId: 29010072043598, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.UnsupportedAnswer":{"result":false,"details":"Unsupported command issued: com.cloud.agent.api.ModifyStoragePoolCommand.  Are you sure you got the right type of server?","wait":0}}] }
2018-05-02 13:41:44,451 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-4:null) (logid:d06ebad5) Execution is successful.
2018-05-02 13:41:44,453 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) (logid:d06ebad5) Watch Sent: Seq 1-6775102689425489921:  { Ans: , MgmtId: 29010072043598, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"details":"","wait":0}}] }
2018-05-02 13:41:44,464 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/10.20.0.21,port=8250,localport=45374] closed on read.  Probably -1 returned: Connection closed with -1 on reading size.
2018-05-02 13:41:44,464 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/10.20.0.21,port=8250,localport=45374]
2018-05-02 13:41:44,464 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Clearing watch list: 2
2018-05-02 13:41:46,728 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Connected to the server
2018-05-02 13:41:49,465 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Lost connection to the server. Dealing with the remaining commands...

The generated XML is missing the secret part; perhaps the agent is trying to use another authentication mechanism which this version of libvirt does not support?

This is what I have been using as a workaround, and it works:

<pool type="rbd">
  <name>be80af6a-7201-3410-8da4-9b3b58c4954f</name>
  <uuid>be80af6a-7201-3410-8da4-9b3b58c4954f</uuid>
  <source>
    <name>cephstor1</name>
    <host name='storagepool1' port='6789'/>
    <auth username='admin' type='ceph'>
      <secret uuid='OMITTED'/>
    </auth>
  </source>
</pool>
wido commented 6 years ago

Ok, I see. The "operation not supported" error is odd, but the XML does indeed seem to be missing the cephx part.

At the moment I do not have a 4.11 environment to test this with, but I'll look at the code later.

nathanejohnson commented 6 years ago

@wido The issue seems to be that for any KVM host, the disk format is always hard coded to QCOW2 here:

https://github.com/apache/cloudstack/blob/master/engine/orchestration/src/main/java/org/apache/cloudstack/engine/orchestration/VolumeOrchestrator.java#L768

This gets sent along to the agent. I guess this is a reasonable default for KVM, but obviously for RBD it needs to be RAW, and the agent won't know that until it looks up the storage pool. In the short term, I think hard-coding the disk format to RAW on the agent side for RBD is valid (it can't be anything but that), but it might also be good for the management server to send along the correct disk type.
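
A minimal, self-contained sketch of the agent-side guard being suggested here (illustrative names only, with simplified stand-in enums rather than the actual CloudStack types):

// Hypothetical sketch, not the actual CloudStack code: derive the on-disk format
// from the pool type instead of trusting the QCOW2 default sent by the
// management server. RBD can only hold RAW images.
public class DiskFormatSketch {
    enum PoolType { RBD, NETWORK_FILESYSTEM, LVM }
    enum DiskFormat { RAW, QCOW2 }

    static DiskFormat resolveFormat(PoolType poolType, DiskFormat requested) {
        if (poolType == PoolType.RBD) {
            return DiskFormat.RAW;   // RBD cannot store QCOW2 volumes
        }
        return requested;            // keep the management server's choice elsewhere
    }

    public static void main(String[] args) {
        // A QCOW2 request against an RBD pool is corrected to RAW.
        System.out.println(resolveFormat(PoolType.RBD, DiskFormat.QCOW2)); // RAW
    }
}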

nathanejohnson commented 6 years ago

On second thought, this might be a separate bug we were chasing. We were seeing this error in the logs:

2018-05-03 09:58:03,932 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:fb29bd87) Attempting to create volume f2d800a0-fc19-4b41-b886-575af6742425 (RBD) in pool e485c675-896a-303e-be82-c42357c6b4c1 with size 5368709120
2018-05-03 09:58:03,932 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:fb29bd87) <volume>
<name>f2d800a0-fc19-4b41-b886-575af6742425</name>
<capacity>5368709120</capacity>
<target>
<format type='qcow2'/>
<permissions><mode>0744</mode></permissions></target>
</volume>

2018-05-03 09:58:03,932 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-3:null) (logid:fb29bd87) Failed to create volume: 
com.cloud.utils.exception.CloudRuntimeException: org.libvirt.LibvirtException: unsupported configuration: only RAW volumes are supported by this storage pool
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createPhysicalDiskByLibVirt(LibvirtStorageAdaptor.java:705)
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createPhysicalDisk(LibvirtStorageAdaptor.java:664)
at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.createPhysicalDisk(LibvirtStoragePool.java:115)
at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.createVolume(KVMStorageProcessor.java:1323)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:119)
at com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:58)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStorageSubSystemCommandWrapper.execute(LibvirtStorageSubSystemCommandWrapper.java:36)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStorageSubSystemCommandWrapper.execute(LibvirtStorageSubSystemCommandWrapper.java:30)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:75)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1438)
at com.cloud.agent.Agent.processRequest(Agent.java:563)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:976)
at com.cloud.utils.nio.Task.call(Task.java:83)
at com.cloud.utils.nio.Task.call(Task.java:29)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

@giorgiomassar8 were you seeing similar messages? We were receiving these messages on a freshly created pool when attempting to deploy a VM.

wido commented 6 years ago

@kiwiflyer Have you seen this one? The Agent is not providing the secret for the RBD storage pool to the libvirt XML definition.

GabrielBrascher commented 6 years ago

I will see if I can deploy a test environment to reproduce this issue meanwhile.

giorgiomassar8 commented 6 years ago

@nathanejohnson nope, I don't see qcow2/raw mismatch errors here. I believe the issue is that, for some reason, no authentication information is put in the pool XML sent to libvirt.

nathanejohnson commented 6 years ago

@giorgiomassar8 here are the logs from when the rbd pool was configured on our host:

2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Request:Seq 1-8891231564336201737:  { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CreateStoragePoolCommand":{"_createDatastore":false,"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:supersecretkeygoeshere==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }
2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Processing command: com.cloud.agent.api.CreateStoragePoolCommand
2018-05-02 12:07:01,225 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:328be924) Seq 1-8891231564336201737:  { Ans: , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"details":"success","wait":0}}] }
2018-05-02 12:07:01,313 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:328be924) Request:Seq 1-8891231564336201738:  { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.ModifyStoragePoolCommand":{"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:supersecretkeygoeshere==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }
2018-05-02 12:07:01,313 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:328be924) Processing command: com.cloud.agent.api.ModifyStoragePoolCommand
2018-05-02 12:07:01,313 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Attempting to create storage pool e485c675-896a-303e-be82-c42357c6b4c1 (RBD) in libvirt
2018-05-02 12:07:01,313 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:328be924) Looking for libvirtd connection at: qemu:///system
2018-05-02 12:07:01,319 WARN  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Storage pool e485c675-896a-303e-be82-c42357c6b4c1 was not found running in libvirt. Need to create it.
2018-05-02 12:07:01,319 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Didn't find an existing storage pool e485c675-896a-303e-be82-c42357c6b4c1 by UUID, checking for pools with duplicate paths
2018-05-02 12:07:01,319 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Checking path of existing pool d358f5c2-f5fd-43a8-844e-cf01183de220 against pool we want to create
2018-05-02 12:07:01,322 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Attempting to create storage pool e485c675-896a-303e-be82-c42357c6b4c1
2018-05-02 12:07:01,323 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) <secret ephemeral='no' private='no'>
<uuid>e485c675-896a-303e-be82-c42357c6b4c1</uuid>
<usage type='ceph'>
<name>acs411testing@192.168.100.100:6789/acs411testing</name>
</usage>
</secret>

2018-05-02 12:07:01,388 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) <pool type='rbd'>
<name>e485c675-896a-303e-be82-c42357c6b4c1</name>
<uuid>e485c675-896a-303e-be82-c42357c6b4c1</uuid>
<source>
<host name='192.168.100.100' port='6789'/>
<name>acs411testing</name>
<auth username='acs411testing' type='ceph'>
<secret uuid='e485c675-896a-303e-be82-c42357c6b4c1'/>
</auth>
</source>
</pool>

2018-05-02 12:07:01,434 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Trying to fetch storage pool e485c675-896a-303e-be82-c42357c6b4c1 from libvirt
2018-05-02 12:07:01,434 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:328be924) Looking for libvirtd connection at: qemu:///system
2018-05-02 12:07:01,439 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:328be924) Succesfully refreshed pool e485c675-896a-303e-be82-c42357c6b4c1 Capacity: 11413921890304 Used: 0 Available: 10496547282944

do you see the key go across before it attempts to create the libvirt storage pool?

kiwiflyer commented 5 years ago

So working with a user, it does appear that the original issue outlined by @giorgiomassar8 is different from the one we found. The user in question is running Ubuntu 16.04 and no authentication is being passed to libvirt.

(screenshot attached)

This particular installation is a 4.9.3 -> 4.11.1 upgrade.

kiwiflyer commented 5 years ago

(screenshot attached: rbd)

wido commented 5 years ago

Are there by any chance slashes (/) in the secret of this RBD pool?

kiwiflyer commented 5 years ago

Funny you should mention that. There is a single slash and I did notice a very old open bug report on jira for that issue. This user had upgraded a prod system, so I had a very limited window to take a look.


wido commented 5 years ago

Could you check that for me? I have seen issues with slashes in the secret combined with URL parsing inside Java.

Otherwise I wouldn't know where this is coming from.
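
A small self-contained demonstration of the kind of Java URL-parsing problem being referred to, assuming the cephx secret travels as the user-info part of an rbd:// URL (the monitor host, user, and secret values below are made up):

// Illustration only, not CloudStack code: a '/' inside the secret ends the
// authority part of the URI early, so user info and host are lost.
import java.net.URI;

public class SlashInSecret {
    public static void main(String[] args) throws Exception {
        // Secret without a slash: user info and host parse as expected.
        URI ok = new URI("rbd://admin:AQDk3ela0NMAMxAA@mon1:6789/cephstor1");
        System.out.println(ok.getUserInfo()); // admin:AQDk3ela0NMAMxAA
        System.out.println(ok.getHost());     // mon1

        // Secret containing '/': everything after the first slash is treated
        // as the path, and the credentials can no longer be recovered.
        URI broken = new URI("rbd://admin:AQDk3e/a0NMAMxAA@mon1:6789/cephstor1");
        System.out.println(broken.getUserInfo()); // null
        System.out.println(broken.getHost());     // null
    }
}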

kiwiflyer commented 5 years ago

The pool secret apparently does not have any slashes. The encrypted version in the DB does, but obviously that shouldn't matter. I'm trying to work back through the code to see what might have changed between 4.9.x and 4.11.x. The secret is set correctly in the 4.9.3 release.

kiwiflyer commented 5 years ago

ok... well, it's definitely not decrypting the userInfo before it gets sent to the agent in the upgrade case above. Here's our 4.11 lab with debug enabled. I restarted libvirtd to remove the storage pool, then restarted the cloudstack-agent. The agent then tries to add the storage pool back:

2018-11-21 10:54:18,733 DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:null) (logid:3144d09f) Request:Seq 1-5622462659795353602: { Cmd , MgmtId: 14038005752684, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.ModifyStoragePoolCommand":{"add":true,"pool":{"id":3,"uuid":"e485c675-896a-303e-be82-c42357c6b4c1","host":"192.168.100.100","path":"acs411testing","userInfo":"acs411testing:AQDk3ela0NMAMxAA-4BkLN20KPNiEm2aJ8WVAg==","port":6789,"type":"RBD"},"localPath":"/mnt//dcf0b890-984a-3c12-9190-050f3294a590","wait":0}}] }

kiwiflyer commented 5 years ago

In the Ubuntu case above, it looks like the userInfo field carries the encrypted db payload rather than the decrypted version with the colon delimiter that the code expects here: https://github.com/apache/cloudstack/blob/4809fe7cb434010adc776a0da40cdf37848cc26e/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L245, so NULL gets returned and then libvirt throws an exception.
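
A rough sketch of the shape of that check (illustrative only, not the exact code at the link; the encrypted payload in main() is a made-up placeholder):

// If userInfo does not split into "user:secret", no <auth>/<secret> section is
// added to the pool XML, and libvirt then fails to connect to the monitors.
import java.util.Arrays;

public class UserInfoSplit {
    static String[] parseUserInfo(String userInfo) {
        if (userInfo == null || !userInfo.contains(":")) {
            return null; // no usable credentials -> pool XML built without auth
        }
        return userInfo.split(":", 2); // [0] = cephx user, [1] = secret
    }

    public static void main(String[] args) {
        // Decrypted form, as sent by a healthy management server.
        System.out.println(Arrays.toString(parseUserInfo("acs411testing:AQDk3ela0NMAMxAA")));
        // Hypothetical still-encrypted jasypt payload with no delimiter: prints null.
        System.out.println(Arrays.toString(parseUserInfo("4fJ7kQ9pXzR2mL8w")));
    }
}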

kiwiflyer commented 5 years ago

Note that my test lab is CentOS 7.

kiwiflyer commented 5 years ago

@wido or @GabrielBrascher , do either of you guys have both an older 4.9/4.10 lab as well as a 4.11 lab on Ubuntu 16.04?

It would be interesting to try a manual decrypt of the user_info field from the storage_pool data using a 4.9.x payload, but performing the decrypt on a 4.11 installation.

e.g. java -classpath /usr/share/cloudstack-common/lib/jasypt-1.9.2.jar org.jasypt.intf.cli.JasyptPBEStringDecryptionCLI decrypt.sh input=encrypted_payload_from_user_info password=encrypt_password verbose=true
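
For reference, a hedged Java equivalent of that CLI check (requires the same jasypt jar on the classpath; jasypt's default PBEWithMD5AndDES algorithm is assumed, so a non-default setup would also need setAlgorithm()):

// Usage (assumed): java -cp .:/usr/share/cloudstack-common/lib/jasypt-1.9.2.jar DecryptUserInfo <key> <encrypted_user_info>
import org.jasypt.encryption.pbe.StandardPBEStringEncryptor;

public class DecryptUserInfo {
    public static void main(String[] args) {
        StandardPBEStringEncryptor encryptor = new StandardPBEStringEncryptor();
        encryptor.setPassword(args[0]);                 // management server's encryption key
        String decrypted = encryptor.decrypt(args[1]);  // encrypted user_info value from the DB
        System.out.println(decrypted);                  // expected to look like "cephxuser:base64secret=="
    }
}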

wido commented 5 years ago

@kiwiflyer All our test and production systems are 4.11 at the moment running on Ubuntu 16.04

They were all upgraded from 4.9 to 4.10 and then to 4.11. In none of those cases did we run into this.

rohityadavcloud commented 5 years ago

@wido @kiwiflyer is this still a bug with respect to 4.11.2.0 or the upcoming 4.11.3.0/4.12.0.0?

wido commented 5 years ago

I have not been able to verify this bug, so I can't tell you.

kiwiflyer commented 5 years ago

I saw this problem in an ACS mailing list user's system after an attempted upgrade from 4.9.3 -> 4.11.1. Unfortunately, the bug was in production, so I only had an extremely limited window to take a look before they restored their previous 4.9.3 version. At the present time, I'm not sure what caused it and I haven't been able to reproduce it in a lab environment.
This definitely wasn't a one-off situation, as they attempted the upgrade multiple times with the same result.

kiwiflyer commented 5 years ago

So, @nathanejohnson and I just started some upgrade testing on some lab environments going from our 4.8 production branch to 4.12 and we may have just run into this same bug. We'll triage it and update everyone early next week.

nathanejohnson commented 5 years ago

@wido I can second what @kiwiflyer mentioned earlier: the reason pools aren't being created is that the user_info is passed along encrypted; the split doesn't find a colon, so it skips the secret section of the pool XML. It looks like the only time it's sent along decrypted is when the pool is initially created.

Obviously I'm missing something: I can't find anything significant that differs between 4.8 and 4.12 in how this logic is handled, but it's now failing consistently in our lab environment. I'd love some ideas on troubleshooting.

nathanejohnson commented 5 years ago

As a follow-up, I cannot reproduce this in our freshly installed lab, only in the one we upgraded. The userInfo comes across decrypted in the fresh install.

wido commented 5 years ago

This seems like a management-server issue where it doesn't send the data properly, because I can't think of a reason why this would happen on the Agent/KVM side.

I honestly have no idea.

nathanejohnson commented 5 years ago

I narrowed the issue down to the db.properties file switching, at some point during the upgrade process, from using encryption to not using it, so the user_info field was stored encrypted in the db but was then sent to the agent without being decrypted. I see in the spec file that db.properties is listed as config(noreplace), but I think in some situations it will still end up overwritten. At one point during the upgrade I completely removed the RPMs and had to downgrade; even though I left the /etc/cloudstack/management directory in place, I think that when I then reinstalled the RPMs it overwrote db.properties, since the file wasn't marked as owned by a previous installation, if that makes sense.

At any rate, I suspect something similar might have occurred with the original bug. I'm going to try upgrading another cluster, paying attention to the db.properties file and see if this happens again.

GabrielBrascher commented 5 years ago

@nathanejohnson any update on this one? I am not sure if this issue is a blocker for 4.12, considering that not all contributors were able to reproduce it (@wido was able to upgrade from 4.10 -> 4.11 -> 4.12).

GabrielBrascher commented 5 years ago

We are still unable to reproduce this issue. Thus, I am removing this as a Blocker for 4.12.

nathanejohnson commented 5 years ago

Sorry for the late reply, I have not had a chance to upgrade our other development cluster, but I agree this shouldn't be a blocker.

rohityadavcloud commented 5 years ago

@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido is this still an issue? Should we close it, or is there a PR coming?

rohityadavcloud commented 5 years ago

@nathanejohnson @giorgiomassar8 @GabrielBrascher @wido kindly discuss and advise a milestone in case a PR is coming soon. I've removed this issue from 4.13.0.0.

DaanHoogland commented 2 years ago

@giorgiomassar8 there's been no activity on this for two years. I'm closing the ticket; please reopen it if you feel it still should be open.

xuanyuanaosheng commented 2 years ago

Any date? https://github.com/apache/cloudstack/issues/5741