cloudsoft / jclouds-vcloud-director


When firing several blueprints at once, VCD responds with a "duplicate name" (400) error #42

Open bostko opened 8 years ago

bostko commented 8 years ago

Environment

Steps to reproduce: Launch Apache Brooklyn with jclouds-vcloud-director and deploy 10 blueprints simultaneously.

Observed behaviour: The debug logs show that when those 10 provisioning operations were triggered, the last ~5 of them failed on POST /vdc/{id}/action/composeVApp with OPERATION_LIMITS_EXCEEDED.

The logs also show that between sending the composeVApp request and receiving its response, 3 or 4 task status checks were made (GET /task/{taskId}). After a response to one of those GET task requests, the composeVApp response is returned, which says:

<Error xmlns="http://www.vmware.com/vcloud/v1.5" minorErrorCode="OPERATION_LIMITS_EXCEEDED" message="[ {{requestId}} ] The maximum number of simultaneous operations for user &quot;{{User}}&quot; on organization &quot;{{Organization}}&quot; has been reached." majorErrorCode="400" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.vmware.com/vcloud/v1.5
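A minimal sketch of how a client could recognise this particular rejection, assuming the error body is available as a string. The class and method names here (`VcdErrorInspector`, `minorErrorCode`, `isOperationLimitsExceeded`) are hypothetical and not part of jclouds-vcloud-director; only JDK XML classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper; not part of jclouds-vcloud-director. */
final class VcdErrorInspector {

    static final String OPERATION_LIMITS_EXCEEDED = "OPERATION_LIMITS_EXCEEDED";

    /** Returns the minorErrorCode attribute of the root element, or "" if it is absent. */
    static String minorErrorCode(String errorXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(errorXml)))
                    .getDocumentElement();
            return root.getAttribute("minorErrorCode");
        } catch (Exception e) {
            // Malformed or truncated bodies are reported as bad input rather than guessed at.
            throw new IllegalArgumentException("Could not parse vCloud error body", e);
        }
    }

    /** True when the server rejected the call because of the simultaneous-operations limit. */
    static boolean isOperationLimitsExceeded(String errorXml) {
        return OPERATION_LIMITS_EXCEEDED.equals(minorErrorCode(errorXml));
    }
}
```

Checking the `minorErrorCode` rather than the 400 status alone lets the caller tell this case apart from other bad-request errors, such as the duplicate name one.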

In the vCloud Director web console, I see that the vApps from the failed composeVApp requests are left over in a pending state. They stay pending forever (neither created nor started), and I assume no resources or VMs are allocated for them.

Because of the retry logic for such responses, implemented previously in https://github.com/cloudsoft/jclouds-vcloud-director/pull/41, jclouds-vcloud-director issues a second composeVApp request, which then fails with a duplicate vApp name error.

Expected behaviour: vCloud Director should not leave operations half done when it rejects them. Most business applications take care to fully abort unsuccessful operations.

Possible Workarounds that could be implemented in jclouds-vcloud-director

aledsage commented 8 years ago

It seems that the core problem is in VMware's vCloud Director implementation. We issue a POST /vdc/{id}/action/composeVApp and are rate-limited (getting back OPERATION_LIMITS_EXCEEDED), and yet VMware has partially executed the command! One would expect any sensible rate-limiting would either accept or reject the command, rather than partially executing it and then saying that the operation wasn't allowed!

Perhaps we should not think of VMware's response as rate-limiting. Instead, perhaps we should think of it as VMware saying "I am unwilling to finish executing your request at this time due to excessive activity. I may or may not have partially executed your request before deciding that you weren't allowed to make it; your system may be in an unexpected state (e.g. resources partially created, and/or stuck in a "pending" state); it is your responsibility to check what state the system is now in, and to do any rollback required (e.g. try to delete the partially created resources); you can then retry. However, I may reject these subsequent calls as well (potentially having partly executed them), if there is excessive activity at that time."

Clearly it is hard to program against an API with these semantics!
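To make the "check state, roll back, then retry" reading concrete, here is a rough sketch under several assumptions. `VAppClient`, `OperationLimitsExceededException` and `composeWithRollback` are hypothetical names for illustration only, not the jclouds-vcloud-director API. The sketch also assumes that a leftover pending vApp can be found by name and deleted, which this thread has not confirmed.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these types exist in jclouds-vcloud-director. */
final class ComposeWithRollback {

    /** Assumed signal that the server answered OPERATION_LIMITS_EXCEEDED. */
    public static final class OperationLimitsExceededException extends RuntimeException {
        public OperationLimitsExceededException(String message) {
            super(message);
        }
    }

    /** Assumed minimal view of the operations needed for the rollback. */
    public interface VAppClient {
        String composeVApp(String name);                 // returns the id of the new vApp
        Optional<String> findVAppIdByName(String name);  // empty if no vApp has that name
        void deleteVApp(String id);
    }

    /** Tries to compose a vApp, deleting any leftover with the same name before each retry. */
    static String composeWithRollback(VAppClient client, String name,
                                      int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        OperationLimitsExceededException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.composeVApp(name);
            } catch (OperationLimitsExceededException e) {
                last = e;
                // The rejected request may still have created a pending vApp under this name;
                // remove it so the next attempt does not hit a duplicate name error.
                client.findVAppIdByName(name).ifPresent(client::deleteVApp);
                if (attempt < maxAttempts) {
                    sleep(backoffMillis * attempt); // linear backoff between attempts
                }
            }
        }
        throw last; // every attempt was rejected
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

The sketch does not handle the cleanup calls themselves being rejected for the same reason, which the semantics described above allow; a real implementation would need its own retry or give-up policy for the lookup and delete steps.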