Blocking on Azure VM Creation

selshowk commented 3 years ago

I'm trying to launch VMs across AWS, GCP and Azure and have noted the following:

the AWS VM create function returns within a second and then I can wait on the VM by checking its state (as in the startup example) which takes ~15s.
the Azure VM creation call takes ~1m and then "waiting" on the VM state aftewards takes no time (because the create call essential blocks until the VM is ready).
I don't recall the GCP behavior but the VM startup time is closer to AWS than azure

I'm using cloudbridge inside an asyncio application so a ~1m+ blocking call is problematic for me. Looking at the examples on Azure's website I see the Azure API returns some kind of poller and the create_vm call in azure_client blocks on the .result().

Is it possible to not block here but simply return a VM identifier? I have not been able to find enough documentation on the underlying API to even know if this is an option but I wanted to ask.

nuwang commented 3 years ago

@selshowk This is interesting because we haven't actually tried to use cloudbridge in an async way before. Are you using a library to patch cloudbridge to being async? Even in AWS, there are a few places where a wait occurs: https://github.com/CloudVE/cloudbridge/blob/5820f681c33d74c93ede22e56d8c7a5f7b9f54e7/cloudbridge/providers/aws/services.py#L781 So if the create VM call returns in 1 second, then presumably, it's being monkeypatched away but for some reason, the Azure implementation isn't?

nuwang commented 3 years ago

Another thing we have been considering recently is to have all operations return a "background task", that can be polled and waited upon. One reason for this is that on AWS in particular, we've been experiencing some difficulties (see here for context: https://github.com/CloudVE/cloudbridge/pull/257) which we've worked around for now. If this were to be implemented, it should also make it easier to write cloudbridge code that's compatible with async applications, since you could always send the long-running operation to a background thread and not block the main event loop. However, this would be a fairly major interface change so I don't know whether there's enough bandwidth to accomplish it.

selshowk commented 3 years ago

Are you using a library to patch cloudbridge to being async?

No @nuwang I've written a wrapper that uses a threadpoolexecutor to call cloudbridge functions from async code. The code looks like this:

def asyncify(f):
    """ """

    @wraps(f)
    async def async_wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(None, partial(f, *args, **kwargs))
        return result

    return async_wrapper

I use this to wrap my own functions which call cloudbridge functions (e.g. to create a firewall then a VM, etc). The times I'm qouting above are happening inside the threads. Note that the cloudbridge azure vm creation code has a lot of serialized azure calls (creating NIC, etc) so this might be the cause of the slowdown. I have not tried to add timing information inside the cloudbrige functions to see which part blocks.

But my statement about AWS does not depend on anything async. The call to the AWS providers provider.compute.instances.create() function takes only ~1-2s on AWS so it returns before the VM is created (I presume). Here's some logging:

2020-11-25 15:33:55 BEGIN [_launch_instance]
2020-11-25 15:33:56 starting instance create
2020-11-25 15:33:58 done instance create; blocking until ready
2020-11-25 15:34:03 done instance 'wait_til_ready' 
2020-11-25 15:34:03 END [_launch_instance]
2020-11-25 15:34:03 METRIC Func [_launch_instance] took 8.041s      (created i-030a36247a7c6bf59)

Note that most of the 8s is in the 5s between finishing instance create and this function returning inst.wait_till_ready(). Because of this, I can get more parallelization by simply not calling wait_till_ready().

The same code running on azure gives:

2020-11-25 13:13:50 starting instance create
2020-11-25 13:15:14 done instance create; blocking until ready
2020-11-25 13:15:14 done instance 'wait_til_ready' 
2020-11-25 13:15:14 END [_launch_instance]
2020-11-25 13:15:14 METRIC Func [_launch_instance] took 84.394s     (created sheer-cb-202011-scheduler-89a3b7)

so the create call above takes ~1.5m and the wait_til_ready call takes no time.

selshowk commented 3 years ago

One reason for this is that on AWS in particular, we've been experiencing some difficulties (see here for context: #257) which we've worked around for now.

This issue seems to be related to why the AWS call takes so little time. It seems like AWS is returning as soon as it gets a reference to the instance, even if the instance is not ready. This is similar to the "background task" idea you mentioned but is happening at the provider level. This kind of behavior would be very useful for async applications. I suspect that it could already be achieved with Azure simply by not waiting on the result as you do here.

Maybe one way to do this is to not pass the "result()" to the AzureInstance but just the poller/future and then make self._vm a property wrapping that poller that resolves it (calls "result()" on it) when self._vm is accessed.

nuwang commented 3 years ago

@selshowk In the current implementation, if we don't return the result() from Azure, we would have to do something like what you suggested, treat the self._vm property as a lazy property which is waited for the first time it's accessed. We'd be happy to accept any changes there. Just wanted to additionally note that Azure is in fact a lot slower than AWS at VM launch operations, as can be seen by the test results here: https://travis-ci.com/github/CloudVE/cloudbridge

However, since the event loop is not being blocked since you're offloading everything to a background thread anyway, can you perhaps use multiple threads as an alternative?

selshowk commented 3 years ago

However, since the event loop is not being blocked since you're offloading everything to a background thread anyway, can you perhaps use multiple threads as an alternative?

That is what we're doing now but there's a worry that if we want to launch 100 instances at once this would mean spawning a large number of threads and there is some concern around this (passed on to me from someone with a lot more experience with threads and asyncio -- not something I've hit myself).

As we develop our application further I will see if this is a problem and then perhaps come back to the solution I mentioned above (if needed).

I am aware that AWS is much faster than azure but the issue is not so much that one is faster than the other than the fact that cloudbridge/aws returns immediately after requesting a VM (does not wait on any result or VM state) while for Azure it waits until the VM is ready. The former behavior is in general much easier to parallelize.

CloudVE / cloudbridge

Blocking on Azure VM Creation #262