Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License
1.92k stars 440 forks source link

TimerTrigger in Azure Function in Docker container on Kubernetes runs on every instance (.NET Core) #6015

Open MrImpossibru opened 4 years ago

MrImpossibru commented 4 years ago

TimerTrigger to run only once for all instances makes a lease in a BlobStorage. Name of the container depends on a Machine name in case of Kubernetes and it's unique for all instances. Because of that TimerTrigger runs on all instances at the same time.

As temporary solution I have to use env. variables WEBSITE_SITE_NAME and WEBSITE_INSTANCE_ID to make it work as expected, but maybe it's better to provide a way to override default container name for Azure Functions.

mhoeger commented 4 years ago

@divyagandhii - I feel like we faced a similar issue with Linux. If so, mind weighing in?

divyagandhisethi commented 4 years ago

@mhoeger - you are right. I fixed this for linux consumption here: https://github.com/Azure/azure-functions-host/blob/d571d25276c72778339c78d48c14c3c680d52b82/src/WebJobs.Script/Host/ScriptHostIdProvider.cs#L49-L53

We might have to do something similar for Kubernetes as well.

divyagandhisethi commented 4 years ago

cc @SatishRanjan

divyagandhisethi commented 4 years ago

cc @anirudhgarg

mhoeger commented 4 years ago

@divyagandhii - do you know what the status of this bug is?

esimkowitz commented 2 years ago

I am also still seeing this on the .NET 6 isolated runtime Docker image. I see that this has been open for two years, are there any updates? This is a blocking issue that prevents my team from properly scaling our timer-triggered functions since in order to prevent duplicate runs, we can only run one replica at a time…

anirudhgarg commented 2 years ago

Assigning to @JatinSanghvi to have a look,

MrImpossibru commented 2 years ago

@esimkowitz Last time I had this issue I refactored my service to use WebJobs SDK instead of Azure Functions. It uses twice less memory and works more predictable in a container in Kubernetes.

esimkowitz commented 2 years ago

Also to clarify, we use Kubernetes to host our Azure Functions so when I say instance/replica of the function app, I am referring to separate pods running the same function app container with the same storage account.

@MasterKolka unfortunately I don't think WebJobs is an option for us since we use the other triggering mechanisms of Azure Functions pretty heavily.

MrImpossibru commented 2 years ago

@esimkowitz As I remember the main problem is to give a different "instance name" for each pod (which is required to prevent multiple executions by different instances). With a webjobs sdk this could be solved in some way. And WebJobs SDK supports all Azure Functions triggers except http (which could be solved by using ASP Controllers). Azure Functions SDK is based on WebJobs SDK.

esimkowitz commented 2 years ago

@lpapudippu Any updates on the investigation here?

lpapudippu commented 2 years ago

@v-shenoy Please look in to it. cc: @raorugan

v-shenoy commented 2 years ago

Taking a look into this.

v-shenoy commented 2 years ago

This problem has had multiple issues created for it. Here's one by @TsuyoshiUshio. The work around he mentions (and I tested that it works) is to manually set the AzureFunctionsWebHost__hostid environment variable on the containers.

https://github.com/Azure/azure-functions-host/blob/d571d25276c72778339c78d48c14c3c680d52b82/src/WebJobs.Script/Host/ScriptHostIdProvider.cs#L49-L66

As Divya mentions, I could make changes here to include

else if (environment.IsContainer())
{
    hostId = <set-host-id-to-something>;
}

However, I am not sure what to use as hostId that will be same for all the containers running the image.

divyagandhisethi commented 2 years ago

cc @pragnagopa .

@v-shenoy The host id needs to be something unique per pod instance. For example - you can use the pod name or pod id. In other Linux skus, we use the container name.

divyagandhisethi commented 2 years ago

This problem has had multiple issues created for it. Here's one by @TsuyoshiUshio. The work around he mentions (and I tested that it works) is to manually set the AzureFunctionsWebHost__hostid environment variable on the containers.

https://github.com/Azure/azure-functions-host/blob/d571d25276c72778339c78d48c14c3c680d52b82/src/WebJobs.Script/Host/ScriptHostIdProvider.cs#L49-L66

As Divya mentions, I could make changes here to include

else if (environment.IsContainer())
{
    hostId = <set-host-id-to-something>;
}

However, I am not sure what to use as hostId that will be same for all the containers running the image.

Also, just checking environment.IsContainer() won't be sufficient. You need to explicitly check if we are in a managed kubernetes environment. There are helper functions already implemented for that.

v-shenoy commented 2 years ago

This problem has had multiple issues created for it. Here's one by @TsuyoshiUshio. The work around he mentions (and I tested that it works) is to manually set the AzureFunctionsWebHost__hostid environment variable on the containers. https://github.com/Azure/azure-functions-host/blob/d571d25276c72778339c78d48c14c3c680d52b82/src/WebJobs.Script/Host/ScriptHostIdProvider.cs#L49-L66

As Divya mentions, I could make changes here to include

else if (environment.IsContainer())
{
    hostId = <set-host-id-to-something>;
}

However, I am not sure what to use as hostId that will be same for all the containers running the image.

Also, just checking environment.IsContainer() won't be sufficient. You need to explicitly check if we are in a managed kubernetes environment. There are helper functions already implemented for that.

It doesn't have to be managed kubernetes, right? The issue still happens when deploying to a kubernetes cluster directly, using func kubernetes deploy

v-shenoy commented 2 years ago

cc @pragnagopa .

@v-shenoy The host id needs to be something unique per pod instance. For example - you can use the pod name or pod id. In other Linux skus, we use the container name.

@divyagandhisethi It already uses pod name attached, with a hash.

image image

Also, it shouldn't be unique per pod instance, right? There could be multiple pods running the same function image as @esimkowitz does (and also shown in the image). You want all of these pods to get the same host id for timer trigger.

JatinSanghvi commented 2 years ago

I agree with Vighnesh. The issue is caused because the function apps running inside the container are using machine name as host ID, which is unique for each pod. We are not looking for unique ID per pod here, but an ID that's common across all pod instances for the same function app image, but distinct from the ID used by pods belonging to a different function app image.

I am inclined to use hash generated from container image path (sans the tag name). This ensures that if container code is upgraded to point to new tag of the same container image, the newly created pods continue from the same checkpoint where previous pods had left. This is as not as critical for timer-trigger as it might be for triggers for tabular event sources.

It should probably also work if we use assembly name containing the functions as host ID as being done in DefaultHostIdProvider.

Feel free to add comment if you see any issues with using assembly name, or know of a better method for host ID computation.

esimkowitz commented 2 years ago

I have incorporated @v-shenoy's workaround of setting the hostId environment variable and have confirmed that it unblocks my scenario (in our case we deploy one function per namespace and the namespace names are consistent across clusters). Looking forward to your permanent solution!

JatinSanghvi commented 2 years ago

A permanent solution that requires change in host ID generation logic, will also be a change with potential for trigger misses. There will be users who are having just a single container up and running. If we change the host ID generation logic and user updates their function host version, their next app revision will start from a clean state instead of continuing from the previous state.

JatinSanghvi commented 2 years ago

On second thought, every new container should already be getting assigned a separate host name which does not get maintained between deployments. So we will not be breaking an existing working functionality in that sense.

esimkowitz commented 2 years ago

Maybe it would be useful to just document the host ID behavior publicly and let service devs pick a value they know will be consistent for their applications?

JatinSanghvi commented 2 years ago

Hi @esimkowitz, I am personally leaning towards just having a proper documentation in place. I did not find an environment variable inside running function app container that meets the requirement to be used as host ID. We thought of using 'container image path' but I don't think that information is available to be inserted automatically while composing Dockerfile, building the docker image or running the container. It will only need to be passed by user at some point, which is no less manual than setting hostid environment variable.

We also thought of having a logic in functions host similar to DefaultHostIdProvider but @v-shenoy pointed out that it won't work for non-.NET function apps.

v-shenoy commented 2 years ago

@divyagandhisethi

pragnagopa commented 2 years ago

Tagging @fabiocav for his input

pinkfloydx33 commented 4 months ago

So are we supposed to set hostid to something static when running in Kubernetes?

If I have a Deployment containing a function app and it is scaled (by Keda, or manually) to 2 instances, should the hostid be the same for both replicas of the application?

I've tried doing this and it seems to work fine, but it's really unclear from the documentation and the thread above. I'll note that when doing this, there will be one host file created in blob storage, and the metadata on that file will contain a FunctionInstance property, which I suppose is doubly confusing since there's two instances/replicas using this hostid. I did notice that if you scale the application to 0 and then back up again, the FunctionInstance metadata will be updated to reflect some new instance.

I'm not sure if this is an indicator of the fact that it's the wrong thing to be doing. Can anyone clarify the proper thing to do?