Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License
1.92k stars 439 forks

Improve file change detection to be more reliable #1351

Open ahmelsayed opened 7 years ago

ahmelsayed commented 7 years ago

It appears that file change notifications are highly unreliable, especially with Azure Files. The UI has a lot of issues when the runtime misses file change notifications for whatever reason. Maybe periodic polling of the file system for changes, or an API to force a host restart, could help users force a recompile or reload of their functions.

christopheranderson commented 7 years ago

@ahmelsayed - Do you have any specific scenarios where this happens frequently? We've investigated these issues in the past and it can be hard to reproduce reliably to test. We have an improvement to try to centralize and improve these notifications.

ahmelsayed commented 7 years ago

No specific scenarios. We just know that it happens and, as you said, it is very hard to reproduce or track down. It usually shows up with changes to run.csx that require a recompilation.

ahmelsayed commented 7 years ago

clearing milestone to discuss again in triage

christopheranderson commented 7 years ago

Should split this up into the improvements we decide to go with.

brettsam commented 5 years ago

Moving this to Triaged. Just had a CRI where almost all instances for a site went down for a couple of reasons:

This behavior also causes the data role to actively refuse scale-out requests as it looks like no functions are running so it doesn't want to overprovision.

I agree with @ahmelsayed's original suggestion -- our FileSystemWatcher code should be more resilient. Maybe maintain an internal list of files and periodically check by itself and fire events, rather than relying on the OS events?
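
A minimal sketch of the "maintain an internal list of files and periodically check" idea, written in Python only for illustration. It is not the host's FileSystemWatcher code, and every name in it (`take_snapshot`, `diff_snapshots`, `PollingWatcher`) is hypothetical. It assumes that comparing modification time and size is enough to notice a change.

```python
import os
from typing import Callable, Dict, List, Tuple

# Hypothetical types and names; this is an illustration, not the host's code.
Snapshot = Dict[str, Tuple[int, int]]  # path -> (mtime in ns, size in bytes)
Event = Tuple[str, str]                # (kind, path)


def take_snapshot(root: str) -> Snapshot:
    """Walk `root` and record modification time and size for every file."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            snapshot[path] = (stat.st_mtime_ns, stat.st_size)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Event]:
    """Compare two snapshots and return created/deleted/changed events."""
    events: List[Event] = []
    for path in new.keys() - old.keys():
        events.append(("created", path))
    for path in old.keys() - new.keys():
        events.append(("deleted", path))
    for path in old.keys() & new.keys():
        if old[path] != new[path]:
            events.append(("changed", path))
    return sorted(events)


class PollingWatcher:
    """Fires events from periodic snapshots instead of OS notifications."""

    def __init__(self, root: str, on_event: Callable[[str, str], None]) -> None:
        self._root = root
        self._on_event = on_event
        self._snapshot = take_snapshot(root)

    def poll_once(self) -> List[Event]:
        """Take a fresh snapshot, fire one callback per difference."""
        current = take_snapshot(self._root)
        events = diff_snapshots(self._snapshot, current)
        self._snapshot = current
        for kind, path in events:
            self._on_event(kind, path)
        return events
```

Calling `poll_once()` on a timer would catch changes that an OS notification dropped, at the cost of a directory walk per interval. On a large share that cost may matter, so the interval would need tuning.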

Another option is to get more strict about our "Run from package" suggestion -- this case would have been prevented by this.

ankitkumarr commented 5 years ago

This may get pushed to another Sprint as I have some higher priority items for this Sprint. Please let me know if anyone has concerns. If so, please state them here and I can reprioritize this accordingly.

ahmelsayed commented 5 years ago

Dotnet file watchers don't work in a linux container without DOTNET_USE_POLLING_FILE_WATCHER=true. I'd close this issue.

ankitkumarr commented 5 years ago

@ahmelsayed, maybe this issue got a little sidetracked.

But file watchers have been a bit unreliable even on Windows, as @brettsam said above, especially in the case of app_offline.htm, and we are seeing a decent number of such cases.

Brett had suggested offline that maybe we should move FileWatcher from JobHost to WebHost level such that even if JobHost restarts due to a file change, the FileWatcher will not restart, and hopefully not miss any file events.

Current Scenario --

  1. File Watcher is running and deployment starts
  2. App_offline.htm is generated
  3. Files are updated
  4. Host restarts, file watcher shuts down
  5. Host looks at the app_offline.htm, declares that it's offline
  6. App_offline.htm is removed
  7. File Watcher starts
  8. Host stays offline

With the fix, it should be --

  1. File Watcher is running and deployment starts
  2. App_offline.htm is generated
  3. Files are updated
  4. Host restarts, but File Watcher continues to detect changes
  5. Host looks at the app_offline.htm, declares that it's offline
  6. App_offline.htm is removed
  7. WebHost catches it and restarts
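
The two sequences above can be sketched as a small simulation. This is a simplified, synchronous model written in Python for illustration. The function name `simulate_deployment` and its flag are hypothetical, and the real host's restart logic is more involved than this.

```python
def simulate_deployment(watcher_survives_restart: bool) -> bool:
    """Replay the deployment steps; return True if the host ends up online.

    `watcher_survives_restart` is False for the current behavior (watcher
    owned by the JobHost) and True for the proposed one (watcher owned by
    the WebHost). Hypothetical model, not the host's actual code.
    """
    # Step 1: file watcher is running and deployment starts.
    watcher_running = True

    # Step 2: app_offline.htm is generated.
    offline_file_exists = True

    # Step 3: files are updated, which triggers a host restart.
    # Step 4: host restarts; a JobHost-level watcher shuts down with it.
    watcher_running = watcher_survives_restart

    # Step 5: host looks at app_offline.htm and declares itself offline.
    host_offline = offline_file_exists

    # Step 6: app_offline.htm is removed.
    offline_file_exists = False
    if watcher_running:
        # Proposed step 7: the delete event is observed, so the host
        # restarts and re-checks the file, which is now gone.
        host_offline = offline_file_exists

    # Current step 7: the watcher only starts now, after the delete has
    # already happened, so no event is ever delivered for it.
    watcher_running = True

    # Current step 8: the host stays offline if nothing re-checked the file.
    return not host_offline
```

With `watcher_survives_restart=False` the function returns `False` (host stuck offline), and with `True` it returns `True`. That matches the difference between the two lists.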

fabiocav commented 4 years ago

Moving this to Triaged as I'm not sure we're tracking this work for sprint 60. Please adjust if needed.

brettsam commented 1 year ago

This issue still exists today. We can miss app_offline.htm delete events and leave a site offline until a restart, causing availability issues. Deploying via Run from Package fixes this, but not every site has moved there.

We need to revisit this. Just an initial thought (well, maybe not initial):

We already have a detector for this that suggests moving the deployment to Run from Package, but you almost need to get bitten by downtime to see it.