Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher.RaiseChangeEvents throws uncatchable IOException on thread pool thread

georg-jung commented 9 months ago

Description

I use Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher in polling mode. If polling fails with an IOException, it takes the process down, because the exception is thrown on a thread pool thread where it cannot be caught.

Related: #71003 #72462

Reading the related issues, it seems like @Jozkee is the right one to tag :-) cc: @danmoseley

Stacktrace

Unhandled exception. System.IO.IOException: Host is down : '/mnt/smbshare'
   at System.IO.Enumeration.FileSystemEnumerator`1.Init()
   at System.IO.DirectoryInfo.InternalEnumerateInfos(String path, String searchPattern, SearchTarget searchTarget, EnumerationOptions options)
   at Microsoft.Extensions.FileSystemGlobbing.Abstractions.DirectoryInfoWrapper.EnumerateFileSystemInfos()+MoveNext()
   at System.Collections.Generic.List`1.AddRange(IEnumerable`1 collection)
   at Microsoft.Extensions.FileSystemGlobbing.Internal.MatcherContext.Match(DirectoryInfoBase directory, String parentRelativePath)
   at Microsoft.Extensions.FileProviders.Physical.PollingWildCardChangeToken.CalculateChanges()
   at Microsoft.Extensions.FileProviders.Physical.PollingWildCardChangeToken.get_HasChanged()
   at Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher.RaiseChangeEvents(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

Details of my setup/environment

I don't think this issue is related to my setup/environment, just adding for completeness & reproducibility.

My code runs inside an ubuntu-chiseled-based docker container. On the unix container host, I have a CIFS mount defined in /etc/fstab like this:

//smbhost/smbshare$ /mnt/smbshare cifs credentials=/etc/smbcredentials_smbshare,iocharset=utf8,rw,file_mode=0777,dir_mode=0777 0 0

This mount is then further passed to the container:

worker:
  image: somecorp.azurecr.io/some-worker:latest
  read_only: true
  cap_drop:
    - ALL
  security_opt:
    - no-new-privileges:true
  environment:
    - DOTNET_USE_POLLING_FILE_WATCHER=1
  volumes:
    - /mnt/smbshare:/mnt/smbshare:Z

The mounted volume is only available at specific times though (say, 9 to 5 but not at night). Thus it will predictably go down and when it does it takes my worker process with it.

Reproduction Steps

Creation of my PhysicalFilesWatcher is quite straightforward:

_fileProvider = new PhysicalFileProvider(directoryToWatch);
_disposable = ChangeToken.OnChange(
    () => _fileProvider.Watch("**/*"),
    () =>
    {
        // Do something
    });

I have DOTNET_USE_POLLING_FILE_WATCHER=1 set.

Point this PhysicalFilesWatcher to a path that becomes unavailable; in my case, the host of a mounted network file system goes down. On the next poll, the process is terminated by the unhandled exception.

Expected behavior

PhysicalFilesWatcher does not throw any uncatchable exceptions.

One of:

IOExceptions are caught, as @danmoseley suggested, but in all relevant places.
There is an event or a similar interception possibility that can be used to handle these exceptions.

Actual behavior

PhysicalFilesWatcher takes processes down because it throws uncatchable exceptions.

Regression?

I don't think this is a regression.

Known Workarounds

Only: Don't use PhysicalFilesWatcher.
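
One way to avoid the watcher's own timer is to poll a token directly and catch the exception at the call site. The following is only a minimal sketch under assumptions: `ManualPolling`, `WatchAsync`, the callbacks and the 5-second delay are hypothetical names and values chosen for illustration, and it assumes the public `PollingWildCardChangeToken(string root, string pattern)` constructor from Microsoft.Extensions.FileProviders.Physical is acceptable to use here.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.FileProviders.Physical;

public static class ManualPolling
{
    // Hypothetical helper: polls `root` until cancellation is requested.
    // I/O failures are passed to `onError` instead of escaping on a timer thread.
    public static async Task WatchAsync(
        string root,
        Action onChange,
        Action<Exception> onError,
        CancellationToken ct)
    {
        var token = new PollingWildCardChangeToken(root, "**/*");

        while (!ct.IsCancellationRequested)
        {
            try
            {
                if (token.HasChanged)
                {
                    onChange();
                    // A change token is single-use, so create a fresh one.
                    token = new PollingWildCardChangeToken(root, "**/*");
                }
            }
            catch (IOException ex)
            {
                // For example "Host is down" while the share is unavailable.
                onError(ex);
            }

            // Assumed interval; Task.Delay throws OperationCanceledException
            // when `ct` is cancelled, which ends the loop for the caller.
            await Task.Delay(TimeSpan.FromSeconds(5), ct);
        }
    }
}
```

This trades the convenience of ChangeToken.OnChange for control over where the exception surfaces, so it is a workaround rather than a fix.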

Configuration

docker container run --rm --entrypoint dotnet my-worker:latest --info

Host:
  Version:      8.0.0
  Architecture: x64
  Commit:       5535e31a71
  RID:          linux-x64

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 8.0.0 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 8.0.0 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  Not set

global.json file:
  Not found

Other information

No response

ghost commented 9 months ago

Tagging subscribers to this area: @dotnet/area-extensions-filesystem See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	georg-jung
Assignees:	-
Labels:	`area-Extensions-FileSystem`
Milestone:	-

danmoseley commented 9 months ago

I don't have (or don't remember) context, but is there already an established pattern in file watcher for exceptions on its threadpool threads? If so any interest in offering a PR applying it in this path?

georg-jung commented 9 months ago

Sorry for the delay. I don't think there is a very clear established pattern.

FileSystemWatcher

If PhysicalFilesWatcher uses the FileSystemWatcher backend, it handles the FileSystemWatcher.Error event, whose docs read (emphasis mine):

This event is raised whenever something prevents the FileSystemWatcher object from monitoring changes. For example, if the object is monitoring changes in a remote directory and the connection to that directory is lost, the Error event is raised.

In that case, the listeners will be notified:

private void OnError(object sender, ErrorEventArgs e)
{
    // Notify all cache entries on error.
    foreach (string path in _filePathTokenLookup.Keys)
    {
        ReportChangeForMatchedEntries(path);
    }
}

Polling

The polling implementations PollingWildCardChangeToken.cs and PollingFileChangeToken.cs, however, do not contain any try-catch blocks at all.

How to proceed

Off the top of my head, there are multiple options:

1. Make PollingWildCardChangeToken and PollingFileChangeToken signal on error. They are public types though, and if they are not used in the context of PhysicalFilesWatcher, errors can be caught when getting the HasChanged property's value.
2. Add some OnError mechanism to PollingWildCardChangeToken and PollingFileChangeToken. This would be a new public API though.
3. Just have a try-catch block in PhysicalFilesWatcher where it polls the tokens. This would bring the polling behaviour in line with FileSystemWatcher-backed instances. It also wouldn't add or change any public API surface, and it wouldn't change any public API's behaviour, except that it would signal the listeners instead of taking down processes due to uncatchable exceptions on the thread pool.

Thus the third option seems preferable to me. It also seems like an almost non-invasive change (any chance to get this in a servicing release?).
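
The third option could look roughly like the following. This is a minimal sketch and not the actual PhysicalFilesWatcher code: `IPollingToken`, `NotifyChange` and `PollingSketch` are hypothetical names introduced for illustration, and the choice of which exception types to catch is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for a polling change token; not a library type.
public interface IPollingToken
{
    bool HasChanged { get; }   // may throw, e.g. IOException when a share is down
    void NotifyChange();       // signals the registered listeners
}

public static class PollingSketch
{
    // Polls every token once. A poll that fails is treated like a change,
    // so listeners are signalled instead of the exception escaping on the
    // timer's thread pool thread.
    public static void RaiseChangeEvents(IEnumerable<IPollingToken> tokens)
    {
        foreach (IPollingToken token in tokens)
        {
            bool hasChanged;
            try
            {
                hasChanged = token.HasChanged;
            }
            catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
            {
                // Assumption: mirror the OnError handler shown above,
                // which notifies all entries when the watcher fails.
                hasChanged = true;
            }

            if (hasChanged)
            {
                token.NotifyChange();
            }
        }
    }
}
```

Treating a failed poll as a change is what would keep the polling path consistent with the FileSystemWatcher-backed path, where OnError reports a change for every entry.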

I'd be happy to create a PR when we agree on how to proceed 👍.

joel-jeremy commented 9 months ago

+1 to this. I have also encountered the same problem: there is currently no way to recover when the connection to the remote directory is lost.

jozkee commented 9 months ago

If PhysicalFilesWatcher uses the FileSystemWatcher backend, it handles the FileSystemWatcher.Error event

There are also uncaught exceptions on FSW when certain folders can't be enumerated, see callstack in https://github.com/dotnet/runtime/issues/91879.

jozkee commented 9 months ago

There's a 4th option: DirectoryInfoWrapper uses _directoryInfo.EnumerateFileSystemInfos("*", SearchOption.TopDirectoryOnly), which doesn't enable EnumerationOptions.IgnoreInaccessible. I would expect that enabling that option could fix this case.

However, it's been historically annoying to deal with unexpected exceptions in RaiseChangeEvents:
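
For illustration, enumerating with that option enabled could be sketched as below. `EnumerationSketch` and `EnumerateTopLevel` are hypothetical names, this is not the DirectoryInfoWrapper source, and whether IgnoreInaccessible covers a share going offline is an open question rather than something this sketch demonstrates.

```csharp
using System.Collections.Generic;
using System.IO;

public static class EnumerationSketch
{
    // Hypothetical helper: top-level enumeration that skips entries
    // the process is not allowed to access.
    public static IEnumerable<FileSystemInfo> EnumerateTopLevel(string path)
    {
        var options = new EnumerationOptions
        {
            IgnoreInaccessible = true,
            RecurseSubdirectories = false,
            // A new EnumerationOptions skips hidden and system entries by
            // default; clear that so they are still returned, as they are
            // with the SearchOption overload.
            AttributesToSkip = 0,
        };

        return new DirectoryInfo(path).EnumerateFileSystemInfos("*", options);
    }
}
```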

The issues you linked.
https://github.com/dotnet/runtime/issues/41737
https://github.com/dotnet/runtime/issues/65829

Wrapping the whole method body in a try-catch could also be best if it's safe to do so.

georg-jung commented 9 months ago

Judging from its docs alone, I don't think using IgnoreInaccessible would fix cases where a network share goes down:

Gets or sets a value that indicates whether to skip files or directories when access is denied (for example, UnauthorizedAccessException or SecurityException).

Looking at its source, I'm unsure which cases EBADF stands for. The explicit "Host is down" message in my original stack trace could, however, indicate that the internal error code is Error_EHOSTDOWN instead. That means, if I understand it correctly, option 4 wouldn't fix this issue.

jozkee commented 9 months ago

I see, unless we modify IgnoreInaccessible on Linux, option 4 won't work. @georg-jung would you like to explore option 3 and send a PR?
