awslabs / kinesis-agent-windows

An extensible Windows agent that ingests logs and metrics to AWS services such as Kinesis Stream, Kinesis Firehose, CloudWatch Logs and CloudWatch.
Apache License 2.0

FileSystemWatcher in DirectorySource reliability? #2

Closed softwareguy74 closed 5 years ago

softwareguy74 commented 5 years ago

Over years of working with the FileSystemWatcher, I have found that it is not totally reliable, and that this may have more to do with the underlying OS than with the library itself. Typical problems are missed events, or the watcher simply stopping and needing a restart. This is even more true when monitoring UNC paths. I have read several articles suggesting that a polling mechanism, which checks for changes to a file every X seconds instead of relying on FSW, would be more reliable, especially for UNC paths. Or, at the very least, if you use the FSW, you need a separate "watcher" to make sure the FSW is still running and to restart it if it is not.

Has this been taken into consideration with the design of the DirectorySource class?
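
For readers unfamiliar with the polling approach mentioned above, here is a minimal sketch of what "check for changes every X seconds" could look like. It is written in Python only for brevity and is not code from DirectorySource; the names `take_snapshot` and `diff_snapshots` are hypothetical. The idea is to record each file's size and modification time, then compare two snapshots taken some interval apart.

```python
import os
from typing import Dict, List, Tuple

# File name -> (size in bytes, modification time in nanoseconds).
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(directory: str) -> Snapshot:
    """Record size and modification time for every regular file in a directory."""
    snapshot: Snapshot = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                snapshot[entry.name] = (info.st_size, info.st_mtime_ns)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Compare two snapshots and report created, deleted, and changed files."""
    return {
        "created": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in new.keys() & old.keys() if new[name] != old[name]
        ),
    }
```

A caller would take a snapshot on every tick of a timer and diff it against the previous one. This does not depend on any OS notification being delivered, at the cost of one directory listing per interval.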

aspcompiler commented 5 years ago

Hi @softwareguy74 ,

Thank you very much for your suggestions. We actually do have a polling mechanism. On a busy system, the FileSystemWatcher can fire extremely rapidly, so we buffer the events and process them at a timer interval. In the timer event, we also poll the file system. See: https://github.com/awslabs/kinesis-agent-windows/blob/3b995446156c742e5196fb008b25dcae390dedf5/Amazon.KinesisTap.Core/Sources/DirectorySource.cs#L306.

We did observe that file system events do not always fire, but any disturbance (such as listing files or reading the file length) seems to trigger them.

So far we have not encountered this issue in our internal production use. That said, we do not have many data points on UNC paths. If you encounter any problems, please do let us know.
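
The buffering described in this reply can be illustrated with a small sketch. This is an assumption about the general pattern, not the code at the linked line of DirectorySource.cs, and it is written in Python rather than C#. `EventBuffer`, `on_timer_tick`, `poll_changed`, and `process` are hypothetical names. Rapid notifications for the same path collapse into one pending entry, and each timer tick handles the union of buffered paths and whatever a poll of the file system reports.

```python
import threading
from typing import Callable, Iterable, List, Set


class EventBuffer:
    """Collects paths reported by change notifications until the next timer tick."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Set[str] = set()

    def on_event(self, path: str) -> None:
        # Called from the notification thread; repeated events for one path coalesce.
        with self._lock:
            self._pending.add(path)

    def drain(self) -> Set[str]:
        # Atomically take everything buffered so far and start a fresh set.
        with self._lock:
            pending, self._pending = self._pending, set()
        return pending


def on_timer_tick(
    buffer: EventBuffer,
    poll_changed: Callable[[], Iterable[str]],
    process: Callable[[str], None],
) -> List[str]:
    """Process paths seen via events plus paths found by polling, each exactly once."""
    paths = sorted(buffer.drain() | set(poll_changed()))
    for path in paths:
        process(path)
    return paths
```

Because the poll runs on every tick, a file whose notification was never delivered is still picked up one interval later, which is the property the reply relies on.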

softwareguy74 commented 5 years ago

The issue with UNC paths arises when a file server is taken offline and brought back up. The FSW stops delivering events and needs to be restarted. So there needs to be some mechanism to check whether the FSW is no longer working and restart it. Some people may well put log files on UNC paths, so this could be a real problem out in the real world.
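
The restart mechanism requested here could be sketched as follows. This is only an illustration of the idea in Python, not the agent's implementation; `WatcherSupervisor`, `create_watcher`, and `is_available` are hypothetical names, and the watcher object is assumed to expose a `stop()` method. A periodic check discards the watcher while the directory is unreachable and creates a fresh one once it is reachable again.

```python
import os
from typing import Callable, Optional, Protocol


class Watcher(Protocol):
    """Anything that can be stopped; stands in for a real file system watcher."""

    def stop(self) -> None: ...


class WatcherSupervisor:
    """Recreates a watcher after the watched directory disappears and returns."""

    def __init__(
        self,
        directory: str,
        create_watcher: Callable[[str], Watcher],
        is_available: Callable[[str], bool] = os.path.isdir,
    ) -> None:
        self._directory = directory
        self._create_watcher = create_watcher
        self._is_available = is_available
        self._watcher: Optional[Watcher] = None
        self.starts = 0  # Number of watchers created so far.

    def check(self) -> bool:
        """Call periodically. Returns True if a watcher is running afterwards."""
        if not self._is_available(self._directory):
            if self._watcher is not None:
                try:
                    self._watcher.stop()
                except OSError:
                    pass  # The share is gone, so cleanup errors are expected.
                self._watcher = None
            return False
        if self._watcher is None:
            self._watcher = self._create_watcher(self._directory)
            self.starts += 1
        return True
```

Calling `check()` from the same timer that drives polling keeps the design to a single background loop. Note that this only detects the directory becoming unreachable; a watcher that goes silent while the directory stays reachable would need a different health signal.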

aspcompiler commented 5 years ago

Agree. Thanks for the clarification!

We had a recent commit for a slightly different use case, but it may not cover yours: https://github.com/awslabs/kinesis-agent-windows/commit/c9182fc11c7aedf8fb1fd372d654da4db1fa0203. We'll post back on this thread once we have verified or implemented the scenario of a file server being taken offline and brought back up.

aspcompiler commented 5 years ago

Hi @softwareguy74 , we have added a feature to recreate FSW when a directory becomes unavailable in https://github.com/awslabs/kinesis-agent-windows/commit/bd8ba3a34dc26b1699eaba4fbb78428cc24f8385. Please try version 1.0.0.128 or later from our beta download page: https://s3-us-west-2.amazonaws.com/kinesis-agent-windows/beta/index.html. Let us know if the fix addresses your concern. Thank you very much.