GoogleCloudPlatform / fluent-plugin-detect-exceptions

A fluentd plugin that scans line-oriented log streams and combines exceptions stacks into a single log entry.
Apache License 2.0
192 stars 69 forks source link

Support nested C# exceptions #58

Closed eadred closed 5 years ago

eadred commented 5 years ago
igorpeshansky commented 5 years ago

@eadred Thank you for the contribution!

This plugin is designed to work with Stackdriver Error Reporting, which should recognize the resulting multiline message as a stack trace. I'm verifying this with the team right now whether the backend would recognize a nested C# exception if the stack trace were joined into a single log entry. If the backend won't recognize these, there's no point in handling them in the plugin, and we should file an Error Reporting feature request.

igorpeshansky commented 5 years ago

I've confirmed with the Error Reporting team that all 3 new exceptions were picked up by Error Reporting, but maybe not in the form you'd expect. For example, for the CSHARP_AGGREGATE_EXC value from the test, only the content between the arrows was captured.

For the record, a self-service test can be done by running:

echo "System.AggregateException: One or more errors occurred. (This is an exception) ---> System.InvalidOperationException: This is an exception
  at ExampleApp.AsyncExceptionExample.LowestLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 28
  at ExampleApp.AsyncExceptionExample.ThirdLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 23
  at ExampleApp.AsyncExceptionExample.SecondLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 17
  at ExampleApp.AsyncExceptionExample.TopLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 12
  --- End of inner exception stack trace ---
  at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  at System.Threading.Tasks.Task.Wait()
  at ExampleApp.Program.Main(String[] args) in c:/ExampleApp/ExampleApp/Program.cs:line 12
---> (Inner Exception #0) System.InvalidOperationException: This is an exception
  at ExampleApp.AsyncExceptionExample.LowestLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 28
  at ExampleApp.AsyncExceptionExample.ThirdLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 23
  at ExampleApp.AsyncExceptionExample.SecondLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 17
  at ExampleApp.AsyncExceptionExample.TopLevelMethod() in c:/ExampleApp/ExampleApp/AsyncExceptionExample.cs:line 12<---
" | \
gcloud beta error-reporting events report \
  --service manual \
  --service-version test-version \
  --project=$(gcloud config get-value project) \
  --message-file=/dev/stdin

(see the gcloud docs).

If the ingestion results don't look ok, let's open a feature request against Error Reporting before proceeding with this PR. For the values that work correctly, we can start reviewing those portions of the PR.

eadred commented 5 years ago

Sorry about the force push. I've now restored the original commit as it was at the time of the first review (https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/commit/1df893b5db924cbfa987fe44e1719979e70125da).

I've now managed to get the C# rules into a separate set of rules (in the most recent commit https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/commit/a991faaa09c74bb19a40bb6cb9a377bc021441ea). As I mentioned in an earlier comment, I've now had to cope with the fact that it is ambiguous as far as the rules are concerned whether you are seeing a C# or a Java exception. To implement this I've had to support the state machine having more than one possible current state. For example, it might be in either the :java_start_exception or the :csharp_start_exception state when it see a line like "foo.bar.baz.Exception: Something went wrong."

Unfortunately these changes have made the bench tests around 3 to 4 times slower because the ExceptionDetector.transition method can't exit early once it has found a match. I'm not sure there's much I can do about this - is it preferable to have the cleaner separation of rules with worse performance, or to go back to the mixing of C# and Java-like rules?

eadred commented 5 years ago

I checked how the new C# exceptions looked in error reporting, and it seems it is only the aggregate exception that doesn't get picked up properly. It is actually the other two types of exception I am more interested in getting handled properly, so if it meant this PR could get merged faster I could drop the logic for handling the aggregate exception. What do you think?

qingling128 commented 5 years ago

@eadred - Thanks for refactoring the code. Now I see why it was challenging to separate csharp and java states. We care a lot of performance, so 3X - 4X processing time is sufficient to make an exception for this case. We probably want to revert the https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/commit/a991faaa09c74bb19a40bb6cb9a377bc021441ea commit.

As to the aggregate exception, let's drop that part (please open a feature request with Error Reporting backend if needed) so that we can get this PR in.

eadred commented 5 years ago

Thanks for that.

I've now reverted https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/commit/a991faaa09c74bb19a40bb6cb9a377bc021441ea (as https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/pull/58/commits/7fa27c0d842d804a7fd81724a2916389fcaacca5) and also removed the handling for aggregate exceptions (https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/pull/58/commits/58889d353f3f7e92866d053dda31000f0353a1b2).

I've raised an issue with the Error Reporting team for handling AggregateExceptions - https://issuetracker.google.com/issues/139292720.

PS. I realised that my handling of aggregate exceptions wasn't dealing with the case where there are multiple inner exceptions, so I did put in a fix for that (https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/pull/58/commits/6ea0cb7f6d935668782073d3bd4770c24c3ddcc2). Although this has all been reverted anyway, I may want to re-add these changes in a future PR.

qingling128 commented 5 years ago

Thanks @eadred! This looks good to me. Seems like there are still pending comments from @igorpeshansky. He is on vacation and will be back on Monday (Aug 19). Is this something that can wait or is it urgent?

eadred commented 5 years ago

Thanks @qingling128. No problem if this is delayed - it isn't urgent.

qingling128 commented 5 years ago

Cool, thanks. Will wait for @igorpeshansky to take a second look once he's back.

xflag commented 5 years ago

@igorpeshansky Could you release a new version for this issue? Now I need this enhancement,I use .net core running in docker swarm,then collect log from docker logging driver to fluentd which add this plugin for multiline error logs.

qingling128 commented 5 years ago

I'll work with our oncall @davidbtucker to do a release.

qingling128 commented 5 years ago

Sent https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/pull/65.

qingling128 commented 5 years ago

Done: https://github.com/GoogleCloudPlatform/fluent-plugin-detect-exceptions/releases/tag/0.0.13