aws / aws-xray-sdk-dotnet

The official AWS X-Ray SDK for .NET.
Apache License 2.0
114 stars 64 forks source link

Race Condition found in Entity.AddException when Concurrently Adding Subsegments #306

Closed luizfds closed 2 weeks ago

luizfds commented 2 months ago

I've identified a race condition in the Entity.AddException method when subsegments are being added concurrently. The issue arises from the concurrent access and modification of the _lazySubsegments list, which can lead to unpredictable behavior or exceptions during the execution of the DescribeException method within AddException. I am using powertools-lambda-dotnet, but after digging into the logs I found the issue originates from aws-xray-sdk-dotnet.

image

Steps to Reproduce:

While a definitive reproduction might be challenging due to the non-deterministic nature of race conditions, here's a scenario that increases the likelihood of encountering the issue:

  1. Create an instance of Entity (e.g., a Segment).
  2. In one thread, trigger the AddException method on the Entity instance.
  3. Concurrently, in another thread or multiple threads, repeatedly call AddSubsegment on the same Entity instance to simulate concurrent modifications to the _lazySubsegments list.
  4. Observe if any exceptions, such as InvalidOperationException, are thrown during the execution of AddException or subsequent operations.

I've setup the following code snippet to simulate the error:

var entity = new Segment("TestSegment");
const int EXCEPTION_ITERATIONS = 100;
const int SUBSEGMENT_ITERATIONS = 100_000;

var raceConditionOccurred = false;
string? exceptionMessage = null;
var cancellationTokenSource = new CancellationTokenSource();

// Thread 1: Trigger AddException
var addExceptionTask = Task.Run(() =>
{
    Parallel.ForEach(Enumerable.Range(0, EXCEPTION_ITERATIONS), _ =>
    {
        try
        {
            var exception = new Exception("Test Exception");
            entity.AddException(exception);
        }
        catch (InvalidOperationException ex)
        {
            raceConditionOccurred = true;
            exceptionMessage = ex.Message;
            cancellationTokenSource.Cancel();
        }
    });
});

// Thread 2: Add Subsegments concurrently using Parallel.ForEach
var addSubsegmentsTask = Task.Run(() =>
{
    try
    {
        Parallel.ForEach(Enumerable.Range(0, SUBSEGMENT_ITERATIONS), new ParallelOptions { CancellationToken = cancellationTokenSource.Token }, _ =>
        {
            entity.AddSubsegment(new Subsegment("TestSubsegment"));
        });
    }
    catch (OperationCanceledException) { }
});

Task.WaitAll(addExceptionTask, addSubsegmentsTask);

if (raceConditionOccurred)
{
    _output.WriteLine($"Race condition error occurred: {exceptionMessage}");
}

Assert.True(raceConditionOccurred, "Expected a race condition error but none occurred.");

Expected Behavior:

The AddException method should execute correctly even when subsegments are being added concurrently, without throwing any exceptions or causing data inconsistencies.

Suggested Fix:

Add a lock to ensure thread-safe access to the _lazySubsegments list during the execution of Entity.AddException(Exception e).

 /// <summary>
 /// Adds the exception to cause and set this segment to has fault.
 /// </summary>
 /// <param name="e">The exception to be added.</param>
 public void AddException(Exception e)
 {
     HasFault = true;
     Cause = new Cause();
     List<Subsegment> subsegmentsCopy;
     lock (_lazySubsegments.Value)
     {
         subsegmentsCopy = [.. _lazySubsegments.Value]; // Create a copy to avoid holding the lock during serialization
     }

     Cause.AddException(AWSXRayRecorder.Instance.ExceptionSerializationStrategy.DescribeException(e, subsegmentsCopy));
 }
luizfds commented 2 weeks ago

Hi there,

Any updates on this issue? By the way, I already have an open PR addressing it: PR #307.

vastin commented 2 weeks ago

Hi Luiz, the PR has been reviewed and merged. Thank you for contribution!