Azure / azure-webjobs-sdk

Azure WebJobs SDK
MIT License
737 stars 358 forks source link

The lease ID specified did not match the lease ID for the blob crashes worker. #2103

Open lostllama opened 5 years ago

lostllama commented 5 years ago

I'm getting the following exception when using WebJobs using Microsoft.Azure.WebJobs.Core v 3.0.3 at a rate of about 55x per month:

Microsoft.WindowsAzure.Storage.StorageException: The lease ID specified did not match the lease ID for the blob.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T]
at Microsoft.Azure.WebJobs.Host.StorageBaseDistributedLockManager.SingletonLockHandle.RenewAsync (C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host.Storage\Singleton\BlobLeaseDistributedLockManager.cs:347)
at Microsoft.Azure.WebJobs.Host.SingletonManager.RenewLeaseCommand.ExecuteAsync (C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Singleton\SingletonManager.cs:332)
at Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer.RunAsync (C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs:147)
at Microsoft.Azure.WebJobs.Host.Timers.WebJobsExceptionHandler.<>c__DisplayClass3_0.<OnUnhandledExceptionAsync>b__0 (C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\WebJobsExceptionHandler.cs:54)
at System.Threading.Thread.ThreadMain_ThreadStart
at System.Threading.ExecutionContext.RunInternal

Repro steps

Provide the steps required to reproduce the problem

var host = new HostBuilder()
    .UseEnvironment("sandbox")
    .UseServiceProviderFactory(new ServiceProviderFactory(s => GetContainer(s)))
    .ConfigureWebJobs(builder => {
        builder.UseWebJobsStartup<Startup>();
        builder.UseHostId($"worker-v{Assembly.GetAssembly(typeof(Program)).GetName().Version.ToString().Replace('.', '-')}");
        builder.AddAzureStorageCoreServices();
        builder.AddAzureStorage();
        builder.AddTimers();
        builder.AddExecutionContextBinding();
    })
    .ConfigureAppConfiguration((context, builder) => {
        /* config setup */
    })
    .ConfigureServices((hostingContext, services) => {
        /* additional service setup */
    })
    .ConfigureLogging((hostingContext, logging) => {
        /* logging configuration */
    })
    .UseConsoleLifetime()
    .Build();
await host.RunAsync();

Startup simply contains:

builder.Services
    .AddSingleton<INameResolver, WorkerNameResolver>()
    .Configure<QueuesOptions>(o =>
    {
        o.MaxPollingInterval = TimeSpan.FromSeconds(15);
        o.MaxDequeueCount = MaxDequeueCount; 
    });

Referencing:

internal class WorkerNameResolver : INameResolver
{
    public string Resolve(string name)
    {
        if (name == "TaskQueue")
        {
            var taskScheduleNameProvider = new Core.Tasks.VersionedTaskScheduleNameProvider();
            return taskScheduleNameProvider.GetName();
        }

        throw new ArgumentException("$Unrecognized NameResolver pattern '{name}'.");
    }
}

Expected behavior

WebJobs worker runs without crashing.

Actual behavior

WebJobs worker occasionally crashes with above issue.

Known workarounds

N/A

Related information

Provide any related information

lostllama commented 5 years ago

I've seen this 27 times today alone in our production cluster.

lostllama commented 5 years ago

@fabiocav

Does anyone care about this one? This is causing our webjobs application to cycle multiple times a day. I've created the following workaround which I'm about to test:

bool continueRunning = false;
do
{
    try
    {
        continueRunning = false;
        await host.RunAsync();
    }
    catch (Microsoft.WindowsAzure.Storage.StorageException se) when (se.Message.Contains("lease ID") || se.Message.Contains("Singleton lock renewal"))
    {
        continueRunning = true;
        logger.LogError(se, $"Worker failed: {se.Message}");
    }
}
while (continueRunning);

This workaround is hardly ideal.

lostllama commented 5 years ago

OK, so my workaround doesn't work. It appears that I missed a fact that the lock manager employed by webjobs is actually starting its own thread, and the errors are happening there, crashing the application.

lostllama commented 5 years ago

My new workaround was implementing my own IDistributedLockManager (which the obsolete flag suggests I shouldn't do, but it's that or my application crashes). This appears to have solved the problem.

Alicedev25 commented 5 years ago

@lostllama Could you please explain how did you implement the workaround i am facing the same issue! Thanks!

Alicedev25 commented 5 years ago

@fabiocav Can you have a look at this issue please! Thanks!

lostllama commented 5 years ago

@lostllama Could you please explain how did you implement the workaround i am facing the same issue!

@Alicedev25 I'm not sure what the "account" property in the interface is supposed to do, but I took "lockId" only as the the unique key. So use at your own risk, and I take no responsibility if it doesn't work how you expect :) Note that IDistributedLockManager has an Obsolete attribute stating: "Not ready for public consumption."

Now, I modified mine to use Mongo as a lease store, but you can change it to whatever you're using (e.g. SQL).

#pragma warning disable CS0618
public class MongoLockManager : IDistributedLockManager
{
    private readonly IMongoCollection<Data.WorkerLock> _workerLockCollection;
    private readonly ILogger _wrapperLogger;

    private class LockHandle : IDistributedLock
    {
        public string LockId { get; set; }
        public string LeaseId { get; set; }
        public ObjectId MongoId { get; set; }
        public TimeSpan LockPeriod { get; set; }
    }

    public MongoLockManager(IMongoDatabase db, ILoggerFactory loggerFactory)
    {
        _workerLockCollection = db.GetCollection<Data.WorkerLock>(Data.CollectionNames.WorkerLocks);
        _wrapperLogger = loggerFactory.CreateLogger<MongoLockManager>();
    }

    public async Task<string> GetLockOwnerAsync(string account, string lockId, CancellationToken cancellationToken)
    {
        var lockObject = await _workerLockCollection
            .Find(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId)
                & Builders<Data.WorkerLock>.Filter.Gte(l => l.LockExpires, DateTime.UtcNow)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.Account, account)
            )
            .FirstOrDefaultAsync(cancellationToken);

        return lockObject?.LockOwnerId;
    }

    public async Task ReleaseLockAsync(IDistributedLock lockHandle, CancellationToken cancellationToken)
    {
        if (!(lockHandle is LockHandle handle))
        {
            return;
        }

        await _workerLockCollection
            .DeleteOneAsync(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, handle.MongoId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LeaseId, handle.LeaseId),
                cancellationToken: cancellationToken
            );
    }

    public async Task<bool> RenewAsync(IDistributedLock lockHandle, CancellationToken cancellationToken)
    {
        if (!(lockHandle is LockHandle handle))
        {
            return false;
        }

        var result = await _workerLockCollection
            .UpdateOneAsync(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, handle.MongoId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LeaseId, handle.LeaseId),
                Builders<Data.WorkerLock>.Update.Set(l => l.LockExpires, DateTime.UtcNow.Add(handle.LockPeriod)),
                cancellationToken: cancellationToken
            );

        var updatedLock = result.IsModifiedCountAvailable && result.ModifiedCount > 0;
        if (!updatedLock)
        {
            var entries = await _workerLockCollection
                .CountDocumentsAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId),
                    cancellationToken: cancellationToken
                );

            if (entries > 0)
            {
                _wrapperLogger.LogInformation($"Unable to reacquire lock: {handle.LockId}.");
            }
            return false;
        }

        return updatedLock;
    }

    public async Task<IDistributedLock> TryLockAsync(string account, string lockId, string lockOwnerId, string proposedLeaseId, TimeSpan lockPeriod, CancellationToken cancellationToken)
    {
        var lockEntity = new Data.WorkerLock()
        {
            Account = account,
            LockId = lockId,
            LockOwnerId = lockOwnerId ?? Environment.MachineName,
            LeaseId = proposedLeaseId ?? Guid.NewGuid().ToString(),
            LockExpires = DateTime.UtcNow + lockPeriod,
            HostName = Environment.MachineName
        };

        // Try and find existing lock object
        var existingLockMongoId = (await _workerLockCollection
            .Find(Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId))
            .FirstOrDefaultAsync(cancellationToken))?.Id;

        if (existingLockMongoId != null)
        {
            lockEntity.Id = existingLockMongoId.Value;

            var updateResult = await _workerLockCollection
                .ReplaceOneAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, lockEntity.Id)
                    & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId)
                    & Builders<Data.WorkerLock>.Filter.Lt(l => l.LockExpires, DateTime.UtcNow),
                    lockEntity,
                    cancellationToken: cancellationToken);

            if (updateResult.IsModifiedCountAvailable && updateResult.ModifiedCount > 0)
            {
                return new LockHandle()
                {
                    LeaseId = lockEntity.LeaseId,
                    LockId = lockEntity.LockId,
                    LockPeriod = lockPeriod,
                    MongoId = lockEntity.Id
                };
            }
        }
        else
        {
            var upsertResult = await _workerLockCollection
                .UpdateOneAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockEntity.LockId),
                    Builders<Data.WorkerLock>.Update
                        .Set(l => l.LockId, lockEntity.LockId)
                        .SetOnInsert(l => l.Account, lockEntity.Account)
                        .SetOnInsert(l => l.LockOwnerId, lockEntity.LockOwnerId)
                        .SetOnInsert(l => l.LeaseId, lockEntity.LeaseId)
                        .SetOnInsert(l => l.LockExpires, lockEntity.LockExpires)
                        .SetOnInsert(l => l.HostName, lockEntity.HostName),
                        new UpdateOptions() { IsUpsert = true },
                        cancellationToken
                );

            if (upsertResult.IsAcknowledged && upsertResult.UpsertedId != null && !upsertResult.UpsertedId.IsBsonNull)
            {
                lockEntity.Id = upsertResult.UpsertedId.AsObjectId;
                return new LockHandle()
                {
                    LeaseId = lockEntity.LeaseId,
                    LockId = lockEntity.LockId,
                    LockPeriod = lockPeriod,
                    MongoId = lockEntity.Id
                };
            }
        }

        return null;
    }
}
#pragma warning restore CS0618

And then I register it with the container in the HostBuilder's ConfigureServices method:

#pragma warning disable CS0618
// The default lock manager is causing crashes, so we've replaced it with a Mongo one
services.AddSingleton(typeof(IDistributedLockManager), typeof(MongoLockManager));
#pragma warning restore CS0618

Doing so here should beat AddAzureStorage (IIRC it was this method that registers the failing one) to the punch.

zmarty commented 3 years ago

@brettsam We saw this 7,457 times in the last 7 days in our Azure Function apps. Each such instance restarted the entire host.

{"assembly":"Microsoft.Azure.WebJobs.Host, Version=3.0.22.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer+d__14.MoveNext","level":15,"line":147,"fileName":"C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Timers\TaskSeriesTimer.cs"}