lostllama opened 5 years ago
I've seen this 27 times today alone in our production cluster.
@fabiocav
Does anyone care about this one? It's causing our WebJobs application to cycle multiple times a day. I've created the following workaround, which I'm about to test:
bool continueRunning = false;
do
{
    try
    {
        continueRunning = false;
        await host.RunAsync();
    }
    catch (Microsoft.WindowsAzure.Storage.StorageException se) when (se.Message.Contains("lease ID") || se.Message.Contains("Singleton lock renewal"))
    {
        continueRunning = true;
        logger.LogError(se, $"Worker failed: {se.Message}");
    }
}
while (continueRunning);
This workaround is hardly ideal.
OK, so my workaround doesn't work. It appears I missed the fact that the lock manager used by WebJobs actually starts its own thread, and the errors happen there, crashing the application.
My new workaround was implementing my own IDistributedLockManager
(which the obsolete flag suggests I shouldn't do, but it's that or my application crashes). This appears to have solved the problem.
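To illustrate why the retry loop above can't help, here's a minimal sketch (not the actual WebJobs code) of the underlying problem: an exception thrown on a separately started thread never propagates to the try/catch around `await host.RunAsync()`, so the process crashes regardless.

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        try
        {
            var t = new Thread(() =>
            {
                // Simulates the lock renewal timer failing on its own thread.
                throw new InvalidOperationException("Singleton lock renewal failed");
            });
            t.Start();
            t.Join();
        }
        catch (InvalidOperationException)
        {
            // Never reached: the exception was raised on the other thread,
            // so it takes down the process instead of landing here.
            Console.WriteLine("caught");
        }
    }
}
```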
@lostllama Could you please explain how you implemented the workaround? I am facing the same issue! Thanks!
@fabiocav Can you have a look at this issue, please? Thanks!
@Alicedev25 I'm not sure what the "account" property in the interface is supposed to do, but I took "lockId" alone as the unique key. So use at your own risk; I take no responsibility if it doesn't work how you expect :) Note that IDistributedLockManager has an Obsolete attribute stating: "Not ready for public consumption."
Now, I modified mine to use Mongo as a lease store, but you can change it to whatever you're using (e.g. SQL).
#pragma warning disable CS0618 // IDistributedLockManager: "Not ready for public consumption."
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Host;
using Microsoft.Extensions.Logging;
using MongoDB.Bson;
using MongoDB.Driver;

public class MongoLockManager : IDistributedLockManager
{
    private readonly IMongoCollection<Data.WorkerLock> _workerLockCollection;
    private readonly ILogger _wrapperLogger;

    private class LockHandle : IDistributedLock
    {
        public string LockId { get; set; }
        public string LeaseId { get; set; }
        public ObjectId MongoId { get; set; }
        public TimeSpan LockPeriod { get; set; }
    }

    public MongoLockManager(IMongoDatabase db, ILoggerFactory loggerFactory)
    {
        _workerLockCollection = db.GetCollection<Data.WorkerLock>(Data.CollectionNames.WorkerLocks);
        _wrapperLogger = loggerFactory.CreateLogger<MongoLockManager>();
    }

    public async Task<string> GetLockOwnerAsync(string account, string lockId, CancellationToken cancellationToken)
    {
        // A lock is only "owned" while its document exists and hasn't expired.
        var lockObject = await _workerLockCollection
            .Find(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId)
                & Builders<Data.WorkerLock>.Filter.Gte(l => l.LockExpires, DateTime.UtcNow)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.Account, account)
            )
            .FirstOrDefaultAsync(cancellationToken);
        return lockObject?.LockOwnerId;
    }

    public async Task ReleaseLockAsync(IDistributedLock lockHandle, CancellationToken cancellationToken)
    {
        if (!(lockHandle is LockHandle handle))
        {
            return;
        }
        await _workerLockCollection
            .DeleteOneAsync(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, handle.MongoId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LeaseId, handle.LeaseId),
                cancellationToken: cancellationToken
            );
    }

    public async Task<bool> RenewAsync(IDistributedLock lockHandle, CancellationToken cancellationToken)
    {
        if (!(lockHandle is LockHandle handle))
        {
            return false;
        }
        // Push the expiry forward only if we still hold the same lease.
        var result = await _workerLockCollection
            .UpdateOneAsync(
                Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, handle.MongoId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId)
                & Builders<Data.WorkerLock>.Filter.Eq(l => l.LeaseId, handle.LeaseId),
                Builders<Data.WorkerLock>.Update.Set(l => l.LockExpires, DateTime.UtcNow.Add(handle.LockPeriod)),
                cancellationToken: cancellationToken
            );
        var updatedLock = result.IsModifiedCountAvailable && result.ModifiedCount > 0;
        if (!updatedLock)
        {
            // A document for this lock still exists, so someone else holds it now.
            var entries = await _workerLockCollection
                .CountDocumentsAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, handle.LockId),
                    cancellationToken: cancellationToken
                );
            if (entries > 0)
            {
                _wrapperLogger.LogInformation($"Unable to reacquire lock: {handle.LockId}.");
            }
            return false;
        }
        return updatedLock;
    }

    public async Task<IDistributedLock> TryLockAsync(string account, string lockId, string lockOwnerId, string proposedLeaseId, TimeSpan lockPeriod, CancellationToken cancellationToken)
    {
        var lockEntity = new Data.WorkerLock()
        {
            Account = account,
            LockId = lockId,
            LockOwnerId = lockOwnerId ?? Environment.MachineName,
            LeaseId = proposedLeaseId ?? Guid.NewGuid().ToString(),
            LockExpires = DateTime.UtcNow + lockPeriod,
            HostName = Environment.MachineName
        };

        // Try to find an existing lock document for this lock ID.
        var existingLockMongoId = (await _workerLockCollection
            .Find(Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId))
            .FirstOrDefaultAsync(cancellationToken))?.Id;

        if (existingLockMongoId != null)
        {
            // A document exists: take it over only if its lease has expired.
            lockEntity.Id = existingLockMongoId.Value;
            var updateResult = await _workerLockCollection
                .ReplaceOneAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.Id, lockEntity.Id)
                    & Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockId)
                    & Builders<Data.WorkerLock>.Filter.Lt(l => l.LockExpires, DateTime.UtcNow),
                    lockEntity,
                    cancellationToken: cancellationToken);
            if (updateResult.IsModifiedCountAvailable && updateResult.ModifiedCount > 0)
            {
                return new LockHandle()
                {
                    LeaseId = lockEntity.LeaseId,
                    LockId = lockEntity.LockId,
                    LockPeriod = lockPeriod,
                    MongoId = lockEntity.Id
                };
            }
        }
        else
        {
            // No document yet: upsert one, treating "we inserted it" as "we got the lock".
            var upsertResult = await _workerLockCollection
                .UpdateOneAsync(
                    Builders<Data.WorkerLock>.Filter.Eq(l => l.LockId, lockEntity.LockId),
                    Builders<Data.WorkerLock>.Update
                        .Set(l => l.LockId, lockEntity.LockId)
                        .SetOnInsert(l => l.Account, lockEntity.Account)
                        .SetOnInsert(l => l.LockOwnerId, lockEntity.LockOwnerId)
                        .SetOnInsert(l => l.LeaseId, lockEntity.LeaseId)
                        .SetOnInsert(l => l.LockExpires, lockEntity.LockExpires)
                        .SetOnInsert(l => l.HostName, lockEntity.HostName),
                    new UpdateOptions() { IsUpsert = true },
                    cancellationToken
                );
            if (upsertResult.IsAcknowledged && upsertResult.UpsertedId != null && !upsertResult.UpsertedId.IsBsonNull)
            {
                lockEntity.Id = upsertResult.UpsertedId.AsObjectId;
                return new LockHandle()
                {
                    LeaseId = lockEntity.LeaseId,
                    LockId = lockEntity.LockId,
                    LockPeriod = lockPeriod,
                    MongoId = lockEntity.Id
                };
            }
        }
        return null;
    }
}
#pragma warning restore CS0618
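The original comment doesn't show the `Data.WorkerLock` document class, so here is a sketch of what it might look like, with the properties inferred from the manager above. The BSON attribute usage and the `CollectionNames` constant value are my assumptions, not the author's code.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

namespace Data
{
    public class WorkerLock
    {
        [BsonId]
        public ObjectId Id { get; set; }

        public string Account { get; set; }
        public string LockId { get; set; }      // the unique key the manager filters on
        public string LockOwnerId { get; set; }
        public string LeaseId { get; set; }
        public DateTime LockExpires { get; set; }
        public string HostName { get; set; }
    }

    public static class CollectionNames
    {
        public const string WorkerLocks = "workerLocks";
    }
}
```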
And then I register it with the container in the HostBuilder's ConfigureServices method:

#pragma warning disable CS0618
// The default lock manager is causing crashes, so we've replaced it with a Mongo one
services.AddSingleton(typeof(IDistributedLockManager), typeof(MongoLockManager));
#pragma warning restore CS0618

Doing so here should beat AddAzureStorage (IIRC it was this method that registers the failing one) to the punch.
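One optional hardening step worth noting (my suggestion, not part of the original comment): `TryLockAsync` assumes at most one document per `LockId`, so a unique index on that field makes the upsert race-safe, since two hosts racing to create the same lock would have one fail with a duplicate-key error rather than both succeeding. A sketch:

```csharp
using MongoDB.Driver;

public static class WorkerLockIndexes
{
    // Call once at startup, before the host begins taking locks.
    public static void EnsureIndexes(IMongoDatabase db)
    {
        var collection = db.GetCollection<Data.WorkerLock>(Data.CollectionNames.WorkerLocks);
        var index = new CreateIndexModel<Data.WorkerLock>(
            Builders<Data.WorkerLock>.IndexKeys.Ascending(l => l.LockId),
            new CreateIndexOptions { Unique = true });
        collection.Indexes.CreateOne(index);
    }
}
```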
@brettsam We saw this 7,457 times in the last 7 days in our Azure Function apps. Each such instance restarted the entire host.
{"assembly":"Microsoft.Azure.WebJobs.Host, Version=3.0.22.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35","method":"Microsoft.Azure.WebJobs.Host.Timers.TaskSeriesTimer+
I'm getting the following exception when using WebJobs (Microsoft.Azure.WebJobs.Core v3.0.3), at a rate of about 55x per month:
Repro steps
Startup simply contains:
Referencing:
Expected behavior
WebJobs worker runs without crashing.
Actual behavior
WebJobs worker occasionally crashes with above issue.
Known workarounds
N/A
Related information