alxbl commented 6 years ago

(This was initially opened with the Storage SDK as https://github.com/Azure/azure-storage-net/issues/539, I have adapted the title to fit the real cause of the issue)

I was asked to open it here for cross-referencing purposes.

Restarting a WebJob with a `ServiceBusTrigger` and `AutoComplete=true` can lead to data loss.

It looks like when a WebJob that writes to blob storage is restarted, there is a race condition where topic messages can be marked as completed (We use AutoComplete = true) but the blob isn't actually present in the storage.

Repro

This seems to happen as soon as the WebJob needs to restart, so all the following lead to repro:

Application Settings Changes (On Save)
Scale Up (e.g. S1 -> S3)
Scale Out (e.g. 1 instance -> 10 instances)
Scale In (e.g. 10 instances -> 1 instance)
Scale Down (e.g. S3 -> S1)
Manual WebJob Restart (Restart and/or Stop then Start)

This doesn't seem to happen when restarting the WebApp that is hosting the WebJob. I believe this is normal because the WebJob is running on the Service Plan and not on the WebApp itself (Correct me if I'm wrong.)

I wrote a sample application that does away with all of our business logic and narrows it down to one of the Azure SDKs (or my ability to C#) for code review or if someone would like to try to reproduce the issue on their end.

The best way to reproduce this issue is to play around with the application settings, scale and to manually restart the webjob while it is processing messages.

For an average test run, I see roughly 2000-5000 blobs out of ~100K that would simply not be in blob storage, but also not be in any dead letter queue or elsewhere. This varies depending on how many times I trigger web job restarts as mentioned above.

Reproducing can be a little time consuming, so what I usually do is setup an S3 with 10 instances of the webjob and change the application settings/restart a few times while I watch the topic queue length go down in ServiceBusExplorer. When the message count is low enough, I'll scale down to S1 1 instance until the queues hit 0 and stop the job before running the validation script (see pseudocode below)

Configuration

Storage Account

Performance = Standard
Secure Transfer = Disabled
Replication = LRS

WebJob

Topic:
- Enabled Batched Messages = true
- Enabled Partitioning = true
Subscription:
- Enabled Batched Messages = true
- Deadletter on filter evaluation errors = true
Max Concurrent Calls = 32
Prefetch Count = 64
Auto-Renew Timeout = 120 seconds
Always On (Web App) = On

Sample App Pseudo-Code

WebJob

Call the destination container numbers.
Let there be messages 1, 2, ..., N in the topic topic/sub. N is the expected number of blobs.
For message m in topic/sub,
1. Create a blob named m with arbitrary content in the numbers container. (await b.UploadTextAsync(m.ToString(), 'true'))
2. Mark message m as complete (<auto-complete>)

Validation

Retrieve all the blobs in numbers, call the collection blobs.
Assert blobs.Count == N.
Let expected be the collection of numbers 1..N.
Assert expected.Except(blobs).Count == 0.
If any, print expected.Except(blobs).

(Assertions fail and a list of blobs is printed.)

What we've tried so far

Stripped away all of the business logic to rule out the application.
Made sure cancellation tokens are passed all the way down to CloudBlockContainer.UploadTextAsync().
Tried switching async calls to their sync version.
Upgraded to Storage SDK 8.4.0.
Upgraded to latest WebJobs and ServiceBus SDKs.

What we're planning to try in parallel with this issue

Manual message completion to try and determine if AutoComplete is the issue
Use REST API directly via HttpClient to try and rule out the Storage SDK.
Hacky code that attempts to verify that the blob is written before returning out of the Function (I would rather not resort to this in our production deployments)

Impact

As you would guess, it's a bit unnerving to deploy to production or even enable autoscale / change settings in a production environment when it carries a risk of losing production data in the process.

Workaround

As mentioned in the linked issue, we've found that AutoComplete = false prevents this issue from happening, and we've resorted to using that right now.

Actual Behavior

Messages are marked as completed before actual processing succeeds, so a shutdown will result in lost messages

Expected Behavior

The messages should only be completed if the handler executes successfully (returns and did not throw.)

Related Information

As I mentioned in the referenced issue:

This makes me think that either the AutoComplete documentation is wrong or there is a bug in the WebJobs.ServiceBus library (emphasis mine):

Gets or sets a value that indicates whether the message-pump should call CompleteAsync(String) or CompleteAsync(String) on messages after the callback has completed processing.

true to complete the message processing automatically on successful execution of the operation; otherwise, false.

(Notice that one sentence reads "successful execution", the other simply says "completed processing")

To summarize, this is clearly not a Storage SDK bug (Sorry!), but might be a bug/documentation typo in the WebJobs.ServiceBus SDK.

Maybe the documentation should clarify what the purpose of AutoComplete is and if this is the expected behavior, clearly mention so.

Cheers, Alex

paulbatum commented 6 years ago

@alxbl Thank you for the extremely thorough bug report. While I'm not a service bus expert, I do not think this is a documentation issue.

paulbatum commented 6 years ago

We're in the middle of looking at a few different service bus bugs that seem to relate to completion. I can't promise we'll fix this one immediately but we're at least going to do a first-pass investigation.

alxbl commented 6 years ago

No worries, we've rewritten the code to always explicitly complete, so there's no time pressure on this on our part. Thanks for looking into it :)

mathewc commented 6 years ago

The other issue Paul was mentioning is https://github.com/Azure/azure-webjobs-sdk/issues/1605. It might be related given that during the restart the operation cancel exceptions will be causing your in flight messages to fail (but they should be retried).

If I get a build up on myget with the fix for #1605 would you be willing to try it with your repro?

alxbl commented 6 years ago

@mathewc I can give it a shot sometime this weekend. Let me know when the build is available.

mathewc commented 6 years ago

Instructions for our myget feed are here.

Build 2.2.0-beta1-11248 of Microsoft.Azure.WebJobs.ServiceBus has the fix.

alxbl commented 6 years ago

Good morning,

I had a bit of time to try and repro this morning. I did a run with 100K messages and a S3/10 instances. I did about 10 restarts and scaled down to S3/1 for the last 2K messages.

The run had 160 messages missing, which feels like an improvement from last time... but it's been many months since I last worked on this particular project, and have since moved to a different team so it's difficult to be certain.

> JobFeeder.exe validate
Establishing connection... ok
Validating... failed!
160 blobs are missing
4727, 4728, 4729, 4730, 4731, 4732, 4733, 4734, 4735, 4736, 4737, 4738, 4739, 4740, 4741, 4742,
4743, 4744, 4745, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5274, 5275, 5276, 5277, 5278, 
5279, 5280, 5281, 5282, 5283, 5284, 5285, 5286, 5287, 5288, 5289, 5290, 5291, 5292, 5293, 5294, 
5295, 5296, 5297, 5298, 5299, 5300, 5301, 5302, 5303, 5304, 5305, 5306, 5307, 5308, 5309, 5310, 
5311, 5312, 5313, 5314, 5315, 5316, 5317, 5318, 5319, 5320, 5321, 5322, 5323, 5324, 5325, 5326, 
5327, 5328, 5329, 5330, 5526, 5527, 5528, 5529, 5530, 5531, 5532, 5533, 5534, 5535, 5536, 5537, 
5538, 5539, 5540, 5541, 5542, 5543, 5544, 5545, 5546, 5547, 5548, 5549, 5550, 5551, 5552, 5553, 
5554, 5555, 5556, 5557, 5558, 5559, 5560, 5561, 5562, 5563, 5564, 5565, 5566, 5567, 5568, 5569, 
5570, 5571, 5572, 5573, 5574, 5575, 5576, 5577, 5578, 5579, 5580, 5581, 5582, 5583, 5584, 16907, 
16908, 16911, 16913, 16918, 16920, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931, 
16932, 16933, 16934

Unfortunately, with my current repro code I don't have much more information than that to give. I've attached my packages.config for the WebJob project, in hope that it might prove helpful. In particular:

<package id="Microsoft.Azure.WebJobs" version="2.2.0-beta1-11248" targetFramework="net462" />                  
<package id="Microsoft.Azure.WebJobs.Core" version="2.2.0-beta1-11248" targetFramework="net462" />             
<package id="Microsoft.Azure.WebJobs.ServiceBus" version="2.2.0-beta1-11248" targetFramework="net462" />

The code I use to repro is available in the issue description, all that is missing would be a powershell script that automates the start/restart of the webjob in order to fully automate the test+validation run.

Let me know if I can do anything else to help.

packages.txt

davidhjones commented 5 years ago

Whats the current status of this Issue?

alxbl commented 5 years ago

As far as I'm aware it should still be possible to reproduce this issue following the repro steps provided above with the sample code. Unfortunately, I am no longer working in an Azure environment and do not have the resources to re-test. We had ended up using the workaround I described (manually completing all messages and disabling auto-complete) and it seemed to have fixed the problems. I no longer work there though so I'm not sure if that is still the case.

Cheers,

Azure / azure-webjobs-sdk