Open alxbl opened 6 years ago
@alxbl Thank you for the extremely thorough bug report. While I'm not a service bus expert, I do not think this is a documentation issue.
We're in the middle of looking at a few different service bus bugs that seem to relate to completion. I can't promise we'll fix this one immediately but we're at least going to do a first-pass investigation.
No worries, we've rewritten the code to always explicitly complete, so there's no time pressure on this on our part. Thanks for looking into it :)
The other issue Paul was mentioning is https://github.com/Azure/azure-webjobs-sdk/issues/1605. It might be related given that during the restart the operation cancel exceptions will be causing your in flight messages to fail (but they should be retried).
If I get a build up on myget with the fix for #1605 would you be willing to try it with your repro?
@mathewc I can give it a shot sometime this weekend. Let me know when the build is available.
Instructions for our myget feed are here.
Build 2.2.0-beta1-11248 of Microsoft.Azure.WebJobs.ServiceBus has the fix.
Good morning,
I had a bit of time to try and repro this morning. I did a run with 100K messages and a S3/10 instances. I did about 10 restarts and scaled down to S3/1 for the last 2K messages.
The run had 160 messages missing, which feels like an improvement from last time... but it's been many months since I last worked on this particular project, and have since moved to a different team so it's difficult to be certain.
> JobFeeder.exe validate
Establishing connection... ok
Validating... failed!
160 blobs are missing
4727, 4728, 4729, 4730, 4731, 4732, 4733, 4734, 4735, 4736, 4737, 4738, 4739, 4740, 4741, 4742,
4743, 4744, 4745, 5266, 5267, 5268, 5269, 5270, 5271, 5272, 5273, 5274, 5275, 5276, 5277, 5278,
5279, 5280, 5281, 5282, 5283, 5284, 5285, 5286, 5287, 5288, 5289, 5290, 5291, 5292, 5293, 5294,
5295, 5296, 5297, 5298, 5299, 5300, 5301, 5302, 5303, 5304, 5305, 5306, 5307, 5308, 5309, 5310,
5311, 5312, 5313, 5314, 5315, 5316, 5317, 5318, 5319, 5320, 5321, 5322, 5323, 5324, 5325, 5326,
5327, 5328, 5329, 5330, 5526, 5527, 5528, 5529, 5530, 5531, 5532, 5533, 5534, 5535, 5536, 5537,
5538, 5539, 5540, 5541, 5542, 5543, 5544, 5545, 5546, 5547, 5548, 5549, 5550, 5551, 5552, 5553,
5554, 5555, 5556, 5557, 5558, 5559, 5560, 5561, 5562, 5563, 5564, 5565, 5566, 5567, 5568, 5569,
5570, 5571, 5572, 5573, 5574, 5575, 5576, 5577, 5578, 5579, 5580, 5581, 5582, 5583, 5584, 16907,
16908, 16911, 16913, 16918, 16920, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931,
16932, 16933, 16934
Unfortunately, with my current repro code I don't have much more information than that to give. I've attached my packages.config for the WebJob project, in hope that it might prove helpful. In particular:
<package id="Microsoft.Azure.WebJobs" version="2.2.0-beta1-11248" targetFramework="net462" />
<package id="Microsoft.Azure.WebJobs.Core" version="2.2.0-beta1-11248" targetFramework="net462" />
<package id="Microsoft.Azure.WebJobs.ServiceBus" version="2.2.0-beta1-11248" targetFramework="net462" />
The code I use to repro is available in the issue description, all that is missing would be a powershell script that automates the start/restart of the webjob in order to fully automate the test+validation run.
Let me know if I can do anything else to help.
Whats the current status of this Issue?
As far as I'm aware it should still be possible to reproduce this issue following the repro steps provided above with the sample code. Unfortunately, I am no longer working in an Azure environment and do not have the resources to re-test. We had ended up using the workaround I described (manually completing all messages and disabling auto-complete) and it seemed to have fixed the problems. I no longer work there though so I'm not sure if that is still the case.
Cheers,
(This was initially opened with the Storage SDK as https://github.com/Azure/azure-storage-net/issues/539, I have adapted the title to fit the real cause of the issue)
I was asked to open it here for cross-referencing purposes.
Restarting a WebJob with a
ServiceBusTrigger
andAutoComplete=true
can lead to data loss.It looks like when a WebJob that writes to blob storage is restarted, there is a race condition where topic messages can be marked as completed (We use
AutoComplete = true
) but the blob isn't actually present in the storage.Repro
This seems to happen as soon as the WebJob needs to restart, so all the following lead to repro:
This doesn't seem to happen when restarting the WebApp that is hosting the WebJob. I believe this is normal because the WebJob is running on the Service Plan and not on the WebApp itself (Correct me if I'm wrong.)
I wrote a sample application that does away with all of our business logic and narrows it down to one of the Azure SDKs (or my ability to C#) for code review or if someone would like to try to reproduce the issue on their end.
The best way to reproduce this issue is to play around with the application settings, scale and to manually restart the webjob while it is processing messages.
Configuration
Storage Account
Standard
Disabled
LRS
WebJob
true
true
true
true
32
64
120 seconds
On
Sample App Pseudo-Code
WebJob
numbers
.1, 2, ..., N
in the topictopic/sub
.N
is the expected number of blobs.m
intopic/sub
,m
with arbitrary content in thenumbers
container. (await b.UploadTextAsync(m.ToString(), 'true')
)m
as complete (<auto-complete>
)Validation
numbers
, call the collectionblobs
.blobs.Count
==N
.expected
be the collection of numbers1..N
.expected.Except(blobs).Count == 0
.expected.Except(blobs)
.(Assertions fail and a list of blobs is printed.)
What we've tried so far
CloudBlockContainer.UploadTextAsync()
.async
calls to theirsync
version.What we're planning to try in parallel with this issue
AutoComplete
is the issueHttpClient
to try and rule out the Storage SDK.Function
(I would rather not resort to this in our production deployments)Impact
As you would guess, it's a bit unnerving to deploy to production or even enable autoscale / change settings in a production environment when it carries a risk of losing production data in the process.
Workaround
As mentioned in the linked issue, we've found that
AutoComplete = false
prevents this issue from happening, and we've resorted to using that right now.Actual Behavior
Messages are marked as completed before actual processing succeeds, so a shutdown will result in lost messages
Expected Behavior
The messages should only be completed if the handler executes successfully (returns and did not throw.)
Related Information
As I mentioned in the referenced issue:
This makes me think that either the AutoComplete documentation is wrong or there is a bug in the WebJobs.ServiceBus library (emphasis mine):
(Notice that one sentence reads "successful execution", the other simply says "completed processing")
To summarize, this is clearly not a Storage SDK bug (Sorry!), but might be a bug/documentation typo in the WebJobs.ServiceBus SDK.
Maybe the documentation should clarify what the purpose of AutoComplete is and if this is the expected behavior, clearly mention so.
Cheers, Alex