dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

AzureQueueDataManager not handling 404 when message is deleted under heavy(ish) load #1140

Closed andycwk closed 8 years ago

andycwk commented 8 years ago

While testing a new stream for processing the usage of file chunks (there could be hundreds per file), I've bumped into an issue where Azure intermittently returns a 404 on a message delete operation. This appears to happen under load.

The issue is that the queue manager appears to stop processing the queues after this point!

When I restart the silo, the queue starts processing again, but now I'm receiving stream errors for Orleans.Streams.QueueCacheMissException, and the queue processing is frozen again :(

Any thoughts or help?

My Azure queue stream provider is configured as follows...

      config.Globals.RegisterStreamProvider< AzureQueueStreamProvider >( "AzureQueueProvider", new Dictionary< string, string > {
        {"DataConnectionString", connectionString},
        {"DeploymentId", "streams"},
        {PersistentStreamProviderConfig.STREAM_PUBSUB_TYPE, "ImplicitOnly"}
      } );

My stream consumer is as follows...

  [ ImplicitStreamSubscription( "FragmentUsage" ) ]
  public class FragmentUsageStreamConsumer : Grain, IFragmentUsageStreamConsumer, IAsyncObserver< FragmentUsage >
  {
    private Logger logger;
    private int fragmentCount;

    public override async Task OnActivateAsync()
    {
      logger = base.GetLogger( string.Format( "{0} {1}", GetType().Name, base.IdentityString ) );

      var streamProvider = GetStreamProvider( StreamProvider.AzureQueueProvider );

      var handler = streamProvider.GetStream< FragmentUsage >(
        this.GetPrimaryKey(),
        StreamNamespaces.FragmentUsage );

      await handler.SubscribeAsync( this );

      await base.OnActivateAsync();
    }

    public async Task OnNextAsync( FragmentUsage item, StreamSequenceToken token = null )
    {
      fragmentCount++;
      await GrainFactory
        .GetGrain< IFragment >( item.FingerPrint )
        .RegisterUsageFor( item );
    }

    public Task OnCompletedAsync()
    {
      logger.Info( "Stream reported complete with {0} processed", "fragment".ToQuantity( fragmentCount ) );

      return TaskDone.Done;
    }

    public Task OnErrorAsync( Exception ex )
    {
      logger.Error( 0, "Stream reported error", ex );

      return TaskDone.Done;
    }
  }

I send to the stream using...

    public Task Publish( Guid streamId, TType message )
    {
      var stream = Provider.GetStream< TType >( streamId, ns );
      return stream.OnNextAsync( message );
    }

cheers!

gabikliot commented 8 years ago

"My SQL stream provider"

SQL or Azure Queue?

Do you have the silo log from when the "404 on a message delete operation" happened under load and the queue manager stopped processing the queues?

andycwk commented 8 years ago

It's Azure Queues, as per the configuration above :)

I don't have the log - I have configuration in place to produce one, but it was not there when I RDP'd into the server to fetch it down :(

I do have Exceptionless reports from a log consumer I have in place, but that doesn't really show what happened just before :(

andycwk commented 8 years ago

I don't know if it makes any difference, but I was running a single Large Azure worker role for the silo at the time!

veikkoeeva commented 8 years ago

@andycwk We don't have a (public) SQL stream provider yet, as far as I know. There's a rough sketch over at https://github.com/dotnet/orleans/issues/634 of how to make one that I believe would be fast. :)

andycwk commented 8 years ago

Ah, sorry - not sure what I was thinking when I typed up the description (edited now). If you look at the code snippet, it has the correct info :|

It's 100% an Azure Queue Provider implementation straight out of the box :)

gabikliot commented 8 years ago

@andycwk, we would really need the silo log to troubleshoot this. The protocol is supposed to handle failures to delete from the Azure Queue by design, but obviously, based on your observations, there is some bug. I don't think I can do much without the silo log. Can you please try to rerun, make sure logs are produced, and send us the log next time it happens?

Thank you.

gabikliot commented 8 years ago

@jason-bragg, do you think we should catch the exception if this line throws (https://github.com/dotnet/orleans/blob/master/src/OrleansAzureUtils/Providers/Streams/AzureQueue/AzureQueueAdapterReceiver.cs#L98), ignore it, and never throw from MessagesDeliveredAsync?

The protocol should work either way, whether we ignore it or not: if we throw, it will break the loop in the agent, which will restart on the next timer; if we ignore it, we just keep going the same way. But maybe, if we do throw, we get into a loop of purging the cache again and again, and if it keeps throwing for some time (imagine AQ is really busy and every attempt to delete fails), we will basically not be doing any event processing in the agent, since the loop is cut short every time.

Looks like it might be safer to catch and ignore? What do you think?
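
For illustration only, here's a rough sketch of the "catch and ignore in the adapter" option being discussed. The member names (queue, logger, AzureQueueBatchContainer and its CloudQueueMessage property, DeleteQueueMessage) are assumptions modelled on the general shape of the 1.x AzureQueueAdapterReceiver, not code copied from the repository:

    public async Task MessagesDeliveredAsync( IList< IBatchContainer > messages )
    {
      try
      {
        // Delete every delivered message from the underlying Azure Queue.
        var deletes = messages
          .Cast< AzureQueueBatchContainer >()
          .Select( container => queue.DeleteQueueMessage( container.CloudQueueMessage ) );
        await Task.WhenAll( deletes );
      }
      catch ( Exception exc )
      {
        // A 404 here means the message is already gone (or its pop receipt expired);
        // Azure Queue will redeliver if necessary, so log and swallow rather than
        // letting the exception break the agent's pump loop.
        logger.Warn( 0, "Failed to delete messages from Azure Queue; ignoring.", exc );
      }
    }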

jason-bragg commented 8 years ago

@gabikliot Short version: I think the agent should log and ignore errors from MessagesDeliveredAsync. I don't really see this as an AQ issue; I think this is about the agent's error handling of calls to receivers. In general, if the agent calls MessagesDeliveredAsync and the call fails, how should we handle that? I don't think there is much we can do other than log and ignore. :/ This behavior should, however, be in the agent, not in the adapter, unless we explicitly state that MessagesDeliveredAsync should not throw, which also limits what error handling the agent can perform in the future.
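
And, for comparison, a rough sketch of the agent-side option: wrap the call to the receiver so a throwing adapter cannot stall the pump loop. The helper name and the logger field are hypothetical and not taken from the pulling agent; only the IQueueAdapterReceiver.MessagesDeliveredAsync callback is the real method under discussion:

    // Hypothetical helper inside the pulling agent (names are illustrative only).
    private async Task NotifyDeliveredSafely( IQueueAdapterReceiver receiver, IList< IBatchContainer > delivered )
    {
      try
      {
        await receiver.MessagesDeliveredAsync( delivered );
      }
      catch ( Exception exc )
      {
        // Log and ignore: there is little the agent can do about a failed delivery
        // acknowledgement, and the next timer tick resumes pumping either way.
        logger.Warn( 0, "Adapter receiver threw from MessagesDeliveredAsync; continuing.", exc );
      }
    }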

gabikliot commented 8 years ago

I actually thought quite a bit about this. The reason I still think it should be in the adapter is that the meaning of success or failure of MessagesDeliveredAsync depends, I think, on the adapter. In AQ the failure is OK since AQ will redeliver and thus the agent does not need to do anything. The AQ adapter knows that, not the agent. But in some other adapter that may not be the case, so the agent does need to know about it, maybe to do something. I don't know what that something could be right now beyond log and ignore, but maybe we will have a scenario for it someday. So that was the reason I thought it would be more correct to let the adapter decide whether the agent should know about it or not.

But that question is secondary to the fact that we should catch (in adapter or in agent, one of the two) - agree? If after my explanation you still think it should be in the agent, I am OK with that. We can always change this in the future when such a new scenario arises.

jason-bragg commented 8 years ago

I think we're pretty much on the same page.

"But that question is secondary to the fact that we should catch (in adapter or in agent, one of the two) - agree?"

Yup. I am mainly claiming that the agent should handle adapter errors, even if there is not much it can do. As I think you've also concluded, if we fix this only in the AQ adapter, it remains a potential issue in our framework.

"AQ will redeliver and thus the agent does not need to do anything."

Agreed. My initial take on this was that the AQ adapter should probably throw, as this propagates the error to the agent, which at least enables the agent to do something about it, even if there's nothing we can really do at this time. This is more of an information question, but I can completely see your point.

So let's address this in the AQ adapter and the agent? The agent because we should be handling adapter errors, and the AQ adapter because that is the proper behavior for that adapter.

gabikliot commented 8 years ago

"So let's address this in the AQ adapter and the agent?"

Yes. Agreed.

andycwk commented 8 years ago

Hey guys, thanks for the activity on this one... I'm sorry I've not been interactive with the comment stream; I'm really busy at the moment trying not to overrun a go-live date we have for our services!

I'm very keen to move my code base back to the stream model, but at the moment, with go-live looming over me, I've been forced to switch over to 'old world' queue and worker processes to deal with my needs for now.

It looks like you guys are onto the issue, and I will be more than happy to test the dev builds against our code base and report back with logs etc. :)

Thanks guys, and loving the direction you're taking Orleans!

gabikliot commented 8 years ago

Fixed via #1203.