Netflix / dispatch

All of the ad-hoc things you're doing to manage incidents today, done for you, and much more!
Apache License 2.0
4.95k stars 488 forks source link

fix(plugin): delete non-conforming messages from sqs queue #5128

Closed wssheldon closed 2 weeks ago

wssheldon commented 2 weeks ago

Messages in the queue that do not conform should not stay in the queue

kevgliss commented 2 weeks ago

Is there a reason why you'd rather do this instead of letting them fail a few times and using the DLQ functionality?

wssheldon commented 2 weeks ago

Is there a reason why you'd rather do this instead of letting them fail a few times and using the DLQ functionality?

My thinking is that non-conforming instances shouldn't be retried since we are certain they will not work and can safely be immediately deleted. We would not want to re-process them either.

kevgliss commented 2 weeks ago

Is there a reason why you'd rather do this instead of letting them fail a few times and using the DLQ functionality?

My thinking is that non-conforming instances shouldn't be retried since we are certain they will not work and can safely be immediately deleted. We would not want to re-process them either.

I guess it depends on how non-conforming they are. The nice thing about the DLQ is that if the error is on the Dispatch side, we can push a change and then re-drive all of the messages.

wssheldon commented 2 weeks ago

Is there a reason why you'd rather do this instead of letting them fail a few times and using the DLQ functionality?

My thinking is that non-conforming instances shouldn't be retried since we are certain they will not work and can safely be immediately deleted. We would not want to re-process them either.

I guess it depends on how non-conforming they are. The nice thing about the DLQ is that if the error is on the Dispatch side, we can push a change and then re-drive all of the messages.

Would it make sense to only re-drive on general exceptions in the create_signal_instance function?

                try:
                    signal_instance = signal_service.create_signal_instance(
                        db_session=db_session,
                        signal_instance_in=signal_instance_in,
                    )
                except Exception as e:
                    log.exception(f"Unable to create signal instance: {e}")
                    continue

The schema is the schema is my thinking and any Validation errors we can safely discard. That's where my head's at but happy to go either direction if you think it makes sense.

kevgliss commented 2 weeks ago

Would it make sense to only re-drive on general exceptions in the create_signal_instance function?

                try:
                    signal_instance = signal_service.create_signal_instance(
                        db_session=db_session,
                        signal_instance_in=signal_instance_in,
                    )
                except Exception as e:
                    log.exception(f"Unable to create signal instance: {e}")
                    continue

The schema is the schema is my thinking and any Validation errors we can safely discard. That's where my head's at but happy to go either direction if you think it makes sense.

Yeah, the failure mode I was thinking about was things like the underlying database not having the correct migrations. This would cause create to fail, but the messages should still be re-processed.

wssheldon commented 2 weeks ago

@kevgliss I agree this is no longer needed with a DQL setup.