fog / fog-rackspace

Rackspace provider gem for Fog ecosystem
MIT License

Improve handling of asynchronous DNS callback request failures #36

Open cgunther opened 3 years ago

cgunther commented 3 years ago

Rackspace treats non-GET requests dealing with DNS (and maybe other services?) as asynchronous: https://docs.rackspace.com/docs/cloud-dns/v1/general-api-info/synchronous-and-asynchronous-responses

This gem handles that via: https://github.com/fog/fog-rackspace/blob/f8dcccd5ac9e7816d8eab250fe6b4e7c8b6fe0fa/lib/fog/rackspace/models/dns/record.rb#L61

https://github.com/fog/fog-rackspace/blob/f8dcccd5ac9e7816d8eab250fe6b4e7c8b6fe0fa/lib/fog/rackspace/models/dns/callback.rb#L7-L24

https://github.com/fog/fog-rackspace/blob/f8dcccd5ac9e7816d8eab250fe6b4e7c8b6fe0fa/lib/fog/rackspace/requests/dns/callback.rb#L5-L14

However, the polling code expects a 200, 202, or 204 response for a status poll to count as successful.

Rackspace also enforces rate limits on requests, 5 per second for polling status, returning a 413 code when exceeding the limit: https://docs.rackspace.com/docs/cloud-dns/v1/general-api-info/limits#rate-limits

This can create the following scenario: your code tries to create a record, and the initial request to Rackspace succeeds (while your application is still waiting). The gem starts polling for status, and one of the polling requests returns something other than 200, 202, or 204 (more common if you have multiple background jobs dealing with DNS simultaneously). The gem treats that as a failure, which surfaces as if your code to create the record failed. That's not fully accurate: only a status request failed. The underlying job to create the record may still be processing and eventually succeed on its own, but an error has already been raised in your application.

Given that the callback/status request is idempotent and just polling, I wonder if it should be less strict about what codes it expects, treating any response other than 200, 202, or 204 as a silent failure that triggers another retry. For example, if a callback request returned a 413 or 500, we likely don't need to treat the outer operation (adding a record) as a failure. We could just consider that one callback a failure and hope for a better response on the next retry, raising an error only if we exceed the number of retries.
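To make the idea concrete, here is a rough sketch of what a more tolerant polling loop could look like. It is not the gem's code: `wait_for_job`, `Response`, `CallbackFailed`, and the `max_attempts`/`delay` parameters are all hypothetical names made up for illustration, and the `'COMPLETED'`/`'ERROR'` job states are an assumption about what the status body contains. The block passed in stands in for the actual status request.

```ruby
# Hypothetical sketch; none of these names come from fog-rackspace.
Response = Struct.new(:status, :body)

class CallbackFailed < StandardError; end

# Codes that mean the status poll itself worked.
SUCCESS_CODES = [200, 202, 204].freeze

# Polls by calling the given block until the job reports a terminal state.
# A response outside SUCCESS_CODES (for example 413 or 500) only uses up one
# attempt; it does not fail the outer operation on its own.
def wait_for_job(max_attempts: 5, delay: 1)
  last_status = nil

  max_attempts.times do |attempt|
    response = yield
    last_status = response.status

    if SUCCESS_CODES.include?(response.status)
      # Assumed job states; anything else is treated as "still running".
      case response.body['status']
      when 'COMPLETED'
        return response
      when 'ERROR'
        raise CallbackFailed, "job failed: #{response.body['error']}"
      end
    end

    # Wait before the next poll, but not after the final attempt.
    sleep(delay) unless attempt == max_attempts - 1
  end

  raise CallbackFailed,
        "gave up after #{max_attempts} attempts (last HTTP status: #{last_status})"
end

# Usage: two failed polls (rate limited, then a server error) followed by a
# successful one. The outer operation still succeeds.
canned = [
  Response.new(413, {}),
  Response.new(500, {}),
  Response.new(200, { 'status' => 'COMPLETED' })
]
result = wait_for_job(max_attempts: 5, delay: 0) { canned.shift }
puts result.body['status']
```

With this shape, an actual job error (`'ERROR'` in the body) still raises right away, while a failed poll only counts against the retry budget. A real version would probably also want to back off for longer after a 413, since retrying immediately just adds to the rate limit pressure.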