cloudfoundry / diego-notes


How to handle being out of capacity? #2

Closed by onsi 9 years ago

onsi commented 9 years ago

The doc considers this a TBD. Let's figure it out.

onsi commented 9 years ago

@fraenkel said: I am leery of another state since it's not really a state of the LRP but of the Cell. There is no knowledge as to why it failed or when it would recover, other than by retrying. Whether it's full or just some network failure, it's really no different. Yes, we could do better at retries if we knew it was a placement issue, but you still don't know when it's safe to retry.

onsi commented 9 years ago

@onsi replied: It's not even the state of "the Cell" but really the state of "the Cluster". I agree that adding yet another state to the LRP is sucky.

I'd go back to what we need. If the cluster is full we need:

  1. To log loudly and emit metrics to make sure an administrator knows.
  2. To keep retrying periodically in case space frees up or the administrator acts.
  3. To let the user know, somehow, that we aren't running their ActualLRPs because we're out of capacity.

1 can be solved with logs and metrics.

2 can be solved by simply leaving the ActualLRP in the UNCLAIMED state. The converger will retry every 30s (which should be OK).

3 is the tricky one and is the main motivation for a new FAILED state (or, perhaps better, UNALLOCATABLE or UNCLAIMABLE: a state that behaves just like UNCLAIMED but signals to the consumer that the cluster is full).
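
To make the shape of this concrete, here's a minimal sketch of how a placement failure could cover all three needs. This is not Diego's actual code: the `ActualLRP` struct, `Metrics` interface, and counter name below are illustrative stand-ins, and only the UNCLAIMED/UNCLAIMABLE state names come from this thread.

```go
// Sketch only: hypothetical, simplified types standing in for the real BBS records.
package main

import (
	"fmt"
	"log"
)

type State string

const (
	StateUnclaimed   State = "UNCLAIMED"
	StateUnclaimable State = "UNCLAIMABLE" // proposed: like UNCLAIMED, but signals the cluster is full
)

// ActualLRP is a hypothetical, simplified representation of an ActualLRP record.
type ActualLRP struct {
	ProcessGuid string
	Index       int
	State       State
}

// Metrics is a hypothetical metrics emitter.
type Metrics interface {
	IncrementCounter(name string)
}

// handlePlacementFailure sketches the three needs above when no Cell can take an LRP:
//  1. log loudly and emit a metric so an operator notices,
//  2. leave the record in an unclaimed-like state so the converger retries it,
//  3. make the at-capacity reason visible to whoever reads the record.
func handlePlacementFailure(lrp *ActualLRP, metrics Metrics) {
	// 1. Loud log line plus a counter an operator can alert on (name is illustrative).
	log.Printf("auction failed: no capacity for %s/%d", lrp.ProcessGuid, lrp.Index)
	metrics.IncrementCounter("LRPAuctionsFailed")

	// 2 & 3. The converger's periodic pass (every 30s, per the discussion above)
	// re-auctions unclaimed-like records, and the UNCLAIMABLE value tells the
	// consumer *why* the instance isn't running.
	lrp.State = StateUnclaimable
}

type noopMetrics struct{}

func (noopMetrics) IncrementCounter(name string) { fmt.Println("counter:", name) }

func main() {
	lrp := &ActualLRP{ProcessGuid: "my-app-guid", Index: 0, State: StateUnclaimed}
	handlePlacementFailure(lrp, noopMetrics{})
	fmt.Println("state after placement failure:", lrp.State)
}
```

The point is that 1 is just a log line plus a counter, 2 falls out of leaving the record in an unclaimed-like state for the converger's periodic pass, and 3 only needs the state value (or some equivalent flag) to be visible to the consumer.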

Will open an Issue with all this in it and perhaps we can discuss. Not updating the doc just yet.

jbayer commented 9 years ago

3 could also be handled by writing a system message to the log stream for the LRP. That's how I found the $PWD thing this weekend.
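
For illustration only, the idea would look roughly like the sketch below; the `LogStreamer` interface is hypothetical, not the real loggregator/dropsonde API.

```go
// Sketch of surfacing the at-capacity condition via the app's log stream.
package main

import (
	"fmt"
	"time"
)

// LogStreamer is a hypothetical hook into an LRP's log stream.
type LogStreamer interface {
	// EmitSystemMessage writes a system-sourced line into the application's
	// log stream, alongside its own stdout/stderr output.
	EmitSystemMessage(processGuid, message string)
}

type stdoutStreamer struct{}

func (stdoutStreamer) EmitSystemMessage(processGuid, message string) {
	fmt.Printf("%s [SYSTEM/%s] %s\n", time.Now().Format(time.RFC3339), processGuid, message)
}

func main() {
	var streamer LogStreamer = stdoutStreamer{}
	// When placement fails, the user sees the reason in their log output
	// without needing a new ActualLRP state.
	streamer.EmitSystemMessage("my-app-guid",
		"insufficient resources: cluster is out of capacity; instance will be retried")
}
```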

onsi commented 9 years ago

Killing the FAILED state idea -- I've submitted #15 as a proposal for how to communicate at-capacity situations back to the user.