cloudfoundry / diego-notes

Diego Notes
Apache License 2.0
23 stars 7 forks source link

What if only the Rep created ActualLRPs and there were no state machine? #5

Closed Amit-PivotalLabs closed 9 years ago

Amit-PivotalLabs commented 9 years ago

Rather than things creating and deleting ActualLRPs to indicate different desires, only the Rep would create or delete them to reflect actual state. This would involve several changes, most of which end up being simplifications:

We could even just have ActualLRP's State field reflect the Executor container's state. So CLAIMED might be renamed to INITIALIZING, and we could add in the other Executor states if we wanted to.

amitkgupta commented 9 years ago

One downside I see, but don't think is that bad, is when a Cloud Foundry user of Diego does cf apps MYAPP we'll have to check two places to produce a report, the ActualLRPs for the ProcessGuid and the recent Crashes associated with the ProcessGuid.

amitkgupta commented 9 years ago

Idea for what the RepBBS will get from the LRPBBS:

CreateActualLRP(desiredLRP models.DesiredLRP, cellID string, index int, logger lager.Logger) {
    // generates an instance guid, and then creates and stores an ActualLRP in the INITIALIZING state
    // this is done by the Rep after it has won an auction and initialized the container
    // note that there's no return values, no caller actually used the resulting ActualLRP, and the error was just used for logging
    // note the Since field goes away on ActualLRP
}

UpsertInitializingActualLRP(pg string, idx int, ig string, cid string, domain string, cid string) error {
    // check if there's an ActualLRP with same pg and idx:
    // -- if there is one, and ig, cid, or domain don't match, return an error, and the Rep should delete the container
    // -- otherwise upsert the actual LRP with state INITIALIZING
    // this is done by the Rep when its poller/event-streamer sees the container state
    // note that it doesn't return an ActualLRP, no caller actually used the result
    // note that there is no state transition logic, it's just reporting the true state of the world
}

UpsertRunningActualLRP(pg string, idx int, ig string, cid string, domain string, cid string, ports map[int]int, host string) error {
    // same as above, but with RUNNING
}

DeleteActualLRP(pg string, idx int, ig string, cid string, domain string, logger lager.Logger) {
    // delete ActualLRP with matching pg, idx, ig, and cid
    // no problem if none exists
}
onsi commented 9 years ago

This is closer to how things used to work. I agree it's simpler. I'll spend some time updating the doc to reflect this to make sure there aren't any gaps. A few annoying things fall out like having to make sure crash records for undesired instances get reaped, but the overall simplification here might be worth it.

emalm commented 9 years ago

@tedsuo and I were finding it helpful to group some of the ActualLRP fields into their own types that could then be embedded in and referenced from complete ActualLRPs. We collected ProcessGuid, Index, InstanceGuid, and Domain into an ActualLRPKey type, and Ports and Host into an ActualLRPNetInfo type. Then the lifecycle methods get much simpler signatures that more precisely reflect the validation and updates they perform:

CreateActualLRP(key ActualLRPKey) (*ActualLRP, error)
ClaimActualLRP(key ActualLRPKey, cellID string) (*ActualLRP, error)
StartActualLRP(key ActualLRPKey, cellID string, netInfo ActualLRPNetInfo) (*ActualLRP, error)
RemoveActualLRP(key ActualLRPKey, cellID string) error
onsi commented 9 years ago

I spent some more time thinking about this and I don't think we want to lose UNCLAIMED and CRASHED:

I think, when the dust settles, the code that comes out of the current approach will not be substantially more complicated than what was outlined above. To steal Eric's functions and flesh them out (as I imagine they might be implemented - also I removed the *ActualLRP in the return -- not sure what was using that):

/// Given to the Converger and Receptor.  Performed whenever an ActualLRP is desired

CreateActualLRP(desired DesiredLRP, index int) (error) {
    // generates an instance guid then CREATEs an ActualLRP
    // if the CREATE succeeds: sends a start request made up of (DesiredLRP, ActualLRPKey)
}

/// Given to the Converger and Receptor.  Performed whenever an ActualLRP is not desired.  This is currently calld "Retire" but really should just be "Stop"

StopActualLRP(key ActualLRPKey) error {
    // get the ActualLRP
    // if there is none, return nil
    // if there is one and its CLAIMED or RUNNING:
    //   send a stop to the rep
    // if there is one and its UNCLAIMED:
    //   CAD the ActualLRP
}

---

/// The rest are given to the Rep

ClaimActualLRP(key ActualLRPKey, cellID string) (error) {
    // get the ActualLRP
    // if there is none:
    //   return an error, rep deletes container
    // if it's UNCLAIMED:
    //   CAS to CLAIMED, return nil
    // if it's CLAIMED and matches (ig,cellid):
    //   do nothing, return nil
    // if it's CLAIMED and does not match (ig,cellid):
    //   return error, rep deletes container
    // if it's RUNNING and matches ig and cellid:
    //   CAS to CLAIMED, return nil
    // if it's running and does not match ig and cellid:
    //   return error, rep deletes container
}

StartActualLRP(key ActualLRPKey, cellID string, netInfo ActualLRPNetInfo) (error) {
    // get the ActualLRP
    // if there is none:
    //   CREATE RUNNING, return nil
    // if it's UNCLAIMED:
    //   CAS to RUNNING, return nil
    // if it's CLAIMED and matches (ig,cellid):
    //   CAS to RUNNING, return nil
    // if it's CLAIMED and does not match (ig,cellid):
    //   CAS to RUNNING, return nil
    // if it's RUNNING and matches ig and cellid:
    //   do nothing, return nil
    // if it's running and does not match ig and cellid:
    //   return error, rep deletes container
}

RemoveActualLRP(key ActualLRPKey, cellID string) error {
    //called when the Rep thinks the BBS has an ActualLRP on the rep's cell, but the rep does not have a matching container
    // get the ActualLRP
    // if there is none:
    //    return error
    // if there is one in the UNCLAIMED state:
    //    return error
    // if there is one in the CLAIMED/RUNNING state corresponding to (ig,cellid):
    //    CAD the ActualLRP, return nil
    // if there is one in the CLAIMED/RUNNING state corresponding to some other (ig,cellid):
    //     return error
    //
    // upon error: rep simply logs
}
onsi commented 9 years ago

Closing this out.