Hybrid Mode MVP - Githubissues

treeder commented 6 years ago

There are 3 main parts that will work together to make hybrid work.

1) Fn Server (API) 2) Fn LB 3) Fn Runners/Workers

At a high level, they interact like this:

Control Plane		Work Plane
Fn Server	←	Fn LB(s)
↓	↑	↓
Database	←	Fn Runner(s)
MQ		↓
Blob Store		?Blob Store

Where everything in the Work Plane talks to the Fn Server to tell it what to do.

As a bare minimum starting point, we should have 3 containers that we can run like this:

Start Control Plane:

docker run ... fnproject/fnserver

Start Work Plane

docker run ... fnproject/fnlb
docker run .... fnproject/fnrunner

Fn Server

This probably doesn't really have to change much except:

flag to disable async processing
new API endpoints to support below

Fn LB

allow runners to register themselves (simple API on LB)

Runners

When fnrunner starts, it will:
- register itself with the fnlb (k8s does this for us, since we're using this)
- start asking the fnserver for async calls
For each call it gets, it will:
- execute call
- report status to fnserver
- send logs to fnserver (this could store logs in local blob store, but for first attempt, let's assume this isn't the case)

rdallman commented 6 years ago

allow runners to register themselves (how to do this with multiple LBs running?)

it's backed by a db. just start lb with DB_URL=... same as server, supports mysql/pg.

rdallman commented 6 years ago

new API endpoints to support below

I have a sketched out design doc for this api for fn server that I think will work & be simple, i can try to put it into something more palatable. was attempting to add reasonable MQ semantics at the same time, but it's possible we could/should punt this.

zootalures commented 6 years ago

Am i right in thinking: async will go :

Outside -> (0: push) FnLB -> (1: push) Server-> (2: push) MQ -> (3: (poss via API) pull) runner

If so then direct MQ is a good start for 3 and then wrap in an API once that works I think

treeder commented 6 years ago

That sounds about right. I don't think we'd want runner to directly access MQ though, would add another layer of authentication and what not that we'd have to think about. Runner should be pretty dumb for the most part and just get work (sync) or ask for work (async), run it, report results, that's basically all it should think about. And hopefully just talk to one thing.

In fact, we used to have a/tasks endpoint that already did this via the API, but I think it got removed in the recent gutting of core.

rdallman commented 6 years ago

runner will also have to post work, for async. async requests will come in through the same lb as sync (in front of 'runners'), presumably. agree about direct access, shouldn't be too bad to proxy through api nodes.

rdallman commented 6 years ago

Fnlb ... ?? (i know there's more to this)

I don't think so. the lb just needs to know about the set of nodes. in theory, in our k8s config somewhere, there is a hook we can call when we add a new 'runner' pod that calls the lb address to add itself, so the lb is mostly an operational task at this point. imo, we should just do authentication at the 'runner' level, not the lb level, to keep the lb from getting gummed up (in 2+ ways).

rdallman commented 6 years ago

?Blob Store

it would be nice initially to have all data storage on the LHS of this dia. then the runner is just sending all data to the LHS api and doesn't have to worry about straddling 2+ things for certain things. We're still not certain that logs on the RHS will be in demand from users.

skinowski commented 6 years ago

I wonder if we could move async processing completely to RHS. Currently, control-plane is not really %100 control plane as it's in async processing path... We should be able to modify app/routes regardless of the load on the front door. (eg. MQ and call logs/status storage on RHS side versus mostly-read-only config DB on LHS.)

treeder commented 6 years ago

@skinowski one of the main objectives of all this is to keep the database and what not on the LHS so we/users don't have to maintain multiple databases/mqs/obstores.

msgodf commented 6 years ago

I'm going to make a start on this, namely the part

flag to disable async processing

First, running two instances of the fn server pointing at the same redis instance. Then I'll add a flag so that one of them only serves requests to invoke asynchronous functions on /r (so it puts the request on the queue), and the other serves both synchronous requests and pulls requests from the queue.

rdallman commented 6 years ago

should have a doc written up tomorrow of API design, would be great to split up work and get your thoughts (kafka & api implementation & testing & auth) @msgodf -- disabling async we do need, thanks!

rdallman commented 6 years ago

@msgodf @skinowski I've started a doc here https://github.com/fnproject/fn/blob/hybrid-api/docs/developers/hybrid.md based on discussion today (in the flesh), feedback appreciated. But overall, pretty lightweight. mostly, we need to do some kafka reading. sync is really easy.

hhexo commented 6 years ago

Shoud the /runner/... endpoints be api-versioned too, i.e. /runner/v1/...?

jan-g commented 6 years ago

It's not clear to me how the partitions/topics are expected map to runners. Partitions are (essentially) static - which is to say there's an async process for expanding the number of partitions a topic has, but that expansion can take seconds. (Brokers also have a number of FDs open per partition they're handling.)

Additionally, the API nodes are the kafka consumers in this picture. If they're in the same consumer group then you may want as many partitions as API nodes (or some reasonably larger number) to balance load across API nodes.

However, using the runner node (or something derived from it) as a partition key means that the work will land in some partition. There's no guarantee that a (k8s) load-balanced call from any particular runner will land on an API node whose consumer is handling the partition associated with that runner. (In the extreme case, if there are more API nodes than partitions, it's possible that an API node will be "starved" and have no work to deliver.)

I think the expected model with kafka is also that topics are relatively static; the way the broker works, it doesn't sound like a topic per (app, route) combo is workable either.

Thinking about it, even if we don't care about rendezvous-style clustering of (app, route) to preferred runners, there's still a problem with API nodes all in the same consumer group - to wit, if there's just one async Call and it's landed in partition #0 then runners may fail to be tasked with that Call if their requests to /runner/dequeue are loadbalanced to an API node whose consumer isn't servicing that partition.

The alternative there is potentially that we use one CG per API node (so all async Calls are effectively broadcast to all API nodes) - since there's some DB work to do before handing the work off to the runner, there's less of a problem perhaps in just looping through Call records until we find the next one that's untaken. (One might wonder why we use a MQ in that instance.) However, given the semantics of Kafka's offset commit, the API node would have to serve up the first work item it found, unless we wanted to drastically complicate the process of de-queuing work at the API node.

jan-g commented 6 years ago

It's possible that a particular kafka Consumer knows which partitions are in the set that are currently bound to it. The ConsumerGroup rebalance is another "eventual" kind of operation, but one option is for API nodes to gossip amongst themselves so that they know where a partition is being served from, then internally proxy a request for (partition X) from one API node to the one that owns its partition.

(Or even open/close consumer clients and pick the partition manually.)

hhexo commented 6 years ago

I am worried that there are layers between the runner nodes on the RHS, which have the information about what hot containers are running and therefore would prefer to dequeue a job that fits (because it would be efficient), and the MQ partitions on the LHS which could be partitioned with the same criteria for efficiency. Along the way, we don't just have the fn LB, there's likely to to be an OCI LB and a Kube service LB which could shuffle requests inefficiently because they don't know about the partitioning information. Basically, our final LB would have to undo the work that the OCI LB and the Kube LB have done.

The question is whether the partition-aware strategy that has to undo the OCI/Kube LBs distribution is really more efficient than a naive non-partition-aware pulling of jobs from any node, or whether the redirections at the last layer (where a request hits a pod which says "no, I'm not the right one, go there instead") add enough latency that statistically it's not worth the effort of doing it when the numbers are big. Also, when a user enqueues an async job, how much do they care about latency is another open question.

I don't know the answers, but I feel that not worrying about partitions at this stage makes the implementation simpler, and it can also be iteratively improved when we do have the data to determine the efficiency answer. :)

rdallman commented 6 years ago

I don't know the answers, but I feel that not worrying about partitions at this stage makes the implementation simpler, and it can also be iteratively improved when we do have the data to determine the efficiency answer. :)

I wish this were possible, kafka does not seem to have any 'easy' way to provide MPMC semantics without thinking about partitioning. We need to have someway to have 1 process chewing on a partition, we can't just let every request ask a queue for the head (this has pros and cons, not having to worry about timeouts sounds... delightful). Initial thinking was that API nodes would have their own partition and be part of a consumer group, and this seems like the naive approach that would likely work but has at least one deficiency in that we lose distribution information for RHS processing. The thinking is that, there are likely a small set of API nodes and a large set of runner nodes, and we want a runner node to process some subset of calls so that we can re-use hot containers, image caching (the same reasons we have fnlb for sync). Can think about other ways to accomplish this.

The complication of the k8s / round-robin load balancer in the middle is something that didn't come to mind (thanks for pointing it out). It's kind of unfortunate that we have to proxy runner nodes over to the LHS to talk to kafka, in a 'normal' (non-hybrid) deployment runners (full fn servers) getting their own partition seems like it would work just fine (maybe optimal, even). It seems brittle to rely on having a 'sticky' (vs. round robin, et al) load balancer in between RHS and LHS so that kafka clients will work, but it seems like it would maybe close this hole? since this API is hanging off on the side we could open a long lived connection between an API node and a runner node so that 1 API node could serve a runner's partition to it. this is kind of smelly, need to marinate on this a bit.

Not completely bent on having a partition per runner, the main sticking point is the distribution, but even if we have a partition per API node we have an issue of routing a msg.Commit() (increment partition offset) to an API node where the consumer for that partition is available. :(

Partitions are (essentially) static - which is to say there's an async process for expanding the number of partitions a topic has, but that expansion can take seconds.

This sounds edible, we expect that the set of runners is scaling up and down but relatively infrequently (likely 10s of minutes). Not sure how well consumer groups will adapt to this, possibly a bad idea. API nodes will be less elastic and may make a better candidate.

Shoud the /runner/... endpoints be api-versioned too, i.e. /runner/v1/...?

/v1/runner makes sense, API nodes serve all /v1/ endpoints so it's uniform. good catch, thanks.

jan-g commented 6 years ago

There are some other complications with effectively pre-assigning a task to a runner-oriented partition. There is no easy way to manage work stealing in that scenario - runnner A might be busy for a long time, runner B (which favours the same hot set) free; but there's not an easy way to land the work there. (If the intent is that the FNLB assigns an async task to a runner with the view that that work will be scheduled soon, then we'll need a way to ensure that capacity for it is reserved.) You mentioned timeouts. The other question revolves around async work that turns up with a deadline. If that is fast approaching then the Call's "priority" for placement should rise, I presume. Simple queues (especially under the constraint that multiplying topics or partitions is expensive) don't feel like a great fit here.

-- Excuse typos. Phone screens don't make great ketbroads.

-------- Original message -------- From: Reed Allman notifications@github.com Date: 01/12/2017 6:17 p.m. (GMT+00:00) To: fnproject/fn fn@noreply.github.com Cc: jan grant jang@ioctl.org, Comment comment@noreply.github.com Subject: Re: [fnproject/fn] Hybrid Mode MVP (#531)

I don't know the answers, but I feel that not worrying about partitions at this stage makes the implementation simpler, and it can also be iteratively improved when we do have the data to determine the efficiency answer. :)

I wish this were possible, kafka does not seem to have any 'easy' way to provide MPMC semantics without thinking about partitioning. We need to have someway to have 1 process chewing on a partition, we can't just let every request ask a queue for the head (this has pros and cons, not having to worry about timeouts sounds... delightful). Initial thinking was that API nodes would have their own partition and be part of a consumer group, and this seems like the naive approach that would likely work but has at least one deficiency in that we lose distribution information for RHS processing. The thinking is that, there are likely a small set of API nodes and a large set of runner nodes, and we want a runner node to process some subset of calls so that we can re-use hot containers, image caching (the same reasons we have fnlb for sync). Can think about other ways to accomplish this. The complication of the k8s / round-robin load balancer in the middle is something that didn't come to mind (thanks for pointing it out). It's kind of unfortunate that we have to proxy runner nodes over to the LHS to talk to kafka, in a 'normal' (non-hybrid) deployment runners (full fn servers) getting their own partition seems like it would work just fine (maybe optimal, even). It seems brittle to rely on having a 'sticky' (vs. round robin, et al) load balancer in between RHS and LHS so that kafka clients will work, but it seems like it would maybe close this hole? since this API is hanging off on the side we could open a long lived connection between an API node and a runner node so that 1 API node could serve a runner's partition to it. this is kind of smelly, need to marinate on this a bit. Not completely bent on having a partition per runner, the main sticking point is the distribution, but even if we have a partition per API node we have an issue of routing a msg.Commit() (increment partition offset) to an API node where the consumer for that partition is available. :(

Partitions are (essentially) static - which is to say there's an async process for expanding the number of partitions a topic has, but that expansion can take seconds.

This sounds edible, we expect that the set of runners is scaling up and down but relatively infrequently (likely 10s of minutes). Not sure how well consumer groups will adapt to this, possibly a bad idea. API nodes will be less elastic and may make a better candidate.

Shoud the /runner/... endpoints be api-versioned too, i.e. /runner/v1/...?

/v1/runner makes sense, API nodes serve all /v1/ endpoints so it's uniform. good catch, thanks.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/fnproject/fn","title":"fnproject/fn","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/fnproject/fn"}},"updates":{"snippets":[{"icon":"PERSON","message":"@rdallman in #531: \u003e I don't know the answers, but I feel that not worrying about partitions at this stage makes the implementation simpler, and it can also be iteratively improved when we do have the data to determine the efficiency answer. :)\r\n\r\nI wish this were possible, kafka does not seem to have any 'easy' way to provide MPMC semantics without thinking about partitioning. We need to have someway to have 1 process chewing on a partition, we can't just let every request ask a queue for the head (this has pros and cons, not having to worry about timeouts sounds... delightful). Initial thinking was that API nodes would have their own partition and be part of a consumer group, and this seems like the naive approach that would likely work but has at least one deficiency in that we lose distribution information for RHS processing. The thinking is that, there are likely a small set of API nodes and a large set of runner nodes, and we want a runner node to process some subset of calls so that we can re-use hot containers, image caching (the same reasons we have fnlb for sync). Can think about other ways to accomplish this.\r\n\r\nThe complication of the k8s / round-robin load balancer in the middle is something that didn't come to mind (thanks for pointing it out). It's kind of unfortunate that we have to proxy runner nodes over to the LHS to talk to kafka, in a 'normal' (non-hybrid) deployment runners (full fn servers) getting their own partition seems like it would work just fine (maybe optimal, even). It seems brittle to rely on having a 'sticky' (vs. round robin, et al) load balancer in between RHS and LHS so that kafka clients will work, but it seems like it would maybe close this hole? since this API is hanging off on the side we could open a long lived connection between an API node and a runner node so that 1 API node could serve a runner's partition to it. this is kind of smelly, need to marinate on this a bit.\r\n\r\nNot completely bent on having a partition per runner, the main sticking point is the distribution, but even if we have a partition per API node we have an issue of routing a msg.Commit() (increment partition offset) to an API node where the consumer for that partition is available. :(\r\n\r\n\u003e Partitions are (essentially) static - which is to say there's an async process for expanding the number of partitions a topic has, but that expansion can take seconds. \r\n\r\nThis sounds edible, we expect that the set of runners is scaling up and down but relatively infrequently (likely 10s of minutes). Not sure how well consumer groups will adapt to this, possibly a bad idea. API nodes will be less elastic and may make a better candidate. \r\n\r\n\u003e Shoud the /runner/... endpoints be api-versioned too, i.e. /runner/v1/...?\r\n\r\n/v1/runner makes sense, API nodes serve all /v1/ endpoints so it's uniform. good catch, thanks."}],"action":{"name":"View Issue","url":"https://github.com/fnproject/fn/issues/531#issuecomment-348568462"}}}

rdallman commented 6 years ago

The other question revolves around async work that turns up with a deadline. If that is fast approaching then the Call's "priority" for placement should rise, I presume.

I am game to not have an idea of a deadline, as I may have interpreted this. I am somewhat concerned about the fact that we effectively need to implement real-timestamp-based priority queueing, but optimistic that we can use kafka's offsets as timestamps, to achieve delayed messages (I'm not sure this is the same as deadline? i.e. run this call at this time in the future X?), and even then I don't think we can possibly make any guarantees about the immediacy of running that; it'll be in line with anything else that came in and was scheduled to run before that time X. As far as priorities, our redis implementation has these but only uses p0. I would like to avoid adding explicit [p0,p1,p2] priorities to start (and forever, if we're being honest) since I think it's going to be quite a bit of work just to get one priority in.

There is no easy way to manage work stealing in that scenario

yep, agree. as proposed it is very optimistic that the node that enqueued it was routed to because it has enough room to run it once that is eventually dequeued, which may be very far in the future.

figuring out how to map a message that a runner has received from an API node to a request to delete that message to the same API node is still kind of what I'm stuck on. Even if all API nodes stay healthy, it seems like we're really fighting the kafka client semantics here. Apparently, they have an http gateway that you can stick in front of kafka to make it so consumers / producers can be less precise about exact positioning of messages wrt partitions, maybe this is the route we need to go down but damn, turtles. for reference: https://github.com/confluentinc/kafka-rest

jan-g commented 6 years ago

On a different tack: if nobody picks up the runner to lb registration I'll do that on Monday (following the idea of a k8s Grouper; we already have a REST interface for manual poking into allGrouper).

-- Excuse typos. Phone screens don't make great ketbroads.

-------- Original message -------- From: Reed Allman notifications@github.com Date: 02/12/2017 1:33 a.m. (GMT+00:00) To: fnproject/fn fn@noreply.github.com Cc: jan grant jang@ioctl.org, Comment comment@noreply.github.com Subject: Re: [fnproject/fn] Hybrid Mode MVP (#531)

The other question revolves around async work that turns up with a deadline. If that is fast approaching then the Call's "priority" for placement should rise, I presume.

I am game to not have an idea of a deadline, as I may have interpreted this. I am somewhat concerned about the fact that we effectively need to implement real-timestamp-based priority queueing, but optimistic that we can use kafka's offsets as timestamps, to achieve delayed messages (I'm not sure this is the same as deadline? i.e. run this call at this time in the future X?), and even then I don't think we can possibly make any guarantees about the immediacy of running that; it'll be in line with anything else that came in and was scheduled to run before that time X. As far as priorities, our redis implementation has these but only uses p0. I would like to avoid adding explicit [p0,p1,p2] priorities to start (and forever, if we're being honest) since I think it's going to be quite a bit of work just to get one priority in.

There is no easy way to manage work stealing in that scenario

yep, agree. as proposed it is very optimistic that the node that enqueued it was routed to because it has enough room to run it once that is eventually dequeued, which may be very far in the future. figuring out how to map a message that a runner has received from an API node to a request to delete that message to the same API node is still kind of what I'm stuck on. Even if all API nodes stay healthy, it seems like we're really fighting the kafka client semantics here. Apparently, they have an http gateway that you can stick in front of kafka to make it so consumers / producers can be less precise about exact positioning of messages wrt partitions, maybe this is the route we need to go down but damn, turtles. for reference: https://github.com/confluentinc/kafka-rest

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/fnproject/fn","title":"fnproject/fn","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/fnproject/fn"}},"updates":{"snippets":[{"icon":"PERSON","message":"@rdallman in #531: \u003e The other question revolves around async work that turns up with a deadline. If that is fast approaching then the Call's \"priority\" for placement should rise, I presume.\r\n\r\nI am game to not have an idea of a deadline, as I may have interpreted this. I am somewhat concerned about the fact that we effectively need to implement real-timestamp-based priority queueing, but optimistic that we can use kafka's offsets as timestamps, to achieve delayed messages (I'm not sure this is the same as deadline? i.e. run this call at this time in the future X?), and even then I don't think we can possibly make any guarantees about the immediacy of running that; it'll be in line with anything else that came in and was scheduled to run before that time X. As far as priorities, our redis implementation has these but only uses p0. I would like to avoid adding explicit [p0,p1,p2] priorities to start (and forever, if we're being honest) since I think it's going to be quite a bit of work just to get one priority in. \r\n\r\n\u003e There is no easy way to manage work stealing in that scenario\r\n\r\nyep, agree. as proposed it is very optimistic that the node that enqueued it was routed to because it has enough room to run it once that is eventually dequeued, which may be very far in the future. \r\n\r\nfiguring out how to map a message that a runner has received from an API node to a request to delete that message to the same API node is still kind of what I'm stuck on. Even if all API nodes stay healthy, it seems like we're really fighting the kafka client semantics here. Apparently, they have an http gateway that you can stick in front of kafka to make it so consumers / producers can be less precise about exact positioning of messages wrt partitions, maybe this is the route we need to go down but damn, turtles. for reference: https://github.com/confluentinc/kafka-rest"}],"action":{"name":"View Issue","url":"https://github.com/fnproject/fn/issues/531#issuecomment-348657162"}}}

hhexo commented 6 years ago

The thinking is that, there are likely a small set of API nodes and a large set of runner nodes, and we want a runner node to process some subset of calls so that we can re-use hot containers, image caching

I'm still not 100% convinced about the partitioning of the RHS runners... If we have a large number of runners but they are locked to the same partitions as API nodes (so statistically you have N api nodes and K * N workers), and if there's an imbalance of load on a set of functions (some with high load, some with low load), don't we risk that we have a pool of runners staying idle while others are overloaded like Jan said?

I think we need some form of work stealing, even though when you steal work you are "cold" and you have to pull images and start up containers. The question is how significant that is on async functions that the user has started with the expectation that "at some point they will complete". Maybe they don't care about the 100ms of container start, but they do about the few seconds of image pulling. Hm.

I'm wondering how statistics would work with work-stealing, maybe we get our desired effect anyway. If we have partitioned API nodes but naive runner nodes (which just pull the next job from whatever node the LBs happen to land the request on), then if a function is more loaded than others eventually all the runner nodes will cache its image just by virtue of statistics. The downside is that if all functions are loaded equally then all runner nodes will try to cache all images and that's a bit heavy.

figuring out how to map a message that a runner has received from an API node to a request to delete that message to the same API node is still kind of what I'm stuck on

From what I can see in the proposed spec, the runners pull (poll?) the API nodes for jobs to run when they call /dequeue, and /start doesn't appear to need to land on the same partition (the DB is the arbiter for the semantics). Only /finish needs to be partition-aware, but once a runner is executing a job it can know which partition it came from (the information could be in headers or in a field of the payload of the /dequeue response), so it can embed the same information in the /finish call. This way only one out of four endpoints (probably less than a quarter of the requests) needs to be routed properly - and we still need a solution for this, but at least we get work-stealing... and maybe if there are few requests needing this routing it becomes feasible to have the API nodes do additional work to redirect the request undoing the interference of the intermediate LBs.

hhexo commented 6 years ago

Anyway, going back to the spirit of this ticket, which MQ we use (e.g. kafka) is more of an implementation detail and I guess the main focus is the API (to which we can later add protocol-specific information, like the kafka partition info).

In general terms, the API detailed in your document makes sense. I'll push the minor fixes about the runner endpoints being under v1/ as well, and then coordinate with @msgodf - I guess we can start implementing the endpoints using the current MQ solution.

rdallman commented 6 years ago

I think we need some form of work stealing, even though when you steal work you are "cold" and you have to pull images and start up containers. The question is how significant that is on async functions that the user has started with the expectation that "at some point they will complete". Maybe they don't care about the 100ms of container start, but they do about the few seconds of image pulling. Hm.

for async we aren't really bounding the queued time so I think users care much much less about the docker overhead. we can definitely use this to our advantage. however, I think that on our side of the house we do/should care about this overhead. as you pointed out, one issue is the disk space of caching all images. on a similar note, if we can leverage hot functions for async and shove them through a few nodes faster than, say, eating the startup time and pull time of the image across all machines, then we can also reduce the queued time not only for each route, but across all routes. queues getting backed up for async is probably going to be one of the most common customer complaints (we can wait to hit this bridge, probably, too). I agree we definitely need some kind of work stealing so that we can scale, I do think we need to keep sharding / distribution in mind. perhaps the better answer for now, though, is to punt, so we can move forward, with kafka specifically we have other implementation blockers atm.

the runners pull (poll?)

yea, poll is the intent. will clarify in doc.

and maybe if there are few requests needing this routing it becomes feasible to have the API nodes do additional work to redirect the request undoing the interference of the intermediate LBs.

this is an interesting idea. yea, we could have a little proxy among api nodes. the possibly-really-hokey but imo really-totally-okay is that the response from /dequeue returns the public IP address of the API node from which that message was consumed to use in /finish, the API nodes are going to be exposed to the internet anyway (granted, ideally by way of an LB). as currently proposed, we also need to route runner nodes to the same API node repeatedly in /dequeue, as in theory only one API node has a consumer open for each runner's partition (we could round robin through API nodes until we land in the right spot, but it seems wasteful). similarly, /enqueue needs to route to some API node that has a producer thread open on behalf of a runner's partition, this doesn't have to be the same API node as the consumer, however (doesn't buy us a lot). If we partition with RHS keying, I already smell the 2 API nodes need to not be able to have a producer / consumer open concurrently [I think?], which is pretty unappealing.

rdallman commented 6 years ago

I guess we can start implementing the endpoints using the current MQ solution.

yep. I see 2 concrete and separable tasks that come out of this:

1) implement a data access interface layer for the agent that encapsulates the functionality we need, with one implementation that uses the models.Datastore and models.MessageQueue directly, and the other that uses a yet created runner API client (could omit this until after 2), need to shuffle code around first). the methods on this interface [can/should] map pretty easily to the API spec, Enqueue, Dequeue, Start and End (the latter 2 exist, though maybe should be moved around a little to be next to the other 2). Enqueue seems like we could omit as things are now, but it might be worth pushing down into agent. 2) implement the server API as outlined in the doc above (and if going above and beyond, get the go client bindings over with; we don't need to use swagger here imo, we don't really want users using these apis directly).

I'll take one if y'all want to take the other, you pick ;)

hhexo commented 6 years ago

Do we need @msgodf 's patch to have a flag that separates API and runner nodes? It does touch the agent, but maybe we don't need it to get started.

The data access interface (1.) sounds interesting. I'll start doing that in the hybrid-api branch.

hhexo commented 6 years ago

I'm also thinking that the runner nodes could learn where it's more efficient to /dequeue jobs from... we don't even need complicated ML for this, it should be sufficient to have a simple weighted probability table that is updated based on whether we successfully get an efficiently runnable job. Reinforcement learning, if you like, but the simplest version of it. This has also the advantage that we don't need to worry about nodes moving around or being re-partitioned (the tables will update on their own eventually).

hhexo commented 6 years ago

implement a data access interface layer for the agent that encapsulates the functionality we need

I've tried abstracting this at the data access layer, and I've managed a quick refactor (pushed a commit), but now I'm looking at it and I'm not convinced it's useful at that layer... so I've probably misinterpreted. From what I can see, I think the agent itself probably needs a refactor, like you say we "need to shuffle code around first".

What I envisage is a runnerAgent and an apiAgent (and maybe a fullAgent for the current implementation?), implementing Agent which will have the new methods corresponding to the new endpoints. This actually may supersede part of what @msgodf is doing, at that point the flag to switch mode becomes a decider for which struct is created.

Am I on the right track?

rdallman commented 6 years ago

This actually may supersede part of what @msgodf is doing, at that point the flag to switch mode becomes a decider for which struct is created.

this sounds right, the flag will determine how to configure the agent. I guess agent's responsibility mostly is managing the pool of calls and the interfacing with the mq/db is pretty minimal was the thinking, so we can just shove something into the agent to use to access data when needed vs needing to change the agent at large (the surface area is really just GetCall/Submit). certainly, there are other ways to skin the cat.

what you pushed looks pretty good to me. once we have the runner API client thing, agent.New's signature can maybe look something more like the agent.GetCall where there's 2 ways to construct a data layer for the Agent and the Server can configure the correct one based on flags. but what you have looks good to me. this should slot in with Mark's stuff pretty easily, we can probably slide that in first then fix agent configuration ?

skinowski commented 6 years ago

I think fnlb shouldn't make a load based decision on async requests, it should just round robin them. The state of fn servers at the time of MQ queuing could be very stale.

Since fnlb has the big picture, maybe it could be a proxy to the API servers. fn servers could poll API servers through fnlb, which could act as a switch board to route/intercept the responses to the proper fn server. Maybe this could help with /dequeue and /finish mapping issue (since probably fnlb will be in communication with all API servers.)

hhexo commented 6 years ago

I guess agent's responsibility mostly is managing the pool of calls and the interfacing with the mq/db is pretty minimal what you pushed looks pretty good to me

That's fine, but I was asking myself what will be responsible for handling the dequeue/start/finish API calls in the API-only node. The Agent doesn't surface the interface to do so, so will the Server go straight to the data? Even in that case we'll need some sort of "NoOp agent" for the API-only node... I suppose we'll cross that bridge when we get to it.

hhexo commented 6 years ago

Also, currently we don't have an "update call" method in the datastore, and we'll need it to update its status (from queued to running to end state), so I'll add an implementation of that too, as an atomic thing (it must basically do a CAS). We might want an index on some fields too, to speed up queries.

rdallman commented 6 years ago

That's fine, but I was asking myself what will be responsible for handling the dequeue/start/finish API calls in the API-only node. The Agent doesn't surface the interface to do so, so will the Server go straight to the data?

this is what I was thinking. I should have a final draft today to PR, need the bit to shut things off to test it (or the agent beats me to dequeue ;) -- can see how it looks and go from there. Mark was discussing not starting an agent in the API nodes at all, which seems wise.

Also, currently we don't have an "update call" method in the datastore, ... so I'll add an implementation of that too, ...

thanks. i'll stub this out for now in start/finish. do you want to PR your agent changes or coordinating with merging @msgodf stuff ?

hhexo commented 6 years ago

I'll coordinate with @msgodf. I've added the UpdateCall method and I'm writing tests at the moment, should push things soon to hybrid-api.

[edit: pushed the UpdateCall stuff!]

hhexo commented 6 years ago

I've merged with @msgodf 's stuff in the hybrid-api branch. I could then merge all into hybrid-impl if you want. Shall we work on the same branch? It seems odd to have two.

rdallman commented 6 years ago

sweet. yea, I wouldn't mind merging stuff into master and branching off of that, I don't think any of this is really all that intrusive wrt base functionality but we need to move code around and would be nice to stay up to date with master changes. thoughts? -> PR hybrid-api to master? I don't think order of landing hybrid-impl or hybrid-api to master matters, can rebase similarities out and there's not much new code overlap (if any).

rdallman commented 6 years ago

i pushed a commit to hybrid-api with my changes and opened up a separate https://github.com/fnproject/fn/pull/581 (which is hybrid-api) which is merge-able to master. PTAL.

TODOs I see atm:

[x] ensure possibility to start a *server.Server / binary without specifying MQ/DB/LOG (it would be nice if server.New reflected this, and each of those things were ServerOpts)
[x] probably don't start an agent at all in API nodes? (agent.New could drop DB/MQ/LOG and use opts)
[x] discuss whether API nodes should be able to enqueue async tasks (I think no, if anything just to reduce the surface area and complexity stuff, also keeps all call creation in runners (id gen, caching of apps/routes). just shut down all /r/ on API nodes is really straightforward)
[x] create client bindings for the runner nodes to call runner API, wrap a new DataAccessLayer around this
[ ] examine possibility of letting call update be an idempotent set and having a separate conditional 'update status' ? at least, it seems like we can maneuver around a full call CAS with just the semantics we need. enqueue is insert, start is conditional status switch & add started_at, finish could be idempotent set.
other stuff?

after we get an MVP working with redis mq we should prob get back to investigating kafka (we don't wanna make a redis operator and do that whole dance), seems like we're pretty close.

i'll try to get client stuff done meow

rdallman commented 6 years ago

just tacked on a client to #581, it's kind of off on the side for now too so sliding it in (will work on DataAccessLayer / integration tests with it tomorrow unless someone beats me to it)

rdallman commented 6 years ago

I've managed to get sync tasks running in split mode (yay) in https://github.com/fnproject/fn/tree/hybrid-datarappa branch (based on hybrid-mergy branch), this does a lot of shuffling around of the server config stuff mostly and fixes some of the client wiring bugs. at least, most of the plumbing is done now [I think]. for now, reverted the protocol back to the same semantics we have on master, we can fix this later.

i've run out of time for today, but async isn't working just yet, will merge into hybrid-mergy or master if we merge #581 once I fiddle with async a little on monday. but anyway notifying because it knocks out a bunch of those todos and moves stuff around a fair amount. I don't think #581 breaks anything on master so I left it without this for now in hopes of merging that.

to run split mode tasks on hybrid-datarappa (with a sync task set up):

FN_NODE_TYPE=api FN_LOG_LEVEL=debug ./fn &
FN_NODE_TYPE=runner FN_PORT=8081 FN_RUNNER_URL=http://localhost:8080 ./fn &
curl -d '{"name":"yodawg"}' -v localhost:8081/r/hot-app/hello
echo "ta da"

jan-g commented 6 years ago

(quick note: "When fnrunner starts, it will: - register itself with the fnlb" is effectively done - the fnlb has the ability to discover runner nodes when running under k8s. In a non-k8s deployment, fnlb instances need to be launched with teh --nodes flag or the management interface poked.)

hhexo commented 6 years ago

Good stuff!

The FN_RUNNER_URL new env var in hybrid-datarappa... it doesn't point to a runner node, it points to an API node where we can invoke the /v1/runner endpoints. I can see the reason for the name but it may be a bit confusing. I have no suggestions though (we can't use FN_API_URL as that has another meaning).

I'm happy to merge the hybrid-mergy branch down to master as it does not affect the default case.

rdallman commented 6 years ago

The FN_RUNNER_URL new env var in hybrid-datarappa

agreed, it's weird but the logical FN_API_URL is taken [by the cli, so not exactly]. toyed with FN_RUNNER_API_URL, FN_URL, FN_HYBRID_URL, happy to change it.

merged into master, thanks for reviewing. it should be easier to PR and review stuff against master now. i'll get async working today with the old messaging protocol.

meaty pieces left:

kafka
auth
operationalizing
adding cache to wrap the DataAccessLayer's GetRoute and GetApp methods, currently un-plumbed for runner nodes since it wraps the Datastore atm [which a "runner" does not use]

(the checkboxes on the parent comment are not a good approximation of the work required to get this in, I'm changing them to a list so this doesn't look 'done' in github's UI as everything outlined was pretty basic)

rdallman commented 6 years ago

opened https://github.com/fnproject/fn/pull/585 w/ today's work, I think it's ready. we can likely add /api/server tests of full sync/async tasks in this mode with 2 server.Server active & communicating on top of that work. the other thing to do is moving the cache from the datastore to the dataaccess layer, which shouldn't be too bad and at some point probably we should doc the api & behavior (maybe wait until auth/kafka?). have at it or I will get to it tomorrow.

auth & kafka will be hair balls & require some research/planning, we should divide and conquer. i'm not sure it's worth doing extensive [load] testing / operationalizing until these are finished.

hhexo commented 6 years ago

I have added caching to the hybrid client in #585 but I'm not sure we should "move" the caching from the datastore to there. The datastore cache is still going to be useful in the api node and the "full" standalone node, I guess there are things other than the Agent which use GetApp and GetRoute.

hhexo commented 6 years ago

Actually it turns out the datastore cache wasn't being used by anything else so I have now removed it.

hhexo commented 6 years ago

I propose we change the env var for the runner API to FN_RUNNER_API_URL ( https://github.com/fnproject/fn/pull/592 ).

I'm toying with some helm chart changes to support a hybrid deployment, and setting FN_RUNNER_URL to something-api really looks odd.

hhexo commented 6 years ago

I've just realised a minor issue about hybrid mode that I don't think we've addressed yet.

For a developer using the cli, FN_API_URL will have to point to the API nodes LB address, not the runner nodes LB address. This is fine. However, fn call is a development operation that will have to reach the /r/... endpoints which require using the runners LB address. We'll need a solution for this.

hhexo commented 6 years ago

Hybrid deployment helm chart changes up for discussion in: https://github.com/fnproject/fn-helm/pull/9

hhexo commented 6 years ago

Updated hybrid deployment PR to have two separate charts.

I have a proposal to solve the problems with the split mode and fn call. We could have an FN_CONTROL_API_URL and an FN_WORK_API_URL as environment variables that the cli uses, respectively for control plane URL (/v1/...) and work plane URL (/r/...). However, for backwards compatibility (and for standalone deployments) if FN_API_URL is provided then it will be used for both the control plane URL and the work plane URL. How does that sound?

rdallman commented 6 years ago

sounds like a good plan. I'm not sure of the verbiage we want to use user facing, personally I find work/control plane not immediately very intuitive. I am also skeptical of being in small classes with other kids that didn't behave well as a child, however.

wdyt about re-using FN_API_URL to represent the /v1 endpoints and allowing FN_WORK_API_URL to make /r/ requests against runners (if provided, which overrides FN_API_URL for /r/ only)? I'd toss FN_RUNNER_API_URL around for usage in the CLI, but I guess we already use that one in the runner itself. I think 'runner' is more intuitive than 'work'. maybe FN_GATEWAY_URL (it will likely be a load balancer) ?

fnproject / fn

Hybrid Mode MVP #531

Fn Server

Fn LB

Runners