debloper opened this issue 6 years ago
Thanks. The only thing I didn't understand was:

> Consider `s/Password//g` for `oldPassword` & `newPassword` keys

Explain?
@debloper Detailed response follows. Look out for `ACK`, `NACK`, and `Question`. The ones marked `Question` need an answer:
> Make success response structure as `{ message: string, data: object|null }`

ACK. I like `payload`, but this works as well. Will do.
> Make error response structure as `{ message: string, error: { code: number, details: string|null } }`

ACK. Okay, I see that you like nesting.
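For concreteness, here is a sketch of the two envelopes as TypeScript types (the example values and the generic parameter are illustrative only):

```ts
// Success envelope: `data` carries the payload, or null when there is none.
interface SuccessResponse<T> {
  message: string;
  data: T | null;
}

// Error envelope: numeric code plus optional human-readable details.
interface ErrorResponse {
  message: string;
  error: {
    code: number;
    details: string | null;
  };
}

// Illustrative instances
const ok: SuccessResponse<{ id: string }> = {
  message: "User created",
  data: { id: "42" },
};

const fail: ErrorResponse = {
  message: "Validation failed",
  error: { code: 422, details: "password is required" },
};
```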
> Ensure API endpoints do not require trailing slash

ACK. Already on the bucket list. The current trailing slash was more of a by-product of a quick implementation.
> Rename `/accounts/` endpoints to `/user`

ACK. Makes sense. To this day I don't know why I used `/accounts/` in the HTTP API. Possibly an artifact of a Postgres naming collision.
> Create user endpoint to be `/user/create`

Question: I assume you are suggesting `POST /user/create`. Why not `POST /user/`? Implementing a `GET /user` is also on my cards, which authenticated users can call to get account/profile-level information in one call.
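For illustration, a rough sketch of the routes being discussed, assuming an Express-style router (Express, the handler bodies, and the response payloads are all assumptions; whether creation lives at `/user` or `/user/create` is exactly the open question):

```ts
import express from "express";

const app = express();
app.use(express.json());

// Open question: POST /user vs POST /user/create for account creation.
app.post("/user", (req, res) => {
  // create the account from req.body (placeholder)
  res.status(201).json({ message: "User created", data: null });
});

// Planned: authenticated users fetch account/profile info in one call.
app.get("/user", (req, res) => {
  res.json({ message: "OK", data: { /* account + profile fields */ } });
});
```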
> Expect sent password to be already hashed once from client-side

NACK. I'm using `bcrypt` to hash it server-side. If you are worried about transmitting passwords as-is, we already use TLS for all external communication. Client-side hashing would be overkill: you don't gain any security edge over transmitting the password as-is over TLS, and you would need to hash the value with the same salt every time or the hashes won't match. That means either a shared key as the salt or some heavier mechanism. In either case, client-side password hashing makes no sense when using TLS. If your TLS layer is compromised, then well, you have bigger issues.
> Do not impose any character/length restriction on the password

ACK. Is a minimum of 8 characters too tough for people? I will still want to restrict it to at least 4. Even a foetus can remember a 4-character password.
> Unless other strategies are used before Beta, get rid of `/local` from auth endpoint

ACK. I had plans to let people log in through OAuth2 and OpenID providers, but I will put that on the back burner.
> Turn change password request into a PATCH instead of POST

Question: Are you suggesting that all "data update" endpoints use `PATCH` instead of `POST`?
> Consider `s/Password//g` for `oldPassword` & `newPassword` keys

Question: Does this mean that you don't want to ask the user to provide the old password to update to a new password? Works for me.
> Update `/streams/` endpoint by abstracting away extra steps
> Upon creation of a user, rig should already allocate blank stream-config for the user

ACK.
> No need to support the possibility of multiple stream configs for single user for the time being

ACK. I will not remove the code that's there, because I see value for this in the near future. Instead, I'll map the API to respond with the `default` stream for the current user.
> Therefore, stream config `id` is not needed (`streamKey` (mutable) should be tied to `userID`)

ACK. Same as above.
> `provisionStatus: true` upon addition of first ingestion endpoint
> `provisionStatus: false` upon removal of last ingestion endpoint (or manual deprovision)

This part is not implemented yet, though it is not a boolean value. `provisionStatus` was supposed to contain information on where the stream has been provisioned, including server and region info. So it will only be set once a stream has been provisioned.
> Make it clearer how `isStreaming` & `isActive` add any useful value to stream config

`isStreaming` is set to true if the user is currently streaming. `isActive` is the equivalent of the behaviour you asked for with `provisionStatus` above. If a stream config has `isActive` set to false (for various reasons), then edge servers don't allow that stream to pass through.
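Putting the last two answers together, a rough sketch of the stream config shape being described (everything other than `provisionStatus`, `isStreaming`, and `isActive` is an assumption about field names, not the actual schema):

```ts
interface StreamConfig {
  streamKey: string;          // tied to the user, mutable
  userID: string;
  // Set only once the stream has been provisioned somewhere;
  // carries server and region info rather than a boolean.
  provisionStatus: {
    server: string;
    region: string;
  } | null;
  isStreaming: boolean;       // true while the user is live
  isActive: boolean;          // edge servers reject the stream when false
  ingestions: string[];       // destination ingest URLs
}
```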
> Preferably remove the add ingestion endpoint and use update ingestion (PATCH) endpoint
> Remove `/streams/{streamID}/destination`
> Add `/streams/{streamKey}/` as a PATCH request which accepts list of ingestions

ACK, but... (see next)
> Do not expect stream-key to be mandatory information
> Expect ingestion URLs as full URL and not by service name and stream-key

While I can parse the service identity from the given URL, there are two reasons for the current service-based design:

1. rig will be able to choose which ingestion server to use for a known service depending on which of our edge servers is receiving the stream.
2. Fetching stream key through OAuth access is one of the quick ways to set things up for a user and something that is on the cards.
> Expect frontend to handle forming the full stream URL (`rtmpUrl`) and send as a list

While possible, it's not necessary. Rig knows which servers are available for which service. All we should ask from the user is their stream key for that service; the routing decision is best left to rig.

Now, there can be a separate option for a "Custom" destination, where the user can provide a full URL for services that we don't yet support.
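To make that concrete, here is a hypothetical `PATCH /streams/{streamKey}/` body under the service-based design, with a separate custom entry carrying a full URL (the field names are illustrative, not a settled schema):

```ts
// PATCH /streams/{streamKey}/ -- illustrative request body only.
const body = {
  ingestions: [
    // Known services: the user supplies only their stream key;
    // rig picks the ingest server for that service.
    { service: "twitch", streamKey: "live_123456_abcdef" },
    { service: "youtube", streamKey: "abcd-efgh-ijkl-mnop" },
    // Custom destination: full RTMP URL for services rig doesn't know about.
    { service: "custom", rtmpUrl: "rtmp://ingest.example.com/live/my-key" },
  ],
};
```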
> Update stream details endpoint from `/streams/{streamID}` to `/streams/{streamKey}/`

ACK.
> List all stream endpoint may not be required for API v1 (co-stream feature)

This doesn't need changing. If all we are allowing is one stream config per user, then it is more future-proof to send the stream config as an array with one item, like `GET /streams` does now. The dashboard can call `GET /streams` and use the information from the first item until then (needs to be handled in the dashboard's `APIClient`).
We'll stick with runtime provisioning of streams. This does not change any of the requested user-facing features, but it makes way for a failsafe and flexible media infrastructure. It works equally well when we have one server and when we have thousands of servers. A quick glance at the two approaches:
Runtime provisioning: Agents running on edge servers allocate resources when the edge servers receive a new stream. On a new stream publish, agents ask rig to send them the configuration for the given stream key and spin up a media worker (currently nginx + nginx-rtmp) process on the same machine. Then, when the stream is unpublished, agents clean up the worker. Streams start flowing within 500-600 ms of publishing. (A rough sketch of this flow follows after these two descriptions.)
Pre-provisioning/ahead-of-time provisioning: When the user selects a server, rig instructs the agent on the selected server to spin up a media worker process with the stream's configuration. Agents keep the worker processes running until rig asks them to de-provision/remove the configuration. Here agents still need to keep track of the available configs and locally relay published streams to the available configurations.
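To make the runtime-provisioning flow concrete, here is a minimal sketch of the agent side, assuming nginx-rtmp's `on_publish`/`on_publish_done` HTTP callbacks are pointed at a small local agent that asks rig for the stream's config. The rig URL, paths, and port are assumptions, not the actual interface:

```ts
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false })); // nginx-rtmp posts form-encoded data

// nginx-rtmp `on_publish` callback: a new stream hit this edge server.
app.post("/on-publish", async (req, res) => {
  const streamKey = req.body.name; // stream key as sent by nginx-rtmp
  // Ask rig for this stream's configuration (URL is an assumption).
  const rigRes = await fetch(`https://rig.example.com/streams/${streamKey}/`);
  if (!rigRes.ok) return res.status(403).end(); // unknown/inactive stream: reject the publish
  const config = await rigRes.json();
  // ...spin up / configure the local media worker from `config` here...
  res.status(200).end(); // a 2xx lets nginx-rtmp accept the publish
});

// `on_publish_done` callback: stream ended, clean up the worker.
app.post("/on-publish-done", (req, res) => {
  // ...tear down the worker for req.body.name...
  res.status(200).end();
});

app.listen(8080);
```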
I will detail this separately over the course of this month (detailed architecture docs and so on). For now, the gist of the reasons for this approach is below:
Agents will need to have bi-directional communication with rig. They will have to send stream events, stats, and errors as they happen. So enforcing one-directional communication from rig->agents unnecessarily limits the scope. There is nothing unmanageable or unsafe about bi-directional communication in media services as long as it happens over clearly defined events and interfaces. This will enable us to make decisions on the edge at runtime.
Ahead-of-time provisioning to a specific server becomes more error-prone and complex, and thus less failsafe and scalable. Rig has to act as an infrastructure manager and keep track of when servers fail. If any server fails, rig then has to ensure that the configurations allocated to that server are copied over to whatever new one takes its place. We will be using a mix of dedicated cloud and bare-metal servers, so it is not always possible to spin up a machine on the fly. Runtime provisioning is far more elegant: if one machine fails, we can bring another machine up to take its place. Couple this with failover IPs and anycast and you get a zero-bottleneck media infrastructure that can work across providers. That is the job of a separate infrastructure monitoring unit. While pre-provisioning arguably works with 10 users during early beta, this pre-allocation quickly becomes troublesome the moment you have to scale beyond 50-60 users, and it goes out of hand the moment you have 3 or 4 servers in the same region. Instead, it is more secure, flexible, failsafe, and scalable to have agents on edge servers get the config from rig at runtime. These are standard media-serving practices.
If we need to add limitations, e.g., ensuring a user can only connect to a specific cluster, or that a user should be able to stream for x minutes only, then that can be smoothly handled at the API layer. So, as I said, nothing changes from the user perspective or from the requirements. Instead, we get a future-proof setup where, even if the media server changes to something other than nginx-rtmp, the switch happens smoothly and the user experience stays the same.
> Why not `POST /user/`?

👍
> client-side password hashing makes no sense when using TLS. If your TLS layer is compromised, then well, you have bigger issues.

Understood, but the key point here is: do not *expect* the password to be as-is. If I send the API a similarly hashed password every time (same as at account creation), it should work.
> I will still want to restrict it to at least 4.

Make it 1 and we gucci. If someone has a weak password that can be easily guessed/bruteforced, that's none of our concern; it doesn't affect the system. 🔢💯💯 is arguably a lot stronger password than most 8-character normal/special text passwords.
> Does this mean that you don't want to ask the user to provide the old password to update to a new password?

No. I am just suggesting making them `old` and `new`. The endpoint as well as the payload make it clear that it's a password.
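So the request body would look something like this (the endpoint path and the header handling are assumptions; only the `old`/`new` keys follow the suggestion):

```ts
// Hypothetical change-password call: PATCH per the earlier suggestion,
// with the shortened key names. Path and values are illustrative.
await fetch("/user/password", {
  method: "PATCH",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ old: "current-password", new: "next-password" }),
});
```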
> This (`provisionStatus`) includes server and region info.

Already exposed through the ingest URL.
> rig will be able to choose which ingestion server to use for a known service depending on which of our edge servers is receiving the stream.

No need, as the user explicitly sets the ingest URLs for the services, and the services (Twitch, YouTube) do their own magic to achieve this. We don't need to add overhead here.
> Fetching stream key through OAuth access is one of the quick ways to set things up for a user and something that is on the cards.

That has nothing to do with the suggestion for that particular API. That API (which can be reused for the purpose you're explaining) doesn't need to be aware of the service names/types/properties. Even invalid/defunct URLs should go through, as long as they match the ingest URL shape (`/rtmp:\/\/.+\..+\/.*/`).
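For illustration, that shape-only check could look like this (the pattern is the one quoted above, written as a JavaScript regex literal with the slashes escaped):

```ts
// Accept anything that merely looks like an RTMP ingest URL;
// no knowledge of service names, and even defunct URLs pass.
const INGEST_URL_PATTERN = /rtmp:\/\/.+\..+\/.*/;

function isValidIngestUrl(url: string): boolean {
  return INGEST_URL_PATTERN.test(url);
}

// Examples
isValidIngestUrl("rtmp://live.twitch.tv/app/stream-key"); // true
isValidIngestUrl("not-a-url");                            // false
```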
> Rig knows which servers are available for which service.

SMH... Just why!?!
> We'll stick with runtime provisioning of streams.

This makes me feel like 1hr+ phone conversations are never fruitful, so I'd better use this medium of communication to make my point.
Very abstract reason: we don't need this now. xplex multistream is not equipped to deal with a zombie apocalypse, and that's fine.

Less-than-30000ft view reason(s):

- Even a horrendous multi-streaming kablooie means we'll have about 100-200 regular users (requiring static infra/ingest and long stream durations), and a few hundred (maybe a thousand) volatile users (who don't require static infra/ingest, mostly short-duration streams). So the scale where your implementation will generate significant comparative results is not there.
- With pre-provisioning, the API can say "no can do, 2bz4strim" when it runs out of resources, before giving them the ingest URL. We can go on twitter/reddit and tell users that we're sorry, that we couldn't predict this much love, and that we're preparing to scale up in a few hours.
- With runtime provisioning, users only find out when they hit `rtmp://`. If one user stops streaming, another wins a lottery and goes through - but that reminds me of IRCTC. Users get frustrated, they post on twitter/reddit, and we have to react to it.
- Bi-directional communication is fine as long as rig initiates it (rig tells the agent "provision this", the agent tells it "hang on... hang on... done!"). My point was, the agent shouldn't proactively start comms - it should do its job as a henchman of rig.

To wrap up: if we take the problem and make it a bundle of multiple (deterministic) P problems, it's way easier to handle (even at a conceptual level) than treating the whole thing as one (non-deterministic) NP problem and adding innovative abstractions in order to solve it. Don't misunderstand this as me undermining the solutions you've come up with; I'm very sure that if we get the wind, we'll very much need them (at least as learning/foresight). It's part of the reason we'll go a long way - the level of abstraction/meta which we can handle and make sense out of.
My point is, we have the resources for a limited amount of time, and they can serve a limited number of users. Let's build accordingly. When we have the resources and backing to build for an even larger scale, I am sure we can do it then, and given how modular our system is, it'll never be problematic to integrate it.
4WH about the volatile/non-volatile user-base is way beyond the scope of this discussion, so skipping it for now. If you're having trouble understanding my point, ask @pawzoned to read this part and translate that to kiddo-speak.
I'll keep this short without going into detail else the chain will continue.
Your statements and approach are full of assumptions. In some cases you have confused assumptions with reality. In other cases you have made too light of things that are actually difficult. Your assumptions are not validated. This is normal: you have not worked at the ground level, so you are not aware of where things fail and will fail.
I, with significant help from Jai, have written and tweaked the current system from the ground up each time an assumption failed. We now have something that works for all our potential scenarios. It can ensure even spread, dedicated allocations, and partner/non-partner access, and it can work with 10 users or 100K users. Every business requirement can be implemented on top of the current system.
So, we have two ways we can proceed with this: a) you accept that I have built what I have built after enough testing, go ahead with the current system, and worry about minor changes later on (prioritize delivery over opinion); or b) you build this layer yourself and validate your assumptions (prioritize opinion over delivery).
If you choose (a), I will work on the rest of the business layer and the infra. If you choose (b), I'll work on the frontend. In either case, our priority should be to get as many people to use this as we can.
This is a B2C2B game. Earning from the final B layer depends on the size and success of the C layer. It's more work than it sounds. We should not aim to build something that will serve only 100 people, even though we may currently have the capacity to serve only 100.
I know this system works. So let's focus on the part where we take this out to the masses and identify and test our assumptions.
(a)
Is that incomplete or intended? :open_mouth: