DzhideX closed this issue 2 years ago
There are a few key ongoing computational tasks in this system.
It is not clear how you are organizing all of this processing. In principle, all of this can be done independently, and probably should if there is going to be any hope of servicing even a modest number of channels and content. But if you are going to have independent processes that are going to be doing different parts of this work, how are they organized and coordinated?
The user owns their content, so when they authorize us to download it for them, it's effectively them recovering their own content for their own purposes, which is something they have the right to do.
This is a really good question. I gave it a bit of thought and this is what I came up with:
1. I think my idea from above for this is really solid. We'll have a process running on an interval, querying YouTube for any newly uploaded videos that we aren't already aware of. This process will only update the database.
2. (3. & 4.) Initially, I wasn't really sure how to deal with these computations, but I have an idea now and you can let me know whether you think it works. I was thinking of combining steps 2, 3, and 4 into another process (independent from the first) that will also run on an interval (though I think this one should be shorter). As for why I think we should combine these steps, this is my reasoning:
If we download all assets from every user right after adding them to the db, especially in the case of many users, we could be paying an unnecessary amount of money just to keep assets without even knowing whether a user can upload all of them (example: a user has 20 videos × 3 GB each but only 10 GB of allowed storage on the platform, so the rest of those videos would sit in storage doing nothing). If it's all in one process, we can go one by one and make sure we only hold one asset at a time.
The reason I think it should be on an interval as well is a scenario like this: we have the same user from before, who can only fit three videos due to the 10 GB limit. He asks the storage lead for more space, and the storage lead grants it. This could happen to hundreds of different people at many different times during the day. By running on an interval, we can check storage every (say) 2 hours and upload as many videos as storage allows.
So the whole process would be: download assets for one video from YT to Amazon S3, publish that video to Joystream with the correct transactions and upload its assets from S3 to Joystream, then delete that video from S3 and start downloading the next one. This process would go on as long as storage allows.
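This one-asset-at-a-time loop can be sketched roughly as below. All service calls (YouTube download, S3, Joystream publishing) are hypothetical stubs here, injected as dependencies; the point is only the sequential control flow and the storage check before each video:

```typescript
// Sketch of the per-video synch loop described above. The interfaces and
// function names are illustrative assumptions, not an existing API.

interface Video { id: string; sizeGb: number; }

interface SyncDeps {
  remainingStorageGb: (userId: string) => number; // space left on the platform
  downloadToS3: (video: Video) => void;           // YT -> S3 staging
  publishToJoystream: (video: Video) => void;     // transactions + asset upload
  deleteFromS3: (video: Video) => void;           // free the staging storage
}

// Synchs pending videos one by one until storage runs out; returns the ids
// of the videos that were synched in this interval run.
function syncPendingVideos(userId: string, pending: Video[], deps: SyncDeps): string[] {
  const synched: string[] = [];
  for (const video of pending) {
    // Stop as soon as the user's remaining storage can't fit the next video.
    if (deps.remainingStorageGb(userId) < video.sizeGb) break;
    deps.downloadToS3(video);
    deps.publishToJoystream(video);
    deps.deleteFromS3(video); // only one asset is ever held in S3 at a time
    synched.push(video.id);
  }
  return synched;
}
```

With the 20-videos/10 GB example from above, this loop would publish three 3 GB videos and then stop until the next interval run finds more space.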
The only thing I'm wondering about in this case: what if a user wants to upload a new video but can't, because the synching service keeps filling up the storage with videos from YouTube?
3. (5) Wrt the last one, I'm not sure I quite understood this part?
To recap and add a little more info about the main `/user` [POST] endpoint:
After learning some new things, I've decided to change my current approach a little when it comes to the architecture. With the approach described before, we would essentially host our own server on AWS infrastructure running 24/7, very similar to a traditional setup. This is not very aligned with the pay-as-you-go/use promise that cloud usually affords us. It also comes with some other unnecessary complexities that this new approach would abstract away for us.
The new idea is to use AWS Lambda. This is essentially "functions as a service", where one only pays for the functionality one actually uses. It will also be substantially easier to implement, as a lot of the configuration and general complexity that comes with building a REST API from scratch is handled by AWS for us. The idea is to (most probably) have 3 Lambda functions: put 1 of them behind an API Gateway trigger (think regular API) and have the other 2 triggered on a timer.
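Under that split, the three handlers could be shaped roughly as follows. The event/result types are simplified stand-ins for the real AWS ones, and all names and behaviors are assumptions for illustration:

```typescript
// Simplified stand-ins for AWS's API Gateway event/result types.
interface ApiGatewayEvent { body: string; }
interface ApiGatewayResult { statusCode: number; body: string; }

// 1. Behind an API Gateway trigger: the /user [POST] endpoint.
async function createUserHandler(event: ApiGatewayEvent): Promise<ApiGatewayResult> {
  const { youtubeChannelId, joystreamChannelId } = JSON.parse(event.body);
  if (!youtubeChannelId || !joystreamChannelId) {
    return { statusCode: 400, body: JSON.stringify({ error: "missing channel ids" }) };
  }
  // ...verify ownership of both channels, then persist the user...
  return { statusCode: 201, body: JSON.stringify({ youtubeChannelId, joystreamChannelId }) };
}

// 2. On a long timer (e.g. daily): poll YouTube for new uploads; DB writes only.
async function pollNewVideosHandler(): Promise<void> {
  // ...query YouTube for videos we aren't already aware of, insert into DB...
}

// 3. On a shorter timer (e.g. every 2 hours): synch pending videos within storage limits.
async function syncVideosHandler(): Promise<void> {
  // ...per-video download/publish loop, bounded by remaining storage...
}
```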
I've also taken a look at the code in lbryio/yt-synch (which essentially does the same thing we're trying to do here) and was able to get some takeaways from it as well:
We don't really need Amazon S3 (storage) at all as I initially imagined, but that does come with some caveats. LBRY/yt-synch downloads a video from YouTube, keeps it in memory, and uploads it to their platform, all as part of one process. I was initially planning on doing the same thing, but am not 100% sure how that will work with Lambdas, as these functions have a max timeout of 15 minutes and 10 GB of allowed memory usage. From my POV this shouldn't be a bottleneck, but if it turns out to be, we should probably revert to the previous approach.
The way they synch is that they add videos to a queue and assign them to workers (which enables concurrency [though for them the default max worker count is 1]). With our current approach this shouldn't really present a problem, as we can run 3000 (possibly more) concurrent Lambda function executions. If that doesn't work out for us, we would probably need to either completely revert to the previous approach or adopt some hybrid of the two.
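The queue-plus-workers pattern can be sketched generically as below; this is my reading of the pattern rather than their actual code. Each "task" would be one video's download-and-upload, and `maxWorkers` bounds how many run at once (1 for their default, potentially thousands with Lambda):

```typescript
// Generic bounded-concurrency worker pool over a task queue. Workers pull
// the next unclaimed task until the queue is empty; results keep queue order.
async function runWithConcurrency<T>(
  tasks: (() => Promise<T>)[],
  maxWorkers: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // single-threaded JS, so claiming the index is safe
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(maxWorkers, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```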
The reason I want to use Lambda functions this much, despite some of the possible problems we may have, is the speed and scope at which we should be able to process all this data. After trying to synch my YouTube channel on odysee.com, being greeted by "Please check back later. This may take up to 1 week." does not inspire faith. I may be missing something, and this may be an unavoidable fact of working on something like this, but I'd like to cross that bridge when I get there.
I think the best course for now would be to start working based on all of this information and then adapt the implementation as necessary and as the situation warrants.
We don't really need Amazon S3 (storage) at all as I initially imagined, but that does come with some caveats. LBRY/yt-synch downloads a video from YouTube, keeps it in memory, and uploads it to their platform, all as part of one process.
This would be awesome if true, saves a lot of extra pain. Less state, less pain.
The reason I want to use Lambda functions this much, despite some of the possible problems we may have
What problems will Lambdas possibly cause?
Will Lambdas bind us to AWS? I don't love that :(
What problems will Lambdas possibly cause? Will Lambdas bind us to AWS? I don't love that :(
It wouldn't really bind us to AWS, as the Lambdas would essentially just be functions extracted from a Node.js server, meaning it would be just as easy to put them back into one. That being said, Lambdas indeed don't seem to be the right tool for the job. The 15-minute timeout and 10 GB max memory usage, for example, would force us to make many requests concurrently, which would not only drive up the price but also, under heavy load, probably cause problems keeping up with all of it. This type of task is indeed much better suited to a synchronous approach.
Update on processing:
Due to my "recent" finding that DynamoDB doesn't allow more than 400 KB per put call (this can be worked around), I started wondering about something. Hope you @bedeho can shine some light on it:
400 KB, with the data I currently store, corresponds to more than 550 videos. Is it even realistic to assume we want to synch this many videos, considering the 10 GB limit? In that light, do we want to limit this number somehow?
Also, wrt the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore, and then continue synching once there is more space. The problem with this approach: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of, and I would love some more detail on this!
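One common way to work around the 400 KB item limit mentioned above is to store one item per video under a composite key, instead of one big item holding a user's whole video list. The key shape below is an assumption for illustration, not the existing schema; with the real SDK, reading a user's videos would be a Query on the partition key, which the in-memory filter stands in for here:

```typescript
// One DynamoDB item per video, keyed so all of a user's videos share a
// partition key. This sidesteps the 400 KB per-item limit entirely.
interface VideoItem {
  PK: string;        // partition key, e.g. "USER#<userId>"
  SK: string;        // sort key, e.g. "VIDEO#<videoId>"
  title: string;
  durationS: number; // illustrative payload fields
}

function toVideoItem(userId: string, videoId: string, title: string, durationS: number): VideoItem {
  return { PK: `USER#${userId}`, SK: `VIDEO#${videoId}`, title, durationS };
}

// Stand-in for a Query on PK = "USER#<userId>" against the real table.
function videosForUser(items: VideoItem[], userId: string): VideoItem[] {
  return items.filter((it) => it.PK === `USER#${userId}`);
}
```

Each put then stays far below 400 KB regardless of how many videos a channel has, which also makes the 550-video figure a non-issue.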
400 KB, with the data I currently store, corresponds to more than 550 videos. Is it even realistic to assume we want to synch this many videos, considering the 10 GB limit? In that light, do we want to limit this number somehow?
I don't really understand the context of this question. If you are asking whether there could be any time when the infrastructure is more than 550 videos behind in synchronizing Joystream with YouTube, then that is certainly true. Even with the 10 GB limit, which is in no way immutable, if 5 people sign up within a small period of time, each with > 110 videos on average, that could easily happen. You could also easily get into situations where some channels have many small videos, or where the infrastructure lags far behind, for example due to some sort of crash, and is thousands of videos behind when catching up.
Not sure if any of this applies.
Also, wrt the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore, and then continue synching once there is more space. The problem with this approach: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of, and I would love some more detail on this!
This question is also a bit unclear, but if I understand you correctly, then yes, the synching infrastructure can end up competing with manual uploads to a given channel. This simply means that the synchronization infrastructure needs to be aware of upload limits and stop trying to synchronize a channel which has no more space, or too little effective space left.
Please clarify if this is not what you had in mind.
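One possible policy for the competition between synching and manual uploads described here: only synch the next video if doing so still leaves a configurable reserve of space for the user's own uploads. The reserve idea and the function below are a suggestion, not something decided in this thread:

```typescript
// Decide whether the synching infrastructure should take the next video,
// leaving a headroom reserve for the channel owner's manual uploads.
function shouldSyncNext(
  remainingBytes: number,          // effective space left on the channel
  nextVideoBytes: number,          // size of the next video in the queue
  manualUploadReserveBytes: number // headroom kept free for manual uploads
): boolean {
  return remainingBytes - nextVideoBytes >= manualUploadReserveBytes;
}
```

With, say, a 2 GB reserve, a channel with 10 GB free would still synch a 3 GB video, but a channel with 4 GB free would not, keeping room for the owner's own uploads.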
Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YT-ers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the latest 300-400 videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)
The second part answers my question :+1:
Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YT-ers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the latest 300-400 videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)
Having a limit could make sense, and the ability to selectively change limits for individual channels manually could also be useful. We should also be able to pause and unpause synching for a given channel, so per-channel settings seem inevitable.
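The per-channel settings this implies could look roughly like the sketch below: a per-channel synch limit (changeable manually) plus a pause flag. Field and function names are hypothetical:

```typescript
// Per-channel synch settings: a manually adjustable limit and a pause flag.
interface ChannelSyncSettings {
  channelId: string;
  paused: boolean;
  maxVideos: number | null; // null = no limit on synched videos
}

// How many of the pending videos the synch process may take for this channel.
function videosToSync(
  settings: ChannelSyncSettings,
  alreadySynched: number,
  pendingCount: number
): number {
  if (settings.paused) return 0;
  if (settings.maxVideos === null) return pendingCount;
  return Math.max(0, Math.min(pendingCount, settings.maxVideos - alreadySynched));
}
```

Raising a channel's limit is then just an update to its `maxVideos`, and pausing/unpausing is a single flag flip, with no change to the synch loop itself.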
Backend:
I've found a guide on how to run Node.js code on AWS during my research, and I think the diagram found there [picture below] is pretty much exactly along the lines of what I was thinking of (with small changes).
Functionality:
Final notes:
Frontend:
Final notes:
General steps
I've tried to make a diagram for this, but it was hard not to make it confusing, as one part of this is from the user's perspective and one from the perspective of the underlying system. After fleshing this out further I think a diagram can be made, but currently what happens from the user's perspective after logging in through the form is largely unknown (from my POV). Steps (simplified):
1. Atlas channel owner opens the web application.
2. User needs to prove he owns channels on YouTube and Atlas.
3. Get all necessary data and create the user in the database.
4. Start the synching process. This means a video needs to be downloaded (from YT) and uploaded (to Atlas). This should be done one by one, both to save storage space and due to things like space constraints on Atlas and error logging.
5. The system should every so often (perhaps once a day) go through all users, check if there were any new updates, and add any new videos to the system.
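The steps above could be sketched as a per-user state progression, which might also be a starting point for the diagram. The state and event names are assumptions for illustration:

```typescript
// Per-user states through the simplified flow: prove ownership (steps 1-2),
// get registered (step 3), synch (step 4), and re-enter synching when the
// daily check (step 5) finds new uploads.
type UserState = "awaiting-proof" | "registered" | "synching" | "up-to-date";
type UserEvent = "proved-ownership" | "sync-started" | "sync-finished" | "new-videos-found";

function nextState(state: UserState, event: UserEvent): UserState {
  switch (state) {
    case "awaiting-proof":
      return event === "proved-ownership" ? "registered" : state;
    case "registered":
      return event === "sync-started" ? "synching" : state;
    case "synching":
      return event === "sync-finished" ? "up-to-date" : state;
    case "up-to-date":
      // the periodic check found new uploads, so the user needs synching again
      return event === "new-videos-found" ? "registered" : state;
  }
}
```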
Finally: I think it may make the most sense to start from the frontend, as that would let us implement only the most important features in the backend for now and therefore shorten the time to completion of the MVP.