DzhideX closed this issue 2 years ago
There are a few key ongoing computational tasks in this system.
It is not clear how you are organizing all of this processing. In principle, all of this can be done independently, and probably should if there is going to be any hope of servicing even a modest number of channels and content. But if you are going to have independent processes that are going to be doing different parts of this work, how are they organized and coordinated?
The user owns their content, so when they authorize us to download it for them, it's effectively them recovering their own content for their own purposes, which is something they have the right to do.
This is a really good question. I gave it a bit of thought and this is what I came up with:
1. I think my idea from above for this is really solid. We'll have a process running on an interval, querying YouTube for any newly uploaded videos that we aren't already aware of. This process will only update the database.
2. (3. & 4.) Initially, I wasn't really sure how to deal with these computations, but I have an idea now and you can let me know whether you think it works. I was thinking of combining steps 2, 3, and 4 into another process (independent from the first) that will also run on an interval (though I think this one should be shorter). As for why I think we should combine these steps, this is my reasoning:
If we download all assets from every user right after adding them to the db, especially in the case of many users, we could be paying an unnecessary amount of money just to keep assets without even knowing whether a user can upload all of them (example: a user has 20 videos × 3 GB each but only 10 GB of allowed storage on the platform, so the rest of those videos would sit in storage doing nothing). If it's all in one process, we can go one by one and make sure we only hold one asset at a time.
The reason I think it should be on an interval as well is a scenario like this: we have the same user from before, who can only fit three videos due to the 10 GB limit. He asks the storage lead for more space, and the storage lead grants it. This could happen to hundreds of different people at many different times during the day. By running on an interval, we can check storage every (say) 2 hours and upload as many videos as storage allows.
So the whole process would be: download assets for one video from YT to Amazon S3, publish that video to Joystream with the correct transactions and upload its assets from S3 to Joystream, then delete that video from S3 and start downloading the next one. This process would go on as long as storage allows.
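This one-asset-at-a-time loop can be sketched roughly as below. All service calls (YouTube download, S3, Joystream publishing) are hypothetical stubs here, injected as dependencies; the point is only the sequential control flow and the storage check before each video:

```typescript
// Sketch of the per-video synch loop described above. The interfaces and
// function names are illustrative assumptions, not an existing API.

interface Video { id: string; sizeGb: number; }

interface SyncDeps {
  remainingStorageGb: (userId: string) => number; // space left on the platform
  downloadToS3: (video: Video) => void;           // YT -> S3 staging
  publishToJoystream: (video: Video) => void;     // transactions + asset upload
  deleteFromS3: (video: Video) => void;           // free the staging storage
}

// Synchs pending videos one by one until storage runs out; returns the ids
// of the videos that were synched in this interval run.
function syncPendingVideos(userId: string, pending: Video[], deps: SyncDeps): string[] {
  const synched: string[] = [];
  for (const video of pending) {
    // Stop as soon as the user's remaining storage can't fit the next video.
    if (deps.remainingStorageGb(userId) < video.sizeGb) break;
    deps.downloadToS3(video);
    deps.publishToJoystream(video);
    deps.deleteFromS3(video); // only one asset is ever held in S3 at a time
    synched.push(video.id);
  }
  return synched;
}
```

With the 20-videos/10 GB example from above, this loop would publish three 3 GB videos and then stop until the next interval run finds more space.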
The only thing I'm wondering about in this case: what if a user wants to upload a new video but can't, because the synching service keeps filling up the storage with videos from YouTube?
3. (5) Wrt the last one, I'm not sure I quite understood this part?
To recap and add a little more info about the main `/user` [POST] endpoint:
After learning some new things, I've decided to change my current approach a little when it comes to the architecture. With the approach described before, we would essentially host our own server on AWS infrastructure running 24/7, very similar to a traditional setup. This is not very aligned with the pay-as-you-go/use promise that cloud usually affords us. It also comes with some other unnecessary complexities that this new approach would abstract away for us.
The new idea is to use AWS Lambda. This is essentially "functions as a service", where one only pays for the functionality one actually uses. It will also be substantially easier to implement, as a lot of the configuration and general complexity that comes with building a REST API from scratch is handled by AWS for us. The idea is to (most probably) have 3 Lambda functions: put 1 of them behind an API Gateway trigger (think regular API) and have the other 2 triggered on a timer.
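Under that split, the three handlers could be shaped roughly as follows. The event/result types are simplified stand-ins for the real AWS ones, and all names and behaviors are assumptions for illustration:

```typescript
// Simplified stand-ins for AWS's API Gateway event/result types.
interface ApiGatewayEvent { body: string; }
interface ApiGatewayResult { statusCode: number; body: string; }

// 1. Behind an API Gateway trigger: the /user [POST] endpoint.
async function createUserHandler(event: ApiGatewayEvent): Promise<ApiGatewayResult> {
  const { youtubeChannelId, joystreamChannelId } = JSON.parse(event.body);
  if (!youtubeChannelId || !joystreamChannelId) {
    return { statusCode: 400, body: JSON.stringify({ error: "missing channel ids" }) };
  }
  // ...verify ownership of both channels, then persist the user...
  return { statusCode: 201, body: JSON.stringify({ youtubeChannelId, joystreamChannelId }) };
}

// 2. On a long timer (e.g. daily): poll YouTube for new uploads; DB writes only.
async function pollNewVideosHandler(): Promise<void> {
  // ...query YouTube for videos we aren't already aware of, insert into DB...
}

// 3. On a shorter timer (e.g. every 2 hours): synch pending videos within storage limits.
async function syncVideosHandler(): Promise<void> {
  // ...per-video download/publish loop, bounded by remaining storage...
}
```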
I've also taken a look at the code in lbryio/yt-synch (which essentially does the same thing we're trying to do here) and was able to get some takeaways from it as well:
We don't really need Amazon S3 (storage) at all as I initially imagined, but that does come with some caveats. LBRY/yt-synch downloads a video from YouTube, keeps it in memory, and uploads it to their platform, all as part of one process. I was initially planning on doing the same thing, but am not 100% sure how that will work with Lambdas, as these functions have a max timeout of 15 minutes and 10 GB of allowed memory usage. From my POV this shouldn't be a bottleneck, but if it turns out to be, we should probably revert to the previous approach.
The way they synch is that they add videos to a queue and assign them to workers (which enables concurrency [though for them the default max worker count is 1]). With our current approach this shouldn't really present a problem, as we can run 3000 (possibly more) concurrent Lambda function executions. If that doesn't work out for us, we would probably need to either completely revert to the previous approach or adopt some hybrid of the two.
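The queue-plus-workers pattern can be sketched generically as below; this is my reading of the pattern rather than their actual code. Each "task" would be one video's download-and-upload, and `maxWorkers` bounds how many run at once (1 for their default, potentially thousands with Lambda):

```typescript
// Generic bounded-concurrency worker pool over a task queue. Workers pull
// the next unclaimed task until the queue is empty; results keep queue order.
async function runWithConcurrency<T>(
  tasks: (() => Promise<T>)[],
  maxWorkers: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // single-threaded JS, so claiming the index is safe
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(maxWorkers, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```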
The reason I want to use Lambda functions this much, despite some of the possible problems we may have, is the speed and scope at which we should be able to process all this data. After trying to synch my YouTube channel on odysee.com, being greeted by "Please check back later. This may take up to 1 week." does not inspire faith. I may be missing something, and this may be an unavoidable fact of working on something like this, but I'd like to cross that bridge when I get there.
I think the best course for now would be to start working based on all of this information and then adapt the implementation as necessary and as the situation warrants.
We don't really need Amazon S3 (storage) at all as I initially imagined, but that does come with some caveats. LBRY/yt-synch downloads a video from YouTube, keeps it in memory, and uploads it to their platform, all as part of one process.
This would be awesome if true, saves a lot of extra pain. Less state, less pain.
The reason I want to use Lambda functions this much, despite some of the possible problems we may have
What problems will Lambdas possibly cause?
Will Lambdas bind us to AWS? I don't love that :(
What problems will Lambdas possibly cause? Will Lambdas bind us to AWS? I don't love that :(
It wouldn't really bind us to AWS, as the Lambdas would essentially just be functions extracted from a Node.js server, meaning it would be just as easy to put them back into one. That being said, Lambdas indeed don't seem to be the right tool for the job. The 15-minute timeout and 10 GB max memory usage, for example, would force us to make many requests concurrently, which would not only drive up the price but also, under heavy load, probably cause problems keeping up with all of it. This type of task is indeed much better suited to a synchronous approach.
Update on processing:
Due to my "recent" finding that DynamoDB doesn't allow more than 400 KB per put call (this can be worked around), I started wondering about something. Hope you @bedeho can shine some light on it:
400 KB, with the data I currently store, corresponds to more than 550 videos. Is it even realistic to assume we want to synch this many videos, considering the 10 GB limit? In that light, do we want to limit this number somehow?
Also, wrt the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore, and then continue synching once there is more space. The problem with this approach: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of, and I would love some more detail on this!
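One common way to work around the 400 KB item limit mentioned above is to store one item per video under a composite key, instead of one big item holding a user's whole video list. The key shape below is an assumption for illustration, not the existing schema; with the real SDK, reading a user's videos would be a Query on the partition key, which the in-memory filter stands in for here:

```typescript
// One DynamoDB item per video, keyed so all of a user's videos share a
// partition key. This sidesteps the 400 KB per-item limit entirely.
interface VideoItem {
  PK: string;        // partition key, e.g. "USER#<userId>"
  SK: string;        // sort key, e.g. "VIDEO#<videoId>"
  title: string;
  durationS: number; // illustrative payload fields
}

function toVideoItem(userId: string, videoId: string, title: string, durationS: number): VideoItem {
  return { PK: `USER#${userId}`, SK: `VIDEO#${videoId}`, title, durationS };
}

// Stand-in for a Query on PK = "USER#<userId>" against the real table.
function videosForUser(items: VideoItem[], userId: string): VideoItem[] {
  return items.filter((it) => it.PK === `USER#${userId}`);
}
```

Each put then stays far below 400 KB regardless of how many videos a channel has, which also makes the 550-video figure a non-issue.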
400 KB, with the data I currently store, corresponds to more than 550 videos. Is it even realistic to assume we want to synch this many videos, considering the 10 GB limit? In that light, do we want to limit this number somehow?
I don't really understand the context of this question. If you are asking whether there could be any time when the infrastructure is more than 550 videos behind in synchronizing Joystream with YouTube, then that is certainly true. Even with the 10 GB limit, which is in no way immutable, if 5 people sign up within a small period of time, each with > 110 videos on average, that could easily happen. You could also easily get into situations where some channels have many small videos, or where the infrastructure lags far behind, for example due to some sort of crash, and is thousands of videos behind when catching up.
Not sure if any of this applies.
Also, wrt the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore, and then continue synching once there is more space. The problem with this approach: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of, and I would love some more detail on this!
This question is also a bit unclear, but if I understand you correctly, then yes, the synching infrastructure can end up competing with manual uploads to a given channel. This simply means that the synchronization infrastructure needs to be aware of upload limits and stop trying to synchronize a channel which has no more space, or too little effective space left.
Please clarify if this is not what you had in mind.
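One possible policy for the competition between synching and manual uploads described here: only synch the next video if doing so still leaves a configurable reserve of space for the user's own uploads. The reserve idea and the function below are a suggestion, not something decided in this thread:

```typescript
// Decide whether the synching infrastructure should take the next video,
// leaving a headroom reserve for the channel owner's manual uploads.
function shouldSyncNext(
  remainingBytes: number,          // effective space left on the channel
  nextVideoBytes: number,          // size of the next video in the queue
  manualUploadReserveBytes: number // headroom kept free for manual uploads
): boolean {
  return remainingBytes - nextVideoBytes >= manualUploadReserveBytes;
}
```

With, say, a 2 GB reserve, a channel with 10 GB free would still synch a 3 GB video, but a channel with 4 GB free would not, keeping room for the owner's own uploads.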
Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YT-ers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the latest 300-400 videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)
The second part answers my question :+1:
Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YT-ers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the latest 300-400 videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)
Having a limit could make sense, and the ability to selectively change limits for individual channels manually could also be useful. We should also be able to pause and unpause synching for a given channel, so per-channel settings seem inevitable.
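The per-channel settings this implies could look roughly like the sketch below: a per-channel synch limit (changeable manually) plus a pause flag. Field and function names are hypothetical:

```typescript
// Per-channel synch settings: a manually adjustable limit and a pause flag.
interface ChannelSyncSettings {
  channelId: string;
  paused: boolean;
  maxVideos: number | null; // null = no limit on synched videos
}

// How many of the pending videos the synch process may take for this channel.
function videosToSync(
  settings: ChannelSyncSettings,
  alreadySynched: number,
  pendingCount: number
): number {
  if (settings.paused) return 0;
  if (settings.maxVideos === null) return pendingCount;
  return Math.max(0, Math.min(pendingCount, settings.maxVideos - alreadySynched));
}
```

Raising a channel's limit is then just an update to its `maxVideos`, and pausing/unpausing is a single flag flip, with no change to the synch loop itself.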
Backend:
I've found a guide on how to run Node.js code on AWS during my research, and I think the diagram found there [picture below] is pretty much exactly along the lines of what I was thinking of (with small changes).
Functionality:
Final notes:
Frontend:
Final notes:
General steps
I've tried to make a diagram for this, but it was hard not to make it confusing, as one part of this is from the user's perspective and one from the perspective of the underlying system. After fleshing this out further I think a diagram can be made, but currently what happens from the user's perspective after logging in through the form is largely unknown (from my POV). Steps (simplified):
1. Atlas channel owner opens the web application.
2. User needs to prove he owns channels on YouTube and Atlas.
3. Get all necessary data and create the user in the database.
4. Start the synching process. This means a video needs to be downloaded (from YT) and uploaded (to Atlas). This should be done one by one, both to save storage space and due to things like space constraints on Atlas and error logging.
5. The system should every so often (perhaps once a day) go through all users, check if there were any new updates, and add any new videos to the system.
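The steps above could be sketched as a per-user state progression, which might also be a starting point for the diagram. The state and event names are assumptions for illustration:

```typescript
// Per-user states through the simplified flow: prove ownership (steps 1-2),
// get registered (step 3), synch (step 4), and re-enter synching when the
// daily check (step 5) finds new uploads.
type UserState = "awaiting-proof" | "registered" | "synching" | "up-to-date";
type UserEvent = "proved-ownership" | "sync-started" | "sync-finished" | "new-videos-found";

function nextState(state: UserState, event: UserEvent): UserState {
  switch (state) {
    case "awaiting-proof":
      return event === "proved-ownership" ? "registered" : state;
    case "registered":
      return event === "sync-started" ? "synching" : state;
    case "synching":
      return event === "sync-finished" ? "up-to-date" : state;
    case "up-to-date":
      // the periodic check found new uploads, so the user needs synching again
      return event === "new-videos-found" ? "registered" : state;
  }
}
```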
Finally: I think it may make the most sense to start from the frontend, as that would let us implement only the most important features in the backend for now and therefore shorten the time to completion of the MVP.