Joystream / youtube-synch

YouTube Synchronization

Initial thoughts (architecture, implementation,..) #3

Closed: DzhideX closed this issue 2 years ago

DzhideX commented 3 years ago

Backend:

I've found a guide on how to run Node.js code on AWS during my research, and I think the diagram found there [picture below] is pretty much exactly along the lines of what I was thinking of (with minor changes).

[Image: aws-architecture diagram]

Functionality:

Final notes:

Frontend:

Final notes:

General steps

I've tried to make a diagram for this, but it was hard not to make it confusing, as one part of this is from the user's perspective and one is from the perspective of the underlying system. After fleshing this out further I think a diagram can be made, but currently what happens from the user's perspective after logging into the form is largely unknown (from my POV). Steps (simplified):

1. Atlas channel owner opens the web application.
2. User needs to prove they own the channels on YouTube and Atlas.
3. Get all necessary data and create the user in the database.
4. Start the synching process. This means that a video needs to be downloaded (from YT) and uploaded (to Atlas). This should be done one by one, both to save storage space and because of things like space constraints on Atlas and error logging (see the sketch below).
5. The system should every so often (perhaps once a day) go through all users, check if there were any new uploads, and add any new videos to the system.
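As a loose sketch of step 4's one-at-a-time approach (every function name here is a hypothetical placeholder, not part of any existing codebase):

```typescript
// Hypothetical helpers -- the real download/upload APIs are not defined yet.
declare function downloadFromYouTube(videoId: string): Promise<Buffer>;
declare function uploadToAtlas(channelId: string, asset: Buffer): Promise<void>;
declare function markSynched(videoId: string): Promise<void>;

// Sync pending videos strictly one by one, so only a single asset occupies
// temporary storage at any time and failures are easy to log per video.
async function syncPendingVideos(channelId: string, pendingIds: string[]): Promise<void> {
  for (const videoId of pendingIds) {
    try {
      const asset = await downloadFromYouTube(videoId);
      await uploadToAtlas(channelId, asset);
      await markSynched(videoId);
    } catch (err) {
      console.error(`Failed to sync video ${videoId}:`, err);
      // Whether to stop or skip here depends on the error-handling policy chosen later.
    }
  }
}
```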

Finally: I think it may make the most sense to start from the frontend, as that would make it easier for us to implement only the most important features in the backend for now and therefore shorten the time to completion of the MVP.

bedeho commented 3 years ago

There are a few key ongoing computational tasks in this system:

  1. monitoring YT to schedule new content for publishing on Joystream
  2. downloading assets from YT, that have been found in monitoring, to some temporary storage area
  3. publishing content on Joystream by issuing the correct transactions, in an orderly fashion
  4. uploading assets from the temporary storage area to storage providers in Joystream
  5. accepting new channels submitted by users and creating those channels on Joystream

It is not clear how you are organizing all of this processing. In principle, all of it can be done independently, and probably should be if there is going to be any hope of servicing even a modest number of channels and content. But if you are going to have independent processes doing different parts of this work, how are they organized and coordinated?

Front-End

License

The user owns their content, so when they authorize us to download it for them, it's effectively them recovering their own content for their own purposes, which is something they have the right to do.

DzhideX commented 3 years ago

This is a really good question. I gave it a bit of thought and this is what I came up with:

1. I think my idea from above for this is really solid: we'll have a process running on an interval, querying YouTube for any newly uploaded videos that we aren't already aware of. This process will only update the database.

2. (3. & 4.) Initially, I wasn't really sure how to deal with these computations, but I have an idea now and you can let me know whether you think it works. I was thinking of combining steps 2, 3 and 4 and creating another process (independent from the first) that will also run on an interval (though I think this one should be shorter); a rough sketch of this two-process split follows after this list. As for why I think we should combine these steps, this is my reasoning:

3. (5) With regard to the last one, I'm not sure I quite understood this part?
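To make the proposed split concrete, here is a rough sketch of the two interval-driven processes coordinating only through the database. The helper names and intervals are assumptions for illustration, and in practice the two loops would run as separate processes rather than in one file:

```typescript
// Hypothetical data-access helpers; the real persistence layer is still undecided.
declare function fetchTrackedChannels(): Promise<{ id: string; lastCheckedAt: Date }[]>;
declare function listNewYoutubeVideos(channelId: string, since: Date): Promise<string[]>;
declare function enqueuePendingVideos(channelId: string, videoIds: string[]): Promise<void>;
declare function dequeuePendingVideo(): Promise<{ channelId: string; videoId: string } | null>;
declare function syncSingleVideo(channelId: string, videoId: string): Promise<void>;

// Process 1: monitor YouTube and only write to the database (e.g. once a day).
setInterval(async () => {
  for (const channel of await fetchTrackedChannels()) {
    const newVideos = await listNewYoutubeVideos(channel.id, channel.lastCheckedAt);
    await enqueuePendingVideos(channel.id, newVideos);
  }
}, 24 * 60 * 60 * 1000);

// Process 2: on a shorter interval, drain pending videos one at a time,
// covering download, publishing and upload for each video.
setInterval(async () => {
  const next = await dequeuePendingVideo();
  if (next) await syncSingleVideo(next.channelId, next.videoId);
}, 5 * 60 * 1000);
```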

To recap and add a little more info about the main /user [POST] endpoint:

DzhideX commented 3 years ago

After learning some new things, I've decided to change up my current approach a little when it comes to the architecture. With the approach described before, we would essentially host our own server on AWS infrastructure running 24/7, very similarly to a traditional approach. This is not very aligned with the pay-as-you-go promise that the cloud usually affords us. It also comes with some other unnecessary complexities that this new approach would abstract away for us.

The new idea is to use AWS Lambda. This is essentially "functions as a service", where one only pays for the functionality one actually uses. It will also be substantially easier to implement, as a lot of the configuration and general complexity that comes with building a REST API from scratch is handled by AWS for us. The idea is to (most probably) have 3 Lambda functions: put 1 of them behind an API Gateway trigger (think: a regular API) and have 2 of them be triggered on a timer.
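As a rough illustration of that shape only (handler names and responsibilities are assumptions, and the triggers themselves would be wired up in whatever deployment configuration is chosen):

```typescript
import { APIGatewayProxyHandler, ScheduledHandler } from 'aws-lambda';

// Lambda 1: behind an API Gateway trigger, e.g. registering a new channel to sync.
export const registerChannel: APIGatewayProxyHandler = async (event) => {
  const payload = JSON.parse(event.body ?? '{}');
  // ...validate channel ownership and persist the channel record...
  return { statusCode: 201, body: JSON.stringify({ ok: true, channel: payload }) };
};

// Lambda 2: runs on a timer, checks YouTube for new videos and records them in the database.
export const checkForNewVideos: ScheduledHandler = async () => {
  // ...query YouTube, write any unseen videos to the database...
};

// Lambda 3: runs on a (shorter) timer, processes recorded videos one by one.
export const processPendingVideos: ScheduledHandler = async () => {
  // ...download from YouTube, publish and upload to Joystream...
};
```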

I've also taken a look at the code at lbryio/yt-synch (which essentially does the same thing that we're trying to do here) and have been able to get some takeaways from it as well:

I think the best course for now would be to start working based on all of this information and then adapt the implementation as necessary and as the situation warrants.

bedeho commented 3 years ago

We don't really need Amazon S3 (storage) at all, as I had initially imagined, but that does come with some caveats. LBRY/yt-synch downloads a video from YouTube, keeps it in memory, and uploads it to their platform all as part of one process.

This would be awesome if true, saves a lot of extra pain. Less state, less pain.

The reason behind me wanting to use Lambda functions so much, despite some of the possible problems we may have

What problems will Lambdas possibly cause?

bedeho commented 3 years ago

Will Lambdas bind us to AWS? I don't love that :(

DzhideX commented 3 years ago

What problems will Lambdas possibly cause? Will Lambdas bind us to AWS? I don't love that :(

It wouldn't really bind us to AWS at all, as the Lambdas would essentially just be functions extracted from a Node.js server, meaning it would be just as easy to put them back into one. That being said, Lambdas indeed don't seem to be the right tool for the job. The 15-minute timeout and 10 GB maximum memory, for example, would force us to make a lot of requests concurrently, which would not only drive up the price but also, under a lot of load, probably cause problems with keeping up with all of it. This type of task is indeed much better suited to a synchronous approach.

Update on processing:

DzhideX commented 3 years ago

Due to my "recent" finding that DynamoDB doesn't allow more than 400 KB per put call (this can be worked around), I started wondering about a couple of things. Hope you @bedeho can shed some light on them:

400 KB, with the current data I store, seems to correspond to over 550 videos. Is it even realistic to assume that we want to synch this many videos considering the 10 GB limit? So in that light, do we want to limit this number somehow?

Also, with regard to the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore and then, once there is more space, continue synching. The problem with this approach is: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of and I would love some more detail on this!
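As an aside, the workaround mentioned above would typically mean storing one item per video instead of one ever-growing channel item. A hedged sketch, with hypothetical table and attribute names:

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// One item per video, partitioned by channel, so no single item ever
// approaches the 400 KB limit regardless of how many videos a channel has.
async function recordVideo(channelId: string, videoId: string, title: string): Promise<void> {
  await ddb.send(
    new PutCommand({
      TableName: 'videos', // hypothetical table name
      Item: { channelId, videoId, title, state: 'pending' },
    })
  );
}
```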

bedeho commented 3 years ago

400 KB, with the current data I store, seems to correspond to over 550 videos. Is it even realistic to assume that we want to synch this many videos considering the 10 GB limit? So in that light, do we want to limit this number somehow?

I don't really understand the context of this question. If you are asking whether there could be any time when the infrastructure is more than 550 videos behind in synchronizing Joystream with YouTube, then that is certainly true. Even with the 10 GB limit, which is in no way immutable, if 5 people sign up within a small period of time, each with more than 110 videos on average, then that could easily happen. You could also easily get into situations where some channels have many small videos, or where the infrastructure lags far behind, for example due to some sort of crash, and is thousands of videos behind when catching up.

Not sure if any of this applies.

Also, with regard to the limits: would you mind explaining the process you had in mind for this? The idea in my mind was to synch until you can't anymore and then, once there is more space, continue synching. The problem with this approach is: what if the user wants to upload another video to Atlas while it's synching? There are probably other things I haven't thought of and I would love some more detail on this!

This question is also a bit unclear, but if I understand you correctly, then yes, the synching infrastructure can end up competing with manual uploads to a given channel. This simply means that the synchronization infrastructure needs to be aware of upload limits and stop trying to synchronize a channel which has no more space, or too little effective space left.

Please clarify if this is not what you had in mind.
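A small sketch of what that awareness might look like; the quota shape and helper names here are assumptions, not an existing API:

```typescript
// Hypothetical view of a channel's storage quota on Joystream.
interface ChannelQuota {
  totalBytes: number;
  usedBytes: number;
}

declare function fetchChannelQuota(channelId: string): Promise<ChannelQuota>;

// Skip synching a video if it would not fit in the channel's remaining space,
// optionally leaving headroom for manual uploads competing for the same quota.
async function canSyncVideo(
  channelId: string,
  videoSizeBytes: number,
  headroomBytes = 0
): Promise<boolean> {
  const quota = await fetchChannelQuota(channelId);
  return quota.usedBytes + videoSizeBytes + headroomBytes <= quota.totalBytes;
}
```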

DzhideX commented 3 years ago

Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YouTubers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the 300-400 latest videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)

The second part answers my question :+1:

bedeho commented 3 years ago

Reading back the first question, it's quite unclear what I meant exactly, so I'll try to explain it better. I wanted to ask whether we want to limit the number of videos associated with one user to a certain number. Like, if a user has 1000 (not too unlikely from my experience, especially for more popular YouTubers) or 5000 videos associated with their YouTube channel, do we want to upload all of those to Joystream, or maybe only the 300-400 latest videos, or something else? (Let's disregard the 400 KB part as it isn't too important.)

Having a limit could make sense, and the ability to selectively change limits for individual channels manually could be useful. We should also be able to pause and unpause synching for a given channel, so per-channel settings seem inevitable.
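A possible shape for those per-channel settings (the field names are illustrative only, not a settled schema):

```typescript
// Hypothetical per-channel synchronization settings.
interface ChannelSyncSettings {
  channelId: string;
  // Optional cap on how many videos to synch for this channel; undefined means no cap.
  videoLimit?: number;
  // Lets operators pause and later resume synching for this channel.
  paused: boolean;
}
```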