ajayyy / SponsorBlockServer

Skip YouTube video sponsors (server side portion)
https://sponsor.ajay.app
GNU Affero General Public License v3.0

Bounty: Public database syncing system ($150) #373

Closed ajayyy closed 2 years ago

ajayyy commented 2 years ago

There should be a system that allows anyone to maintain a near-realtime (5 minutes ideally, 20 minutes at minimum) mirror of at least the sponsorTimes table.

This was discussed in Discord (message) / Matrix (message).

Possible options:


The solution should be easy to run when starting up the SponsorBlock server. Docker would probably be needed to ease deployment if it requires external dependencies beyond just Node.


Bounty has been solved. Bounty of $150 USD via PayPal to whoever solves this issue. To avoid duplicate work, make sure to comment on this issue with what you are doing.

mchangrh commented 2 years ago

I'm not sure how useful IPFS chunking would be, as a large amount of the CSV is modified in day-to-day operations, most notably votes. You can try this yourself by chunking and deduplicating different days from my historical archive:

http://sb-archive.mchang.xyz/mirror/

ajayyy commented 2 years ago

The Postgres approach is probably the best, but it does restrict all mirrors to using Postgres instead of SQLite.

mchangrh commented 2 years ago

braindump while I finish my readings

https://gist.github.com/mchangrh/d8e2837b372756c867429eb20e946798

mchangrh commented 2 years ago

I think I have a very basic implementation that would only require

and high-bandwidth rsync + nginx server(s) for hosting, but that's quite obvious, and I am willing to provide them (off topic)

https://gist.github.com/mchangrh/3c3a9e870dfc1b8aee574c791dab83ad

tl;dr:

- master (Ajay)
- storage (single / geo-distributed) / T1 mirror
- active mirror
- mirror from scratch / back from the dead

ajayyy commented 2 years ago

high-bandwidth rsync + nginx server(s)

Is there a reason this would require more bandwidth than now? Kind of the point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

mchangrh commented 2 years ago

Is there a reason this would require more bandwidth than now? Kind of the point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

The point would be to offload bandwidth from the main server onto these mirrors. They would probably be serving exports as well as the base_dumps and WALs, but if they are slower than the main server or just generally bad, clients will probably just fall back to the main server.
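
Something like this is what I mean by falling back on the main server (a rough sketch only; the mirror hostnames and the export path are made up):

```typescript
// Rough sketch: prefer mirrors, fall back to the main server if a mirror is
// slow or unreachable. Hostnames and the export path are illustrative only.
const SOURCES = [
    "https://mirror-1.example.com", // hypothetical tier 1 mirror
    "https://mirror-2.example.com", // hypothetical tier 1 mirror
    "https://sponsor.ajay.app",     // main server as the last resort
];

async function fetchExport(path: string, timeoutMs = 10_000): Promise<Response> {
    for (const base of SOURCES) {
        try {
            const res = await fetch(`${base}${path}`, {
                signal: AbortSignal.timeout(timeoutMs),
            });
            if (res.ok) return res;
        } catch {
            // mirror slow, down, or generally bad: try the next source
        }
    }
    throw new Error(`all sources failed for ${path}`);
}

// e.g. fetchExport("/database/sponsorTimes.csv")
```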

ajayyy commented 2 years ago

I guess that part really depends on how much uptake it gets beyond services like sb.ltn.fi, but that's a good point. Separation of concerns is nice.

mchangrh commented 2 years ago

The main advantage is that the tier 1 mirrors can be dumb:

a small amount of storage (1-3 GB) with minimal CPU usage and some RAM, since all they have are

If the primary storage server isn't generating the archive, then tar (which is minimal) and DB generation can run on a Docker runner.

stranger-danger-zamu commented 2 years ago

I found the addon today and it's great!

You could also abuse HTTP range requests and host an index into the CSV lines based on time. The program asks for the next CSV slice based on its newest ID, and the server just sends back the correct range (and possibly a URL). Then the program downloads only the new lines.

The beauty of this approach is that any plain old nginx instance could do this without any major changes. Additionally, since the server is serving as an index, we can point to mirrors and round-robin the load.
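
As a rough sketch of the client side (the index endpoint and hostnames here are made up; the real offsets would come from whatever index the server publishes):

```typescript
// Sketch of the range-request idea: ask a (made-up) index endpoint for the
// byte offset of the first row newer than what we already have, then fetch
// only the tail of the CSV with an HTTP Range header.
async function fetchNewRows(lastKnownId: string): Promise<string[]> {
    const idx = await fetch(`https://mirror.example.com/index?after=${lastKnownId}`);
    const { offset } = await idx.json() as { offset: number };

    // Any plain nginx instance serving the CSV as a static file honours Range
    const res = await fetch("https://mirror.example.com/sponsorTimes.csv", {
        headers: { Range: `bytes=${offset}-` },
    });
    if (res.status !== 206) throw new Error("server did not return a partial response");

    return (await res.text()).split("\n").filter(line => line.length > 0);
}
```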


Real talk though. Compress those CSV files. Even gzip is good enough:

| original file | original size | gzip'd size | xz'd size |
| --- | --- | --- | --- |
| categoryVotes.csv | 3.4M | 1.2M | 996K |
| lockCategories.csv | 3.0M | 292K | 216K |
| sponsorTimes.csv | 635M | 272M | 208M |
| unlistedVideos.csv | 5.5M | 1.8M | 1.2M |
| userNames.csv | 5.1M | 2.6M | 2.4M |
| vipUsers.csv | 4.0K | 4.0K | 4.0K |
| warnings.csv | 84K | 32K | 28K |

gzip and xz are both installed on most systems, and anyone downloading these files from the website should know how to deal with them.

If they are being used programmatically for updating, there are ways to open the compressed file and read it directly in all the popular languages.
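
For example, in Node you could stream a gzip'd dump line by line without unpacking it to disk first (the file name here is just illustrative):

```typescript
// Stream a gzip'd CSV dump line by line without unpacking it to disk first.
import { createReadStream } from "node:fs";
import { createGunzip } from "node:zlib";
import { createInterface } from "node:readline";

async function readCompressedCsv(path: string): Promise<void> {
    const lines = createInterface({
        input: createReadStream(path).pipe(createGunzip()),
        crlfDelay: Infinity,
    });
    for await (const line of lines) {
        // parse / upsert each row here
        console.log(line.split(",")[0]);
    }
}

readCompressedCsv("sponsorTimes.csv.gz").catch(console.error);
```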

mchangrh commented 2 years ago

Would this work for modified lines? A lot of the delta is not just new segments being added, but also votes and views on existing segments.

stranger-danger-zamu commented 2 years ago

Would this work for modified lines? A lot of the delta is not just new segments being added, but also votes and views on existing segments.

I don't see why you couldn't have the update logic overwrite the record when it encounters the same ID (essentially just upsert each record based on its ID).
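
As a sketch of that upsert, assuming a Postgres mirror and node-postgres, with simplified column names (the real table has more fields, and the conflict column needs a unique constraint):

```typescript
// Overwrite-on-same-ID via a Postgres upsert (node-postgres); column names
// are simplified and the conflict target must have a unique constraint.
import { Pool } from "pg";

const pool = new Pool();

async function upsertSegment(row: { uuid: string; votes: number; views: number }): Promise<void> {
    await pool.query(
        `INSERT INTO "sponsorTimes" ("UUID", "votes", "views")
         VALUES ($1, $2, $3)
         ON CONFLICT ("UUID")
         DO UPDATE SET "votes" = EXCLUDED."votes", "views" = EXCLUDED."views"`,
        [row.uuid, row.votes, row.views]
    );
}
```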

While duplication is an issue, you can also provide a periodic base file, and if someone is too far behind they just redownload the entire base file. These base files would just be snapshots of the database dumped periodically, which compacts away the accumulated updates.

It definitely keeps it simple which I think is the biggest benefit.

mchangrh commented 2 years ago

1/2 done - distribution and downloading of files

- gist explaining flags and why lz4 was chosen (w/ benchmarks): https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f
- Docker container for mirroring CSV files: https://github.com/mchangrh/sb-mirror
- aports merge for lz4: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if closed, will probably use Debian as a base instead or build it ourselves)
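
For anyone wanting to script the pull side from Node, a rough sketch of invoking rsync (the remote module path and destination are placeholders, and the flags are only illustrative; see the gist above for the ones sb-mirror actually uses. lz4 wire compression needs rsync >= 3.2):

```typescript
// Pull the CSV exports from a mirror over rsync; flags and paths are
// illustrative only (lz4 wire compression requires rsync >= 3.2).
import { execFile } from "node:child_process";

function pullDumps(remote: string, dest: string): Promise<void> {
    return new Promise((resolve, reject) => {
        execFile(
            "rsync",
            ["-rt", "--compress", "--compress-choice=lz4", "--delete", remote, dest],
            (error, _stdout, stderr) => (error ? reject(new Error(stderr || error.message)) : resolve())
        );
    });
}

// e.g. pullDumps("rsync://mirror.example.com/sponsorblock/", "/data/mirror/")
```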

whizzzkid commented 2 years ago

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

mchangrh commented 2 years ago

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

Very cool, but I don't know if it would fit SponsorBlock, since there would need to be an absolute point of truth; having a publicly writable database without the application layer would not be ideal.

mchangrh commented 2 years ago

1/2 done - distribution and downloading of files

- gist explaining flags and why lz4 was chosen (w/ benchmarks): https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f
- Docker container for mirroring CSV files: https://github.com/mchangrh/sb-mirror
- aports merge for lz4: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if closed, will probably use Debian as a base instead or build it ourselves)

2/2 done - with mode=mirror in config, almost all of the work is done

only things missing would be

If everything else is okay, then I'll do a quick change from Alpine to Debian for rsync's lz4 - pending the aports merge.

whizzzkid commented 2 years ago

@mchangrh I think that's a misconception: there is no single "source of truth", which means there are multiple sources of truth, since each node stores only the part of the graph that concerns it. The entire database is the sum of all nodes. This removes the need to distribute and sync databases; it's handled automatically by the Gun protocol.

since there would need to be an absolute point of truth, having a publicly writeable database without the application layer would not be ideal.

Not really; by that logic cryptocurrencies could never exist. We can reach out to the gun team on Gitter to see how they can help this project.

mchangrh commented 2 years ago

Not really; by that logic cryptocurrencies could never exist.

Yes, but SponsorBlock isn't a cryptocurrency. A cryptocurrency doesn't need a single source of truth; truth is established by verifying with other peers. Quite a while ago, when the BTC market was booming, there were concerns because a single pool mined 3 consecutive blocks. If a malicious actor wanted, they could spin up a majority of gun peers and override most of the database. It's cool tech, but it doesn't fit with the master/slave, single-source-of-truth model that SponsorBlock requires.

This is required by SponsorBlock since the vipUsers table should only be accessible to the admin; otherwise a bad actor could change the vipUsers table, remove or add locked segments, or ban users.

It's very cool, but even if we used gun for only sponsorTimes, there would still be the issue of distributing the other tables.

ajayyy commented 2 years ago

This is now complete! Check out @mchangrh's project to see how to host a mirror yourself: https://github.com/mchangrh/sb-mirror