Custom Scratch website scraper (Voyager)

NotFenixio commented 1 year ago

Recently, ScratchDB has been acting up, causing problems for Snazzle, which relies on it. But we can't just toss ScratchDB aside because it's our main source of info. So, here's an idea: Let's create our own ScratchDB.

To do this, we'd learn from ScratchDB's way of doing things. We can use Playwright, Selenium, and/or Atoma to grab info from Scratch, and then BeautifulSoup to clean it up and get the data we need.

Now, this idea is kind of like a trial run, like taking a poll. I want to know what y'all think about it. Would this be a good move?

EngineerRunner commented 1 year ago

i mean, i'm mostly working on Pyratch now, but this'd be really helpful for all alternative frontends, so i support this idea.

NotFenixio commented 1 year ago

Awesome! We need a name... Any suggestions?

NotFenixio commented 1 year ago

I'll just call it ScratchedDB for now.

redstone-dev commented 1 year ago

I really like this idea, however, we'd need reliable hosting with the closest to 100% uptime we can get. I have an AWS account so we could try that, but it's really expensive so we'd need to get our money's worth out of it.

As for the name, we could call it Voyager. (Thanks ChatGPT)

We could also try writing it in Rust for funzies.

NotFenixio commented 1 year ago

Voyager then! I'll rename the repo in a moment.

The idea is to create a locally deployable ScratchDB, so whoever downloads Snazzle or any other alternative frontend will be hosting their own ScratchDB.

Also, I discovered that AWS has a 12-month free tier for Amazon EC2 which we can use to deploy this new thing for 1 year. (A simple Glitch project with UptimeRobot could do the thing too)

And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

redstone-dev commented 1 year ago

And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

Rust rewrite of Svelte??? /j

davidtheplatform commented 1 year ago

Random suggestions: Use requests and Beautiful Soup, since that uses way less RAM (there are also RSS feeds, but they don't have every post)

Have a centralized server to reduce load on scratch but clients have a local cache/scraper in case the server goes down

Clients can choose whether they want stale data immediately or updated data that takes longer to get
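The requests plus Beautiful Soup suggestion above boils down to fetching a page as plain HTML and pulling fields out of it. Here is a minimal sketch of the parsing half, using only Python's standard-library `html.parser` as a stand-in for Beautiful Soup so it runs without extra installs. The sample markup and the `topic-link` class are invented for illustration; the real Scratch forum HTML will differ.

```python
from html.parser import HTMLParser

# Hypothetical markup: NOT the real Scratch forum HTML.
SAMPLE_HTML = """
<table>
  <tr><td><a class="topic-link" href="/discuss/topic/1/">Welcome thread</a></td></tr>
  <tr><td><a class="topic-link" href="/discuss/topic/2/">Bug reports</a></td></tr>
  <tr><td><a href="/about/">About</a></td></tr>
</table>
"""


class TopicLinkParser(HTMLParser):
    """Collects (href, title) pairs for <a class="topic-link"> elements."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._href = None  # href of the topic link currently being read
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "topic-link":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.topics.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_topics(html):
    """Return a list of (href, title) tuples found in the given HTML."""
    parser = TopicLinkParser()
    parser.feed(html)
    return parser.topics
```

In a real scraper the HTML string would come from an HTTP response body instead of `SAMPLE_HTML`, and the matching rule would be adjusted to whatever the forum pages actually contain.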

redstone-dev commented 1 year ago

Clients can choose whether they want stale data immediately or updated data that takes longer to get

I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.

davidtheplatform commented 1 year ago

I think we could combine Voyager with a system on the client that checks if the RSS data has new posts that Voyager doesn't have yet, in which case it sends this data to the central Voyager server and then displays the new data to the user.

It's probably better to avoid sending requests to Scratch if we don't have to. What I meant was that if the client doesn't care about having the most up-to-date data, it can tell the server so, and then the server doesn't have to make a request to the Scratch servers.
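One way to picture the stale-or-fresh choice is a small cache where the caller passes a flag. This is only a sketch: the `TopicCache` class, the `allow_stale` flag, and the 300-second default age are all made-up names and values, and `fetch` stands for whatever function actually talks to Scratch.

```python
import time


class TopicCache:
    """Caches fetched values and lets the caller opt in to stale data."""

    def __init__(self, fetch, max_age_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch              # function that hits the upstream site
        self._max_age = max_age_seconds  # how long an entry counts as fresh
        self._clock = clock              # injectable so tests can fake time
        self._entries = {}               # key -> (stored_at, value)

    def get(self, key, allow_stale=False):
        entry = self._entries.get(key)
        if entry is not None:
            stored_at, value = entry
            fresh = self._clock() - stored_at <= self._max_age
            # A stale entry is still returned if the caller said that is fine,
            # which saves one request to the upstream site.
            if fresh or allow_stale:
                return value
        value = self._fetch(key)
        self._entries[key] = (self._clock(), value)
        return value
```

A client that wants an instant answer calls `cache.get(key, allow_stale=True)`; one that needs current data calls `cache.get(key)` and accepts the extra wait when the entry has expired.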

redstone-dev commented 1 year ago

It's probably better to avoid sending requests to Scratch if we don't have to. What I meant was that if the client doesn't care about having the most up-to-date data, it can tell the server so, and then the server doesn't have to make a request to the Scratch servers.

To that end, we should also add rate-limiting (maybe only 3 requests a second?) to avoid stressing the Scratch servers. We may need to increase this number based on website traffic, though. Ideally the server should do this automatically somehow.
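A rough sketch of the fixed-rate part of this idea, assuming a single scraper process: every outgoing request first calls `wait()`, which spaces calls at least `1 / rate` seconds apart. The `RateLimiter` name and its parameters are illustrative, and the automatic adjustment mentioned above is not covered here.

```python
import time


class RateLimiter:
    """Spaces calls at least 1/rate seconds apart (default: 3 per second)."""

    def __init__(self, rate=3.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock    # injectable so tests can fake time
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        # Schedule the earliest start time of the following request.
        self._next_allowed = max(now, self._next_allowed) + self._interval
```

Usage would look like `limiter.wait()` immediately before each HTTP request. Changing the limit later only means constructing the limiter with a different `rate`.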

ajskateboarder commented 1 year ago

We don't need any browser automation tools, Scratch forums are easy to fetch over HTTP requests

If we are using Rust, we can use the reqwest and scraper crates for data and serve it over actix. I can work on it whenever I have free time

To that end, we should also add rate-limiting (maybe only 3 requests a second?) to avoid stressing the Scratch servers. We may need to increase this number based on website traffic, though. Ideally the server should do this automatically somehow.

I think 3 requests/second is fine

NotFenixio commented 1 year ago

We're building it in Python, but we need some help with specific functions that require indexing Scratch. https://github.com/users/NotFenixio/projects/3/views/1

davidtheplatform commented 1 year ago

I’m working on a scraper right now that uses SQLite
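For readers wondering what SQLite-backed storage for a scraper could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The `posts` table and its columns are assumptions made for this example, not the schema of the scraper mentioned above.

```python
import sqlite3

# Hypothetical schema: column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id  INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL,
    author   TEXT NOT NULL,
    content  TEXT NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_post(conn, post_id, topic_id, author, content):
    """Insert a post, replacing any earlier copy with the same post_id."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO posts (post_id, topic_id, author, content) "
            "VALUES (?, ?, ?, ?)",
            (post_id, topic_id, author, content),
        )


def posts_in_topic(conn, topic_id):
    """Return (post_id, author, content) rows for a topic, oldest id first."""
    cur = conn.execute(
        "SELECT post_id, author, content FROM posts "
        "WHERE topic_id = ? ORDER BY post_id",
        (topic_id,),
    )
    return cur.fetchall()
```

Using `INSERT OR REPLACE` means re-scraping an edited post simply overwrites the stored copy, so the scraper does not need to check whether a post is already known.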

davidtheplatform commented 1 year ago

Voyager then! I'll rename the repo in a moment.

The idea is to create a locally deployable ScratchDB, so whoever downloads Snazzle or any other alternative frontend will be hosting their own ScratchDB.

Also, I discovered that AWS has a 12-month free tier for Amazon EC2 which we can use to deploy this new thing for 1 year. (A simple Glitch project with UptimeRobot could do the thing too)

And for the Rust thing, I don't know... Let's try doing it in Python and leaving that for the future Snazzle Svelte/Rust port.

Depending on how much load there is I will probably be able to host it

redstone-dev commented 1 year ago

I’m working on a scraper right now that uses SQLite

Since Voyager is already being made by @NotFenixio, I had an idea.

When you both get your ideas usable in Snazzle, we can vote on the better one and we’ll use that. I might create my own entry as well.

Depending on how much load there is I will probably be able to host it

The idea is to create a more reliable service, so we should use the cloud for maximum uptime.

davidtheplatform commented 1 year ago

The idea is to create a more reliable service, so we should use the cloud for maximum uptime.

I’m going to initially run mine on my pi but if you find a free cloud service I can switch it

NotFenixio commented 1 year ago

Voyager Update 1.0.0

Changelog:

Voyager can now retrieve the first 25 topics from a specific category on a specific page (starting from 1).
Includes more information not present in ScratchDB such as Author and Sticky state.

Issues:

Responses don't have time marks (this is going to change for sure).
Only retrieves 25 topics compared to ScratchDB's 50 topics.

redstone-dev commented 1 year ago

Only retrieves 25 posts compared to ScratchDB's 50 posts.

Why only 25? Also, you should maybe announce these things in Voyager's repo.

I'll make an announcement in the Scratch forum thread saying that all Voyager-related concerns should be funneled to the Voyager repo.

Also, we could probably rename it to Voyageur to avoid being confused with the—frankly fascinating—space probe.

NotFenixio commented 1 year ago

Why only 25?

Scratch only shows the first 25 topics per page. I'm working on improving that in the broken-more-topics branch of the Voyager repo.
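One possible way to close the 25-versus-50 gap is to serve each 50-topic page from two consecutive 25-topic source pages. The sketch below shows only that page arithmetic; `source_pages`, `fetch_combined`, and the `fetch_page` callback are hypothetical names, and this is not necessarily what the branch does.

```python
def source_pages(page, page_size=50, source_page_size=25):
    """Return the upstream page numbers that make up one output page.

    Pages are 1-based. page_size must be a multiple of source_page_size.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page_size % source_page_size != 0:
        raise ValueError("page_size must be a multiple of source_page_size")
    per_page = page_size // source_page_size
    first = (page - 1) * per_page + 1
    return list(range(first, first + per_page))


def fetch_combined(page, fetch_page, page_size=50, source_page_size=25):
    """Concatenate the topics from every upstream page behind one output page.

    fetch_page is any function that takes an upstream page number and
    returns a list of topics.
    """
    topics = []
    for upstream_page in source_pages(page, page_size, source_page_size):
        topics.extend(fetch_page(upstream_page))
    return topics
```

With the defaults, output page 1 maps to source pages 1 and 2, output page 2 to source pages 3 and 4, and so on, at the cost of two upstream requests per output page.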

Also, you should maybe announce these things in Voyager's repo.

Yeah, I just wanted y'all to get updates.

Also, we could probably rename it to Voyageur to avoid being confused with the—frankly fascinating—space probe.

I think it's better to keep the current name. Another name change could probably break more things.

dynamixbot commented 7 months ago

Voyager should be kept as a different project, developed simultaneously with Snazzle if you are planning to make it as an alternative to ScratchDB

It should be put into two parts, Pioneer, the scraper and Horizons, the DB.

Pioneer and Horizons would work together to form Voyager. Pioneer's sole purpose would be to keep scraping and Horizons' sole purpose would be to store the data scraped by Pioneer.

Instead of being built in Python, it should be built with Go's Colly scraper, as it is scalable, efficient, fast, parallelized, and has several built-in functions. The DB would be the C++-based ScyllaDB if data is stored locally, or Google Cloud if the application should be cloud-based.

Also, should the Voyager system be run locally, there should be at least 8GB of RAM and 2TB of high-speed storage per node. The ideal candidate for a local server would be four Raspberry Pi Compute Module 4s, specced with 8GB of RAM and no eMMC. The carrier board for it would be the Turing Pi 2.5. Each node would have 2TB of NVMe storage. The overall cost for it would be about 1000$. This could easily handle every forum post ever created, plus every forum post created over about the next 5 years.

mybearworld commented 7 months ago

The overall cost for it would be about 1000$.

That might be a bit of a problem.

NotFenixio commented 7 months ago

Replying to dynamixbot...

I mean it's not a bad idea but we don't have that kind of money. Why would we buy a 300$ carrier for such a small project? A 3D printed case is like 15$. Also, the technology is not ideal. I don't think anyone here knows Go.

Voyager should be kept as a different project, developed simultaneously with Snazzle if you are planning to make it as an alternative to ScratchDB

Voyager is a separate project.

redstone-dev commented 7 months ago

I know I said on the snazzle topic that voyager would be canceled but if we implement it the way @dynamixbot said, it would be exactly the same as what I was trying to do :P so voyager is un-canceled now

I'll run it on my Pi 4B once I get it upgraded. I'm gonna get a super fast, high capacity SSD for it and a better fan so it can't overheat. I might also get a Pi 5 to run it in a cluster with and have them be able to access the same storage but that is incredibly tentative at the moment.

Also, to address the more alarming thing: I apologize for ending the project (it's not now lol) without any warning to team members. I should have asked you about it before making an executive decision and announcement. Going forward I will make contact with all of you before making any drastic decisions.

dynamixbot commented 7 months ago

I mean it's not a bad idea but we don't have that kind of money. Why would we buy a 300$ carrier for such a small project? A 3D printed case is like 15$. Also, the technology is not ideal. I don't think anyone here knows Go.

I guess we could scale in the future when required. Also, I know Go and I can make a Pioneer prototype. Only Horizons needs to be handled with the Google Cloud free tier.

I know I said on the snazzle topic that voyager would be canceled but if we implement it the way @dynamixbot said, it would be exactly the same as what I was trying to do :P so voyager is un-canceled now

Okay, so are we doing it or not? We can make a PCB which carries the Compute Module 4. We can then in the eventual future scale up when we start lagging. We should do this with Raspberry Pi only as it is convenient and cheap.

davidtheplatform commented 7 months ago

Okay, so are we doing it or not? We can make a PCB which carries the Compute Module 4. We can then in the eventual future scale up when we start lagging. We should do this with Raspberry Pi only as it is convenient and cheap.

I can set up a raspberry pi as a server

dynamixbot commented 7 months ago

The overall cost for it would be about 1000$.

That might be a bit of a problem.

I mean ScratchDB costs about 100$ every month to maintain (estimated figure)

dynamixbot commented 7 months ago

I can set up a raspberry pi as a server

Do you have a Compute Module 4? That way, I can design a PCB in <1 month and the PCB would only cost about 5$. It would be suited to our needs.

davidtheplatform commented 7 months ago

I mean ScratchDB costs about 100$ every month to maintain (estimated figure)

Where is that 100$ coming from? If it's the cost of the hardware, it's realistic, but that's a one-time cost. Internet costs could be that high (I have no idea how much data ScratchDB serves).

davidtheplatform commented 7 months ago

Do you have a Compute Module 4? That way, I can design a PCB in <1 month and the PCB would only cost about 5$. It would be suited to our needs.

I have a normal Pi 4, which is the same but with more I/O. And how would you get the PCB to me?

EngineerRunner commented 7 months ago

Do you have a Compute Module 4? That way, I can design a PCB in <1 month and the PCB would only cost about 5$. It would be suited to our needs.

who even are you and why do you care about this? plus, we don't need a custom-made PCB for a small project when a Pi 4 would work just as well.

EngineerRunner commented 7 months ago

also, nobody seemed to mention that Voyager (iirc) is designed to be deployed locally. even if we were to have a public instance, i have a much better idea than spending $1k and $100 a month:

somebody uses a Pi that they already have, and they buy 1 or 2 usb hard drives. it's almost like it's an extremely obvious solution and doesn't require custom PCBs and shit, and would be very cheap.

dynamixbot commented 7 months ago

who even are you and why do you care about this? plus, we don't need a custom-made PCB for a small project when a Pi 4 would work just as well.

I am dynamixbot, dynamicsofscratch on scratch and I care about this because ScratchDB is a hassle, Snazzle looks really cool and I have some cool ideas.

dynamixbot commented 7 months ago

I have a normal Pi 4, which is the same but with more I/O. And how would you get the PCB to me?

I guess we can go with the default Pi 4

dynamixbot commented 7 months ago

Where is that 100$ coming from? If it's the cost of the hardware, it's realistic, but that's a one-time cost. Internet costs could be that high (I have no idea how much data ScratchDB serves).

Internet costs and traffic costs.

EngineerRunner commented 7 months ago

Internet costs and traffic costs.

yeah, but it's not like any of us don't pay for internet already.

dynamixbot commented 7 months ago

What if instead of scraping we just use the Scratch API? It is already documented on the wiki and can be used to do everything that can already be done on Scratch. We just have to focus on getting the extra features we want ready.

EngineerRunner commented 7 months ago

What if instead of scraping we just use the Scratch API? It is already documented on the wiki and can be used to do everything that can already be done on Scratch. We just have to focus on getting the extra features we want ready.

there's no forums API. that's the entire point of ScratchDB, and now Voyager.

ajskateboarder commented 7 months ago

there's no forums API. that's the entire point of ScratchDB, and now Voyager.

Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

dynamixbot commented 7 months ago

there's no forums API. that's the entire point of ScratchDB, and now Voyager.

Well at least we can use it for Snazzle's projects and profiles and players and stuff.

dynamixbot commented 7 months ago

Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

Hey also how many bots would we need to scrape the forums?

Like equal to how many users active on the forums?

davidtheplatform commented 7 months ago

Hey also how many bots would we need to scrape the forums?

Like equal to how many users active on the forums?

How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

ajskateboarder commented 7 months ago

Also, fetching and parsing forum posts on demand is likely less optimal and more taxing on Scratch's servers than if Voyager scraped the forums using a few indexers

Hey also how many bots would we need to scrape the forums? Like equal to how many users active on the forums?

How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

I don't think the # of bots refers to number of accounts being used, just the number of scraping processes running in parallel
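Reading "bots" as parallel scraping workers, the idea can be sketched with a thread pool from Python's standard library. The `scrape_topics` helper, the `fetch_topic` callback, and the default of 4 workers are assumptions for illustration, not anything Voyager currently ships.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_topics(topic_ids, fetch_topic, workers=4):
    """Fetch several topics concurrently and return {topic_id: result}.

    fetch_topic is any function that takes a topic id and returns its data.
    workers is the number of scraping threads running in parallel.
    """
    ids = list(topic_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input ids.
        results = pool.map(fetch_topic, ids)
        return dict(zip(ids, results))
```

More workers finish a crawl sooner but also hit Scratch harder, so in practice the workers would need to share whatever request limit the project settles on.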

dynamixbot commented 7 months ago

Hey also how many bots would we need to scrape the forums? Like equal to how many users active on the forums?

How fast do you want it to be? Also the forums aren’t session based so # of bots doesn’t really mean anything

Oh okay.

I don't think the # of bots refers to number of accounts being used, just the number of scraping processes running in parallel

Well, I meant the opposite. I was thinking that instead of loading everything and storing it in the cloud, we would only download a page when it is needed or requested. And once a page is loaded, other people don't have to go through the slow first view of a forum.

NotFenixio commented 7 months ago

I've deleted the Voyager repository that was on my profile in favor of the new organization for Voyager, GetVoyager. The new Voyager version will be developed in the Voyager repository there. By the way, should we have 3 separate repositories for Pioneer, Horizons, and the actual service?

dynamixbot commented 7 months ago

I've deleted the Voyager repository that was on my profile in favor of the new organization for Voyager, GetVoyager. The new Voyager version will be developed in the Voyager repository there. By the way, should we have 3 separate repositories for Pioneer, Horizons, and the actual service?

Subdirectories would be better than opening a whole new repository.

dynamixbot commented 7 months ago

need people for voyager

dynamixbot commented 6 months ago

@redstone-dev need people for voyager

redstone-dev commented 6 months ago

@redstone-dev need people for voyager

I think we could all work on Voyager and Snazzle at the same time, though I think you and @NotFenixio should decide on that, since you're basically the heads of the project.

NotFenixio commented 6 months ago

LGTM.

Mrdev88 commented 6 months ago

This idea is very good, I'll support this
