PretendoNetwork / archival-tools

A collection of tools dedicated to archiving several types of data from many Nintendo WiiU and 3DS games

Add BOSS (SpotPass) archiver #6

Closed jonbarrow closed 7 months ago

jonbarrow commented 8 months ago

Adds in a basic BOSS (SpotPass) archiver. There's no easy way to get this data, so this is very incomplete and kind of slow.

The list of Wii U tasks was generated from the logs of our BOSS server.

The list of 3DS tasks was generated from the logs of our BOSS server and from this archive of URLs.

The 3DS scraper has the potential to fail because NPFL seems to be inconsistent in returning file lists, and there are several ways to format the NPDL URLs, which also seems inconsistent. I made my best guess at what should get us the most data.
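One way to cope with the inconsistent NPDL URL layouts is to try each known format in turn and keep the first one that works. A minimal sketch of that fallback strategy — the format strings below are hypothetical placeholders, not the real NPDL layouts, and `fetchFn` is injected so the logic can be exercised without network access:

```javascript
// Sketch: try each candidate NPDL URL format in order until one succeeds.
// The format strings here are HYPOTHETICAL placeholders; the real scraper
// would substitute the NPDL layouts it actually knows about.
const CANDIDATE_FORMATS = [
  (task, file) => `https://example.invalid/npdl/${task}/${file}`,
  (task, file) => `https://example.invalid/npdl/${task}/files/${file}`,
];

// fetchFn is injectable so the fallback logic can be tested offline
async function downloadWithFallback(task, file, fetchFn) {
  for (const format of CANDIDATE_FORMATS) {
    const url = format(task, file);
    try {
      const response = await fetchFn(url);
      if (response.ok) {
        return { url, body: await response.arrayBuffer() };
      }
    } catch {
      // Network error for this candidate: fall through to the next format
    }
  }
  return null; // every candidate format failed for this task/file pair
}
```

The downside is extra requests per file when the first formats miss, which matters for a scraper that is already slow.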

jonbarrow commented 8 months ago

@DaniElectra So @InternalLoss brought up a good point on Discord:

if youre running out of storage locally could always run it in increments and push the files up to R2?

This is how we handled the other archivers as well, specifically Super Mario Maker. But this scraper isn't designed in a way that would allow for this. Maybe we should change the approach?

For Super Mario Maker, what I ended up doing was making an SQLite DB with the possible course IDs left to check, as well as a flag for whether or not each ID has been checked. Each iteration would select some number of unprocessed rows from the DB, process them in parallel, and then update their rows as they finish. That way, even between script reboots (which happened often on the VPS due to the script crashing), it would never skip IDs.

We could maybe do something like that here? Add another script called build-database.js (or something) which takes in the JSON lists and builds the same kind of SQLite DB, and then, rather than running the scraper like it is now, do the same "pick some unprocessed rows and process them" loop.

The downside to this is that unless it selects many rows at once, it would likely be much slower. Right now the script runs every country and language combination for both consoles in parallel, and the Node runtime manages all 1,000+ promises at once. The SQLite DB route would let us work in increments, but would result in fewer promises running in parallel (unless we set a high default batch size).
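The batch size is effectively a concurrency limit. One way to keep many promises in flight without launching everything at once is a small worker pool — a sketch under assumed names (`runWithLimit` and the task-array shape are not from the current script):

```javascript
// Sketch: run `tasks` (an array of async functions) with at most `limit`
// promises in flight at once. A high limit approaches the current
// "everything in parallel" behaviour; a low limit trades speed for
// gentler memory and connection usage.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index.
  // The claim (`next++`) is synchronous, so workers never double-claim.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  await Promise.all(Array.from({ length: limit }, worker));
  return results;
}
```

With something like this, the SQLite route and high parallelism aren't mutually exclusive: select a large batch of rows, then feed them through a pool sized however high we're comfortable with.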

Thoughts?

DaniElectra commented 8 months ago

The database route sounds reasonable. I'm not sure if there are any BOSS tasks that are still actively updating right now, but maybe we could also store the last-modified timestamp for caching, and check those tasks for updates later?
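The standard HTTP mechanism for this is storing the `Last-Modified` response header and sending it back as `If-Modified-Since`, so the server can answer `304 Not Modified` without resending the body. A rough sketch of what the re-check could look like — the `last_modified` column name is an assumption, and `fetchFn` is injected for offline testing:

```javascript
// Sketch: re-check a task only when the server reports new content, using
// a stored Last-Modified value (column name `last_modified` is an
// assumption about the proposed SQLite schema).
async function checkForUpdate(row, fetchFn) {
  const headers = {};
  if (row.last_modified) {
    // Ask the server to skip the body if nothing changed since our copy
    headers['If-Modified-Since'] = row.last_modified;
  }

  const response = await fetchFn(row.url, { headers });
  if (response.status === 304) {
    return { updated: false, row }; // archived copy is still current
  }

  // New content: remember the server's timestamp for the next check
  return {
    updated: true,
    row: { ...row, last_modified: response.headers.get('Last-Modified') },
  };
}
```

This only helps for servers that honor conditional requests, so it would be worth spot-checking whether the BOSS CDN actually returns 304s before building the schema around it.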

jonbarrow commented 7 months ago

@DaniElectra Does this look good to you, by the way? I personally think it's done; it just needs entries added to the database now.