CatsLover2006 / iOSobscuraServer

Search iOS Obscura!

Scrape the entire Internet Archive in the future #7

Open upintheairsheep opened 2 weeks ago

upintheairsheep commented 2 weeks ago

Hello. Once it is 100% confirmed that the Internet Archive’s DDoS fiasco has finished (1-2 weeks after everything is back up, with no downtime at all during that period), it would be a good idea to scrape the entire IPA archive collection on the site and parse everything slowly. This will be worth it, as “giant archives” do not contain everything. It will take a lot of time for obvious reasons, but there are only about 44k items in the category. Do note that the Internet Archive has a built-in antivirus and proactively takes down malware shortly after its upload.
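For illustration, a minimal sketch of what "scrape slowly" could look like: list item identifiers through the Internet Archive's cursor-based scrape search endpoint, pausing between pages. The query string is a placeholder (the thread does not name the collection), and `delay_seconds`, `build_scrape_url`, `parse_page`, and `iter_identifiers` are hypothetical names chosen for this example, not part of iOSobscuraServer.

```python
import json
import time
import urllib.parse
import urllib.request

SCRAPE_ENDPOINT = "https://archive.org/services/search/v1/scrape"


def build_scrape_url(query, cursor=None, count=100):
    """Build one request URL for the cursor-based scrape search endpoint."""
    params = {"q": query, "fields": "identifier", "count": count}
    if cursor:
        params["cursor"] = cursor
    return SCRAPE_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_page(payload):
    """Return (identifiers, next_cursor) from one decoded response page."""
    identifiers = [item["identifier"] for item in payload.get("items", [])]
    return identifiers, payload.get("cursor")


def _fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_identifiers(query, delay_seconds=5.0, fetch=None):
    """Yield identifiers page by page, sleeping between requests."""
    fetch = fetch or _fetch_json
    cursor = None
    while True:
        identifiers, cursor = parse_page(fetch(build_scrape_url(query, cursor)))
        yield from identifiers
        if not cursor:
            break  # no cursor in the response means this was the last page
        time.sleep(delay_seconds)  # assumed pause to keep the load low


# Example (placeholder query, assumption only):
# for identifier in iter_identifiers("collection:EXAMPLE_COLLECTION"):
#     print(identifier)
```

The `fetch` parameter exists only so the paging logic can be exercised without touching the network.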

CatsLover2006 commented 1 week ago

I'll figure out how to resolve this when the fiasco is finished.

upintheairsheep commented 4 days ago

I think the fiasco is finished. Your biggest "competition" is https://stuffed18.github.io/ipa-archive-updated/# , which seems to have partially scraped the archive.

upintheairsheep commented 4 days ago

https://github.com/stuffed18/ipa-archive-updated/blob/main/data/urls.json - here is a good starting point: just get the archive's identifiers and scrape them. After that scrape is complete, scrape the rest of the Internet Archive.
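A rough sketch of the "get the identifiers" step, assuming (unverified) that `urls.json` contains full `archive.org/download/...` or `archive.org/details/...` URLs somewhere in its structure. The exact layout of that file is not described in this thread, so the helper below simply walks any JSON value and collects whatever item identifiers it finds; `collect_identifiers` is a hypothetical name.

```python
import re

# Matches the item identifier in archive.org download/details URLs.
_ARCHIVE_URL = re.compile(
    r"https?://(?:www\.)?archive\.org/(?:download|details)/([^/?#\s]+)"
)


def collect_identifiers(node, found=None):
    """Recursively collect archive.org item identifiers from decoded JSON."""
    if found is None:
        found = set()
    if isinstance(node, str):
        found.update(_ARCHIVE_URL.findall(node))
    elif isinstance(node, dict):
        for value in node.values():
            collect_identifiers(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_identifiers(value, found)
    return found


# Example (assumes a local copy of the file):
# import json
# with open("urls.json", encoding="utf-8") as handle:
#     identifiers = collect_identifiers(json.load(handle))
```

If the file stores relative paths or bare identifiers instead of full URLs, the regular expression would need adjusting.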

CatsLover2006 commented 2 days ago

As much as I’d like to do this, the most powerful machine I have access to recently failed on me, and the next most powerful machine struggles with the existing list of archives. Adding to this could make the full parse take months.
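To put a rough number on that concern, a back-of-the-envelope estimate can be sketched as below. The 44k item count comes from earlier in the thread; the 60 seconds per item is a placeholder assumption, not a measurement of the actual parser, and `estimate_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate_days(item_count, seconds_per_item, workers=1):
    """Estimate wall-clock days needed to parse item_count items."""
    return item_count * seconds_per_item / workers / SECONDS_PER_DAY


# 44,000 items at an assumed 60 s each on one worker:
print(f"{estimate_days(44_000, 60):.1f} days")  # about 30.6 days
```

Slower per-item times, or items that hold many IPAs each, scale the total linearly.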

upintheairsheep commented 2 days ago

GitHub Actions exists, you could offload the process to Actions.

On Monday, November 4, 2024, CatsLover2006 wrote:

As much as I’d like to do this, the most powerful machine I have access to recently failed on me, and the next most powerful machine struggles with the existing list of archives. Adding to this could make the full parse take months.


CatsLover2006 commented 1 day ago

Actions would time out. I don’t even need to check what the theoretical maximum run time is; I’d hit a bandwidth limit somewhere along the way.
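One common way to work within a run-time cap (on any runner, not just Actions) is to make the parse resumable: process items until a time budget is spent, record what was finished, and continue in the next run. The sketch below shows only that pattern. The `process` callback, `state_path`, and `budget_seconds` are hypothetical, and it does nothing about the bandwidth limit mentioned above.

```python
import json
import os
import time


def load_done(state_path):
    """Load the set of already-processed identifiers, if any."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path, encoding="utf-8") as handle:
        return set(json.load(handle))


def save_done(state_path, done):
    """Write the state file atomically so an interrupted run cannot corrupt it."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(done), handle)
    os.replace(tmp_path, state_path)


def run_batch(identifiers, process, state_path, budget_seconds, clock=time.monotonic):
    """Process unfinished identifiers until the time budget runs out."""
    done = load_done(state_path)
    deadline = clock() + budget_seconds
    for identifier in identifiers:
        if identifier in done:
            continue  # finished in an earlier run
        if clock() >= deadline:
            break  # stop cleanly; the next run resumes from here
        process(identifier)
        done.add(identifier)
        save_done(state_path, done)
    return done
```

Rewriting the whole state file after every item is simple but wasteful for tens of thousands of entries; saving every N items or appending to a log would be a reasonable refinement.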