Open upintheairsheep opened 2 weeks ago
I'll figure out how to resolve this when the fiasco is finished.
I think the fiasco is finished. Your biggest "competition" is https://stuffed18.github.io/ipa-archive-updated/# , which seems to have partially scraped the archive.
https://github.com/stuffed18/ipa-archive-updated/blob/main/data/urls.json is a good starting point: just get the archive identifiers from it and scrape those items. Once that scrape is complete, scrape the rest of the Internet Archive.
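A rough sketch of the "get the archive's identifiers" step, in case it helps. It assumes the entries in urls.json can be flattened into plain URL strings (the file's actual layout is not checked here) and that they use the usual `archive.org/download/<identifier>/...` or `archive.org/details/<identifier>` form. The function names are made up for illustration.

```python
from urllib.parse import urlparse


def archive_identifier(url):
    """Return the item identifier from an archive.org download/details URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Only accept archive.org itself or one of its subdomains.
    if host != "archive.org" and not host.endswith(".archive.org"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expected shapes: /download/<identifier>/<file> or /details/<identifier>
    if len(parts) >= 2 and parts[0] in ("download", "details"):
        return parts[1]
    return None


def unique_identifiers(urls):
    """Collect identifiers in first-seen order, skipping duplicates and non-matching URLs."""
    seen = set()
    ordered = []
    for url in urls:
        ident = archive_identifier(url)
        if ident and ident not in seen:
            seen.add(ident)
            ordered.append(ident)
    return ordered
```

Deduplicating first means each item is only fetched once, even if many IPA URLs point into the same archive item.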
As much as I’d like to do this, the most powerful machine I have access to recently failed on me, and the next most powerful machine struggles with the existing list of archives. Adding to this could make the full parse take months.
GitHub Actions exists; you could offload the process to Actions.
Actions would time out. I don’t even need to check what the theoretical maximum run time would be; I’d hit a bandwidth limit somewhere along the way.
Hello, once it is 100% confirmed that the Internet Archive’s DDoS fiasco has finished (that is, once everything has been back up for at least 1-2 weeks with no downtime at all), it would be worth scraping the entire IPA archive collection on the site and parsing everything slowly. This will be worth it, as “giant archives” do not contain everything. This will take lots of time for obvious reasons, but there are only about 44k items in the category. Do note that the Internet Archive has a built-in antivirus and proactively takes down malware shortly after its upload.
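For the "scrape the whole collection, slowly" part, here is a minimal sketch of listing item identifiers page by page with a pause between requests. It assumes the Internet Archive scrape endpoint (`/services/search/v1/scrape`) returns JSON with an `items` list and a `cursor` field while more results remain; treat that response shape as an assumption to verify. The query string is a placeholder, since the actual collection name is not given in this thread, and the function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

SCRAPE_URL = "https://archive.org/services/search/v1/scrape"


def fetch_page(query, cursor=None, count=1000):
    """Fetch one page of results (assumed response shape: {"items": [...], "cursor": ...})."""
    params = {"q": query, "fields": "identifier", "count": count}
    if cursor:
        params["cursor"] = cursor
    url = SCRAPE_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_identifiers(query, fetch=fetch_page, delay=1.0):
    """Yield identifiers page by page, sleeping between pages to stay gentle on the site."""
    cursor = None
    while True:
        page = fetch(query, cursor)
        for item in page.get("items", []):
            yield item["identifier"]
        cursor = page.get("cursor")
        if not cursor:
            break  # no cursor means this was the last page
        time.sleep(delay)
```

Something like `for ident in iter_identifiers("collection:<collection-name>"): ...` would then walk the listing, with `<collection-name>` replaced by the real collection. Listing roughly 44k identifiers is the cheap part; downloading and parsing each item is where the time goes, so the delay matters more in that later loop.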