Themis3000 / AO3-search-scraper

scrapes archiveofourown.org for fanfics given any search term
MIT License

How to install #1

Closed: bloodconfetti closed this issue 2 years ago

bloodconfetti commented 2 years ago

Sorry, but how do I install this?

Themis3000 commented 2 years ago

You need Python 3.6+ installed (it might work with older versions, but that's untested), then install the dependencies with pip install requests and pip install beautifulsoup4.

Then download the repository and the instructions should work from there. I haven't visited this project in a while though so let me know if that works for you or if I'm missing a step or anything.

There's no proper "install" step, really; it's just a script that needs the correct environment in order to run.
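
For reference, here's a minimal sanity check of that environment. It's just an illustrative snippet, not part of the repository; it only assumes the Python version and the two pip packages mentioned above:

```python
# Illustrative environment check, not part of the repository.
import sys

if sys.version_info < (3, 6):
    raise SystemExit("Python 3.6+ is recommended for this script")

try:
    import requests  # pip install requests
    import bs4       # pip install beautifulsoup4
except ImportError as exc:
    raise SystemExit(f"Missing dependency: {exc.name}")

print("Environment looks fine; run main.py from the repository folder.")
```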

bloodconfetti commented 2 years ago

Thank you so, so much! I appreciate your response. You may not want to assist further, as I'm really slow when it comes to tech stuff, but how exactly do I run main.py once it's downloaded and extracted from the ZIP file? (I have already pasted the search query link into page.txt.)

When I double-click main.py it just opens a black window, for clarification. Is there a command I'm supposed to use in Command Prompt?

Themis3000 commented 2 years ago

The window is probably immediately closing due to an error. I always hate when that happens, because the window closes before you get to read what it says. If we run the script through Command Prompt/PowerShell, it won't close the window after it crashes, so we can see what it actually says.

The way I usually run stuff is through PowerShell because it's easier to get it open in the correct directory. Open the folder that has the main.py file in it, then hold Shift + right click. In the right-click menu there should be an "Open PowerShell window here" option; click on that. Then type python .\main.py and that should run it. Let me know what the error says if there is one.

bloodconfetti commented 2 years ago

The PowerShell way worked great! (I think, haha.) It says it's scraping pages and is up to page 6, and it has paused, as per the rate limits I'm sure. Now I'm just not sure where ./downloads is located, for the PDFs.

Will python automatically start sending the requests for download again after the limit time on AO3 passes? Will it automatically skip over fics it's already downloaded if I close it and restart it at a later time?

Only answer what you feel up to. I understand you're probably very busy. Thanks for the input so far ^_^

Themis3000 commented 2 years ago

I just downloaded and tested the script myself, and it looks like you need to create a downloads folder in the same folder as main.py, otherwise the fics don't save correctly. You should create that folder, paste the original link back into the page.txt file, and start the script over again.
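
As an aside, a small guard like the one below would create the folder automatically when it's missing. This is only a sketch of the idea, not code from the repository:

```python
# Sketch: create a downloads folder next to main.py if it doesn't exist yet.
from pathlib import Path

downloads_dir = Path(__file__).resolve().parent / "downloads"
downloads_dir.mkdir(exist_ok=True)
```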

I didn't think anyone would find this repository and try to use it, so I never really left any good instructions, sorry about that! I'll update the instructions to make it clearer how to use this.

> Will python automatically start sending the requests for download again after the limit time on AO3 passes? Will it automatically skip over fics it's already downloaded if I close it and restart it at a later time?

It will! It takes some time though, I think around 5 minutes. If the script ever seems to be doing nothing, it's just waiting out that 5-minute period before downloading more.
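
Roughly, the waiting behavior is a retry loop like the sketch below. This is illustrative only and assumes AO3 signals rate limiting with an HTTP 429 response; the actual script may detect it differently:

```python
# Illustrative retry loop for a rate-limited page fetch.
import time
import requests

def fetch_page(url, wait_seconds=5 * 60):
    while True:
        response = requests.get(url)
        if response.status_code == 429:   # rate limited: wait and try again
            time.sleep(wait_seconds)
            continue
        response.raise_for_status()
        return response.text
```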

It will skip over fics it already downloaded if you close it and start it at another time! The page.txt file actually stores the link to the page it's currently on, so for example if you input a link that goes to page 1 of a search for "rabbit", scrape 6 pages, and open page.txt again, the link will be different and lead to page 6 of the search for "rabbit".
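
The resume behavior amounts to something like the sketch below. It assumes page.txt holds the URL of the current search page and that finished downloads sit in downloads/ named by work ID; those details are illustrative, not taken from the repository:

```python
# Sketch of resuming from page.txt and skipping already-saved works.
from pathlib import Path

PAGE_FILE = Path("page.txt")
DOWNLOADS = Path("downloads")

def current_page_url():
    return PAGE_FILE.read_text().strip()

def remember_page_url(url):
    PAGE_FILE.write_text(url)

def already_downloaded(work_id):
    return (DOWNLOADS / f"{work_id}.pdf").exists()
```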

I'm not very busy actually; I have quite a bit of time on my hands. I'm happy to help! Let me know if you have any issues or questions about the script in the future too!

bloodconfetti commented 2 years ago

Wow... now that I have the downloads folder set up and can see it working - this is the dream. I've been looking for this forever! I've had to spend a good chunk of money in the past (money I didn't really have tbh) to get worse results than this. Thank you so so much for creating this and - just truly I actually don't know how to thank you enough.

Themis3000 commented 2 years ago

No problem! Web scraping is one of my favorite things to do, so I thought as long as I'm making stuff like this in my free time I might as well share it. I remember being proud of this project because I parallelized the downloading, so stuff would download much quicker than one at a time. There are still speed improvements that could be made, I remember, but I got it pretty damn close to as fast as it can be.
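
The general technique is a worker pool along these lines. Purely illustrative; the repository's actual parallelization may differ:

```python
# Sketch: download several works at once with a thread pool.
from concurrent.futures import ThreadPoolExecutor
import requests

def download_one(url):
    response = requests.get(url)
    response.raise_for_status()
    return url, response.content

def download_many(urls, workers=8):
    # Yields (url, bytes) pairs as each download finishes its turn.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(download_one, urls)
```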

I'm glad someone else is getting use out of it! I didn't even have a purpose for the script when I created it haha, I kinda just downloaded a couple whole fandoms worth of fanfics and published all of them to archive.org in case someone needed them in the future and then never used the script again.

If you're interested in more AO3-related stuff, I'm launching an extension soon that automatically backs up fanfics you're reading; I'm just waiting for Chrome and Firefox to approve the extension for it to be published. I spent all night last night making it haha.

Web scraping is one of my favorite things, so if you ever have any questions or want help with anything web scraping related, feel free to ask me.

Feel free to email me any time at mail@themimegas.com. I tend to have a lot of free time these days, so don't feel bad about it.

bloodconfetti commented 2 years ago

Man, you're amazing. That's so thoughtful of you to share, just in case. This really made my day, maybe year! Most definitely year. I'll be so happy when I can look back on these fics a decade from now. With how fast the internet deteriorates I consider what you've done a humanitarian effort. Definitely something to be proud of :)

Have you shared this on Reddit at r/DataHoarder/? Some people have posted about AO3 scraping in the past and there were comments of interest, etc.

That extension sounds great! I think I've actually heard some mumblings about that? Or something akin to it anyways. It sounds so useful, and great for the community at large! I just wish extensions were safer privacy-wise, but then they couldn't work, so lol. I hope you get the approval soon. It'll be such a step forward.

If you want, I have an acquaintance who created something for LJ community scraping, and they do work collecting fanvids and stuff. If you're ever interested in working on either of those at all, I can find links for you. And if you ever get a Discord for archiving/scraping stuff like this, let me know.

I was gonna link you to my stuff on archive.org but it says I don't have any uploads, so lol... Do you share your archive.org link? (Totally get it if you keep it private because of the people who are against archiving. I have to be wary as well. Fandom is a scary place.)

Anyways thank you so much! I'm sorry I can't afford to donate right now, but if I get into a better place I may email you to see if you have a ko-fi or something set up.

Themis3000 commented 2 years ago

I haven't shared this anywhere but here yet, although I am active on r/DataHoarder (only as a lurker, I don't post or anything). If you wish to, you can share this wherever you want though.

Yeah, extensions kinda require a little bit of trust behind them, kinda like installing a program off the internet in many ways. Thankfully they're at least sandboxed, so they can't just do anything they want the way a program can without you granting them the permissions to do so. As I'm writing this I actually did just get approval for the Firefox extension, so that should be published just about now.

> I think I've actually heard some mumblings about that?

Well, I conceived the idea yesterday, learned how to make extensions, and made the whole thing last night, so if you've heard any mumbling about something similar, someone else is probably just working on a similar idea. You're actually the first person I've told about it.

The only flaw of it is that you can't really store files locally through an extension. I mean, it's possible, but they become accessible only through the extension, and there are limits and things that make it difficult. So instead, there's a server I'm running that will store all the backups. It's an ideal solution for someone who just forgets to download copies of stuff and one day sees their fanfic gone or something, but in the end it still depends on me running my servers. The server is extremely cheap, so I plan to run it for a very, very long time and back everything up properly though. I'm thinking of trying to get the extension to also store the backups locally in the future though.

The extension collects and retains no information about requests made at all, but that's not really something provable, so you'll have to just take my word for it if you are interested in it but have privacy concerns. If you're looking for cold local backups, it's not (at least not currently) the solution for that anyways.

I don't have much on archive.org at all, but here's my link if you're curious: https://archive.org/details/@themi_megas

I'd be interested in being involved with anything web scraping related! Do feel free to send links over to that sort of stuff.

Don't worry about donating or anything. I don't even have any system set up to accept donations at all anyways.

If you'd like to add me on Discord, I'm Cat Meow Meow#7380. I'm not currently in any servers about scraping. I used to be active in the Return YouTube Dislike server though, because I was one of the people helping archive dislike counts, but ever since the dislikes were removed from the YouTube API there aren't discussions about scraping any longer.

bloodconfetti commented 2 years ago

I'm a bit reddit shy atm but I may very well share it. It's gotten a bit mundane over there. All about external hard drives. Which is useful but still...

Wow, you are such a fast worker! Well yes I guess it must've been someone else working on something similar. No idea how or where they plan on storing it, or in what format. It's extra kind of you to donate server storage space! Really glad it's affordable for you. Even if locally stored files can only be accessible through the extension, I'm sure people would find it useful. Maybe even moreso. I don't think a ton of people actually care to dl things to their pc. They could easily do so with the PDF buttons if they wanted. I think your extension will provide something for the more mainstream/slightly more casual fan. Which I think means you're addressing a huge majority of people's needs.

I'm dling The Owl House fics. That's awesome! Such a good show that shouldn't have been cancelled.

Here's the LJ Archiver: https://vidder.gumroad.com/l/ljarchivr (it works via Excel; that's VidderKidder's specialty, and he also had one for 8tracks). He has a community built for vidders to upload to as well as the scraping, but if you contact them and want to share any fanvids you've scraped before, I'm sure they will be very welcome: https://vidders.net/ (Vidder keeps them on a cloud drive, I believe).

I'll go add you on Discord in case either of us ever finds a server that seems like it'd be of interest. Thanks so much for your work with the dislikes. I was a part of the 8tracks saving, and Yahoo Groups myself. Though 8tracks ended up being purchased and kept, and Yahoo Groups was never actually deleted, I think? But still lol.