Themis3000 / AO3-search-scraper

scrapes archiveofourown.org for fanfics given any search term
MIT License

Automatically halted on page 848/858 #2

Closed bloodconfetti closed 2 years ago

bloodconfetti commented 2 years ago

Hey :) Hope your day's been nice!

So I left the program running overnight and all day. Sometime within the last half hour it came up with 'halted on page 848', so I thought, oh, maybe it got tired of retrying a page or something. So I changed the search query to start at page 848 and ran it again. It said 'scraped page 1', then on the next line 'halted on page 1'. Then I tried the next page. Same thing. I think that means it's working as intended, technically, at least as far as the query goes.

I guess I'm wondering if you know why it may have halted prior to completion. And if there's a way to indicate for it to move through the rest of the pages if it halts prior to completion, and you want to move forward as before.

Sorry if I'm not expressing myself clearly. Anyways, it's still great, and it left me so few pages to work on that I obviously don't mind inputting the next page number. Just curious for future endeavors in case it were to leave more pages. And just to let you know in case you were curious about how things proceed on others' attempts. Lol sorry. Anyways, have a good evening!

Themis3000 commented 2 years ago

Hi!

The script isn't very polished in how it communicates with the user, so no matter what, it'll reset the page count when you restart the script. Internally it's still scraping the right page; it just resets the count it prints out. I always get bored of making stuff accessible from the user end, so I just called that a "feature telling you how many pages you've scraped in that session instead of how many you've scraped total" and wrapped up haha.
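To make the counter behaviour concrete, here is a minimal sketch of how the printed per-session count maps back to the real search page. The function name `absolute_page` is made up for illustration and isn't something from the script itself.

```python
def absolute_page(start_page: int, session_page: int) -> int:
    """Map the per-session counter that gets printed to the real search page.

    start_page:   the page number in the search URL the run started from
    session_page: the number printed in "scraped page N" / "halted on page N"
    """
    return start_page + session_page - 1


# A run started at page 848 that prints "scraped page 1" is really on page 848.
print(absolute_page(848, 1))
# A run started at page 1 that prints "halted on page 848" is also on page 848.
print(absolute_page(1, 848))
```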

If it stopped and said "halted on page x", that means it successfully scraped the current page but couldn't find the url for the next page. Usually that's because it's reached the end of the pages available, but in this case it sounds like it hasn't. Could you send me the link to the page it halted on so I could test it out?
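Roughly, the halt check amounts to something like the sketch below. This is not the script's actual code: it assumes the next-page link is an `<a rel="next">` element, and the two HTML snippets are hand-written stand-ins rather than real AO3 markup.

```python
from html.parser import HTMLParser
from typing import Optional


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a rel="next"> element it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.next_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_url is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated values, or be missing entirely
        if "next" in (attr.get("rel") or "").split():
            self.next_url = attr.get("href")


def find_next_url(html: str) -> Optional[str]:
    """Return the next-page url, or None when the page has no such link."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_url


# Hand-written stand-ins for a middle page and a last page (assumed markup).
middle_page = '<ol><li class="next"><a rel="next" href="/works?page=3">Next</a></li></ol>'
last_page = '<ol><li class="next"><span class="disabled">Next</span></li></ol>'

for name, html in [("middle", middle_page), ("last", last_page)]:
    url = find_next_url(html)
    print(name, "->", url if url else "halted: no next-page link found")
```

A page that the site treats as the last one has no such link, so the scraper has nothing to follow and stops there.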

bloodconfetti commented 2 years ago

https://archiveofourown.org/tags/IT%20(Movies%20-%20Muschietti)/works?commit=Sort+and+Filter&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=18&page=848&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bexcluded_tag_names%5D=&work_search%5Blanguage_id%5D=&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Bsort_column%5D=revised_at&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=

There's the link for the page it halted on :)

Mhm, absolutely no issue taken with the page number. I just was telling you that it was able to grab the fics on the next pages just fine, even though for whatever reason it halted.

Themis3000 commented 2 years ago

Thanks for sending it over, I was able to reproduce the issue and I'm trying to figure out a cause & fix for it now.

Themis3000 commented 2 years ago

Hmm, curiously, visiting that link on Archive of Our Own with js disabled shows 848 as its last available page. It shows only 16,960 works for that search query with js disabled, but 17,142 with js enabled. Something about having js enabled vs. not is causing an issue somehow.

bloodconfetti commented 2 years ago

Oh, strange! I wonder why. That's so cool you were able to figure that out though. Who would've thought? Lol, that's really interesting.

Themis3000 commented 2 years ago

Okay, after some digging it looks like some works don't appear when you're not signed in, which only showed up for me with js disabled. You can see all works when logged in. I think this can be fixed by adding some method of providing a login token to the script, so I'll work on that. It seems to me, though, that manually adjusting the url past 848 isn't actually making the script download anything new, and visiting those pages with js disabled shows a blank page. So until this is fixed you're missing about 182 works that (I think) you need to be signed into an account in order to view.
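The numbers line up if you work them through. A quick check, assuming 20 works per results page (that figure isn't stated anywhere in this thread; it's inferred from 16,960 works filling exactly 848 pages):

```python
import math

WORKS_PER_PAGE = 20  # assumption: inferred from 16,960 works / 848 pages

visible_without_login = 16_960  # count reported with js disabled
total_works = 17_142            # count reported with js enabled

# Works that only show up in the larger count
print(total_works - visible_without_login)            # 182

# Page counts implied by each total
print(math.ceil(visible_without_login / WORKS_PER_PAGE))  # 848
print(math.ceil(total_works / WORKS_PER_PAGE))            # 858
```

That matches the 848/858 in the issue title: the smaller listing ends at page 848, while the full listing runs to 858.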

bloodconfetti commented 2 years ago

Ohhhh shoot, okay! Well, if you ever get that working, I'll be happy to try it, of course. It's good to know. To clarify, are they locked fics, then?

Themis3000 commented 2 years ago

I haven't tested it, so I can't be 100% sure, but it seems most logical to me that it'd be only the locked ones. I could test whether it's only the locked fics if you send me a link to one. I don't know of any to test it on, but I have a test in mind to see if it does affect specifically the locked fics.

bloodconfetti commented 2 years ago

https://archiveofourown.org/works?work_search%5Bsort_column%5D=revised_at&work_search%5Bother_tag_names%5D=&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=18&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=restricted%3A+true&work_search%5Blanguage_id%5D=&commit=Sort+and+Filter&tag_id=IT+%28Movies+-+Muschietti%29

I don't think you even have to test it. There are exactly 182 fics that are restricted/locked. Sorry for asking a question I apparently easily could've checked myself! Oops

Themis3000 commented 2 years ago

Oh good eye, I didn't know you could filter by locked status at all. That confirms it!

I'm working on the login process right now so you can optionally provide a username and password and it can scrape locked stuff too. The process should be fairly straightforward, but there are a couple of extra steps I have to take because it seems they did something to intentionally make automating logging into an ao3 account more difficult.

Themis3000 commented 2 years ago

Credentials can now be supplied as described in the new readme.md file. Additionally, the script will now skip over already downloaded files. If you'd like to download all of the missing works from your first download pass, you may provide credentials and restart the script at page 1. It should skip downloading anything you already have!
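The skip logic works along these lines. This is a sketch of the general idea only, not the script's real code: the `<work_id>.html` file naming, the `download_missing` function and the `fetch` callback are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def download_missing(
    work_ids: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], bytes],
) -> List[str]:
    """Save each work that isn't on disk yet; return the ids actually fetched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for work_id in work_ids:
        target = out_dir / f"{work_id}.html"  # hypothetical naming scheme
        if target.exists():
            continue  # already saved on an earlier pass, so skip it
        target.write_bytes(fetch(work_id))
        downloaded.append(work_id)
    return downloaded


# Demo with a fake fetch function and a temporary folder (no network access).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "111.html").write_bytes(b"saved on the first pass")
    fetched = download_missing(["111", "222"], folder, lambda wid: wid.encode())
    print(fetched)
```

Restarting from page 1 therefore only costs the time to walk the listing again, since anything already saved is never requested a second time.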

bloodconfetti commented 2 years ago

That's amazing! Thank you for your hard work ^_^

I will try it shortly. I am reaching out to AO3 to ask if it would be any issue, but I doubt it. OTW is usually fairly open-minded about these things. I'll let you know their answer, if you're interested (on discord, so as not to bug you here).

Themis3000 commented 2 years ago

Sounds good, let me know what their response is. Once you do, I'll update the readme to say whether or not they've officially said it's okay to do.