fake-name / xA-Scraper


Odd issue with FA scraper cookie storage #98

Open mmsterful opened 4 years ago

mmsterful commented 4 years ago

I'm trying to run an FA scrape, after doing a git pull (and subsequently re-making my settings.py to get it working again), and getting this:

(venv)  ✘ username@Monolith  /mnt/c/Users/username/xA-Scraper   master ●  python3 -m manage fetch fa
Setting up loggers....
done
Setup
initialized manager
fetch args ['fa'] <class 'list'>
ScraperBase Init
Starting up
Main.WebRequest - INFO - Using global chromium tab pool
Starting up?
Creating pool
INFO: Creating engine for process! Engine name: 'MainProcess-MainThread'
Main.WebRequest - INFO - Fetching content at URL: http://www.furaffinity.net/controls/user-settings/
Main.WebRequest - INFO - Request for URL: http://www.furaffinity.net/controls/user-settings/ succeeded at Wed Jul 15 00:05:46 2020 On Attempt 1. Recieving...
Main.WebRequest - INFO - URL fully retrieved.
Main.WebRequest - INFO - Compression type = gzip. Content Size compressed = 7.966K. Decompressed = 27.251K. File type: text/html; charset=UTF-8.
Main.FaGet.StatusMgr - WARNING - Not logged in!
Main.FaGet.StatusMgr - INFO - Do not have login cookie. Retreiving one now.
Main.WebRequest - INFO - Fetching content at URL: http://www.furaffinity.net/controls/user-settings/
Main.WebRequest - INFO - Request for URL: http://www.furaffinity.net/controls/user-settings/ succeeded at Wed Jul 15 00:05:46 2020 On Attempt 1. Recieving...
Main.WebRequest - INFO - URL fully retrieved.
Main.WebRequest - INFO - Compression type = gzip. Content Size compressed = 7.966K. Decompressed = 27.251K. File type: text/html; charset=UTF-8.
Main.FaGet.StatusMgr - WARNING - Not logged in!
Main.FaGet.StatusMgr - ERROR - No captcha solver configured (or no solver with a non-zero balance)! Cannot continue!
Main - CRITICAL - Uncaught exception!
Main - CRITICAL - Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 112, in <module>
    go()
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 105, in go
    two_arg_go(sys.argv[1], sys.argv[2])
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 52, in two_arg_go
    scrape_manage.do_fetch([param])
  File "/mnt/c/Users/username/xA-Scraper/manage/scrape_manage.py", line 73, in do_fetch
    do_plugin(plgname)
  File "/mnt/c/Users/username/xA-Scraper/manage/scrape_manage.py", line 47, in do_plugin
    plg.runScraper(namespace)
  File "/mnt/c/Users/username/xA-Scraper/xascraper/modules/scraper_base.py", line 831, in runScraper
    instance.go(ctrlNamespace=managedNamespace)
  File "/mnt/c/Users/username/xA-Scraper/xascraper/modules/scraper_base.py", line 791, in go
    cookieStatus, msg = self.getCookie()
ValueError: too many values to unpack (expected 2)

Unfortunately, my skills aren't good enough for me to figure out what exactly is going on here with WebRequest and the cookies file. It's a valid LWP file; I tried updating it with the "a" and "b" cookies, to no avail. The "manual FA login" option on the web interface no longer seems to work; it looks like they removed the old secondary captcha.

Manually bypassing the cookie check by making it return True lets it scrape, but it reported possible missing art with 946 expected and 624 retrieved from the first artist, so I don't think it's logged in.

fake-name commented 4 years ago

The critical part is:

Main.FaGet.StatusMgr - ERROR - No captcha solver configured (or no solver with a non-zero balance)! Cannot continue!

The exception is a bug in the cookie failure return value. Basically, the failure handling had a bug, but since the failure isn't recoverable it wound up not causing any additional problems.
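
For illustration, here is a minimal sketch of how this class of error arises. The function names and return values are hypothetical and are not the actual xA-Scraper code; the only detail taken from the traceback is that the caller unpacks exactly two values (`cookieStatus, msg = self.getCookie()`).

```python
def get_cookie_buggy():
    # Hypothetical failure path that returns three values instead of two.
    return False, "Login Failed", "no captcha solver"


def get_cookie_fixed():
    # Every path returns the same (status, message) shape.
    return False, "Login Failed: no captcha solver"


def unpack_error(func):
    """Return the ValueError message raised by two-value unpacking, or None."""
    try:
        cookie_status, msg = func()
    except ValueError as exc:
        return str(exc)
    return None
```

Keeping every return path of such a function to the same tuple shape avoids the crash, though the underlying login failure would still stop the scrape.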

FA cannot log in automatically, since they use a captcha. You have to use the web interface to solve the captcha yourself (or use a captcha solving service).

The manual login stuff is kind of creaky, so I strongly suggest using a service (I like anti-captcha.com). It's a bit of an affair to get money into the account, but $5-10 of credit should last nearly forever.

> and subsequently re-making my settings.py to get it working again

Wow, how long since you last pulled? I don't think I've changed the settings file recently.

mmsterful commented 4 years ago

It's been quite some time, before the repo went down for a while. It may have been my fault that it stopped working.

That last commit fixes this problem, but now it's throwing an exception:

Main.FaGet.StatusMgr - INFO - Login attempt status = False (Login Failed).
Main - CRITICAL - Uncaught exception!
Main - CRITICAL - Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 112, in <module>
    go()
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 105, in go
    two_arg_go(sys.argv[1], sys.argv[2])
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 52, in two_arg_go
    scrape_manage.do_fetch([param])
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/scrape_manage.py", line 73, in do_fetch
    do_plugin(plgname)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/scrape_manage.py", line 47, in do_plugin
    plg.runScraper(namespace)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/xascraper/modules/scraper_base.py", line 831, in runScraper
    instance.go(ctrlNamespace=managedNamespace)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/xascraper/modules/scraper_base.py", line 793, in go
    assert cookieStatus, "Login failed! Cannot continue!"
AssertionError: Login failed! Cannot continue!

It's odd it doesn't think it has a valid cookie; the cookies.lwp file in the project base directory is valid and should allow it to log in. (Unless it's storing the actual values somewhere else I didn't know about.)

fake-name commented 4 years ago

The cookies file being valid just means that the scraper exited correctly the last time it executed. Whether the relevant cookie in particular is in the cookies file is the issue, and in this case apparently it's not.
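
One way to check this by hand is to load the file and list which of the required cookies are missing. This is a rough sketch, assuming the file loads with Python's standard `http.cookiejar.LWPCookieJar` and that the cookies are stored under a domain containing `furaffinity.net`; the helper name `missing_cookies` is made up for this example and is not part of xA-Scraper.

```python
import http.cookiejar


def missing_cookies(path, required=("a", "b"), domain_part="furaffinity.net"):
    """Return the required cookie names that are absent from an LWP cookie file."""
    jar = http.cookiejar.LWPCookieJar()
    # Keep session and expired cookies so everything in the file is visible.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    present = {c.name for c in jar if domain_part in c.domain}
    return [name for name in required if name not in present]
```

Calling `missing_cookies("cookies.lwp")` returns an empty list when both cookies are in the file. That only shows the cookies are stored, not that the site still accepts them.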

I just tested, and it appears the captcha handling is currently broken. It probably stopped working when FA did their site redesign, and I missed this fact because I had a valid auth cookie when doing the tests (derp).

Additionally, the auth procedure now appears to require a Google reCAPTCHA, so I don't think I'll be able to support the manual circumvention when I fix the problem.

fake-name commented 4 years ago

Sidenote: DA is also broken ATM. I haven't had time to poke things recently.

mmsterful commented 4 years ago

Oh, what I meant was - I logged in on a browser and transplanted the cookie info there into xA-Scraper's cookies file. It worked the last time I tried it, whenever that was.

fake-name commented 4 years ago

Ah. Well, you need two cookies, a and b. Did you get both?

mmsterful commented 4 years ago

Yes, plus the __cfduid one.

fake-name commented 4 years ago

That's strange. It should at least pass the login check if you do that.

This login check was written way, way long ago. Before that, I was just looking at cookies, rather than querying the website and checking whether I can find your username on a home page path.

mmsterful commented 4 years ago

I took another look at this, since my FA scraper still doesn't work. It looks like line 41 of faScrape.py loads http://www.furaffinity.net/controls/user-settings/ and looks for:

<a id="my-username" class="top-heading hideonmobile" href="

but the string actually present in the page when logged in is:

<a id="my-username" href="
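
A check that doesn't depend on the exact attribute order or class list might be less brittle than matching a literal string. The following is only a sketch using the standard-library HTML parser; the class and function names are hypothetical and this is not how faScrape.py is written. It assumes that an `<a>` tag with `id="my-username"` appears only when logged in.

```python
from html.parser import HTMLParser


class _UsernameLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose id is "my-username"."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("id") == "my-username" and self.href is None:
            self.href = attr_map.get("href", "")


def is_logged_in(page_html):
    """Return True if the page contains the logged-in username link."""
    finder = _UsernameLinkFinder()
    finder.feed(page_html)
    return finder.href is not None
```

Both variants of the tag quoted above would pass this check, whether or not the `class` attribute is present.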

Changing the string makes it run the scrape without complaining, but it's indicating "artist seems to have disabled their account" for a lot of accounts that exist. I'm not sure whether it's actually logged in using the valid cookies in cookies.lwp or not.

My FA account is set to use the classic theme, so if all the screen-scraping is made with the old theme in mind, it might be that it's not actually logged in and is trying to scrape pages with the modern theme. I'm not sure.

The captcha-handling stuff for FA can probably be removed, as FA no longer appears to use a captcha.

Edit: I figured out how to turn on debug logging; it seems to be using the cookies correctly, but gives no indication why it's raising an AccountDisabledException.

Edit again: I added a log statement. It looks like the submission count extraction code at line 285 in faScrape.py may be failing; at least, the exception raise statement immediately below it is what's being triggered.