fake-name / xA-Scraper


Patreon scraper broken #102

Closed GeneralUltra758 closed 3 years ago

GeneralUltra758 commented 3 years ago

Getting the following when running the Patreon scraper:

Checking login!
Main.PatreonGet.StatusMgr - INFO - Getting list of favourite artists.
Traceback (most recent call last):
  File "/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py", line 274, in get_api_json
    post_type='application/json'
  File "/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py", line 304, in xhr_fetch
    ret = self._unpack_xhr_resp(ret)
  File "/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py", line 237, in _unpack_xhr_resp
    ret[entry['name']] = self.__decode_serialized_value(entry['value'])
  File "/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py", line 199, in __decode_serialized_value
    assert 'value' in value
AssertionError

Investigating right now to see if I can figure out an exact cause. Note: I added the following to main_scrape.py at line 56: runScraper(scraperClass, managedNamespace), and set it to run only the Patreon scraper, so that the scraper starts immediately after startup. (I could not see a way to do this natively. Maybe a good feature to add?)

GeneralUltra758 commented 3 years ago

further debug info:

IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: source /mnt/c/Users/GeneralUltra758/xA-Scraper/venv/bin/activate
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py in get_api_json(self, endpoint, postData, retries)
    272                                         post_data      = postData,
--> 273                                         post_type='application/json'
    274                                 )

/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py in xhr_fetch(self, url, headers, post_data, post_type)
    303                 ret = self.execute_javascript_function(js_script, [url, headers, post_data, post_type])
--> 304                 ret = self._unpack_xhr_resp(ret)
    305                 return ret

/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py in _unpack_xhr_resp(self, values)
    236                         assert entry['name'] not in ret
--> 237                         ret[entry['name']] = self.__decode_serialized_value(entry['value'])
    238 

/mnt/c/Users/GeneralUltra758/xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py in __decode_serialized_value(self, value)
    198                 assert 'type' in value
--> 199                 assert 'value' in value
    200 

AssertionError: 

During handling of the above exception, another exception occurred:

UnrecoverableFailureException             Traceback (most recent call last)
/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py in getNameList(self)
    784 
--> 785                         artist_lut = self.get_artist_lut()
    786                 except Exception as e:

/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py in get_artist_lut(self)
    767         def get_artist_lut(self):
--> 768                 general_meta = self.current_user_info()
    769                 campaign_items = [item for item in general_meta['included'] if item['type'] == "campaign"]

/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py in current_user_info(self)
    345         def current_user_info(self):
--> 346                 current = self.get_api_json("/current_user?include=pledges&include=follows")
    347                 return current

/mnt/c/Users/GeneralUltra758/xA-Scraper/xascraper/modules/patreon/patreonScrape.py in get_api_json(self, endpoint, postData, retries)
    278                         traceback.print_exc()
--> 279                         raise exceptions.UnrecoverableFailureException("Wat?")
    280 

UnrecoverableFailureException: Wat?

fake-name commented 3 years ago

and set to only run the patreon scraper to run the scraper immediately after startup (could not see a way to do this natively.

python3 -m manage fetch pat?

Note: the patreon scraper currently requires a full GUI session + headed chromium to work properly. Please feel free to complain to cloudflare if this is a problem.

GeneralUltra758 commented 3 years ago

I am aware that it uses full headed Chrome. There is no issue logging in: I can see the Chrome window on the Patreon home page, successfully logged in (using VcXsrv on WSL2).

python3 -m manage fetch pat?

was not aware of that. thanks for the tip

I tried debugging it in VSCode with WSL to see exactly what's going wrong. Somehow the call ending in post_type='application/json' in the Patreon scraper's get_api_json function now throws an assertion error, which it did not do when I tested it successfully the other day...

fake-name commented 3 years ago

Check your dependencies are up to date (pip install --upgrade -r requirements.txt).

I just checked locally, and https://github.com/fake-name/xA-Scraper/commit/7edb46a1e8f562c3631bc12b25c5eac7cb5aa56d was blowing up due to some recent changes upstream in my libraries. If you weren't getting crashes, you have at least one out-of-date library.

GeneralUltra758 commented 3 years ago

I had commented out lines 60-72 to remove the requirement for a paid anticaptcha service, since the scraper requires full headed Chrome anyway.

I've now checked the stack trace again and found that the assert 'value' in value on line 199 of xA-Scraper/venv/lib/python3.7/site-packages/ChromeController/manager.py was causing the crash. Commenting it out seems to resolve the issue, but I have no idea what caused it in the first place. Reinstalling deps hasn't changed anything.
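
For illustration only, here is a minimal sketch of a decoder that tolerates a missing 'value' key instead of asserting. It assumes the serialized entries resemble Chrome DevTools Protocol RemoteObject dicts, where 'type' is always present but 'value' is optional (it is absent for undefined, for example). The function names decode_serialized_value and unpack_xhr_resp are hypothetical stand-ins, not ChromeController's actual implementation.

```python
def decode_serialized_value(value):
    """Decode one serialized value dict of the form {'type': ..., 'value': ...}.

    Hypothetical stand-in for the private decoder seen in the traceback;
    the real ChromeController code may behave differently.
    """
    if "type" not in value:
        raise ValueError(f"serialized value has no 'type': {value!r}")

    if "value" not in value:
        # Assumption: a JavaScript `undefined` arrives without a 'value' key,
        # so map it to None rather than failing.
        if value["type"] == "undefined":
            return None
        # Anything else without a 'value' is unexpected; fail with context
        # instead of a bare AssertionError.
        raise ValueError(f"no 'value' for type {value['type']!r}: {value!r}")

    return value["value"]


def unpack_xhr_resp(entries):
    """Turn a list of {'name': ..., 'value': {...}} entries into a plain dict."""
    ret = {}
    for entry in entries:
        name = entry["name"]
        if name in ret:
            raise ValueError(f"duplicate entry name: {name!r}")
        ret[name] = decode_serialized_value(entry["value"])
    return ret


if __name__ == "__main__":
    sample = [
        {"name": "code", "value": {"type": "number", "value": 200}},
        {"name": "response", "value": {"type": "undefined"}},
    ]
    print(unpack_xhr_resp(sample))
```

Compared with commenting out the assert, a descriptive exception at least reports which field came back without a value, which may help narrow down whether the browser or the library changed.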

GeneralUltra758 commented 3 years ago

Worth noting: I've been changing some things around in patreonScrape.py to implement an additional feature (for which I'll make a PR later), but those changes tested OK just the other day (before I submitted the dependency issue, which I've fixed locally). Now, all of a sudden, this assertion error happens for seemingly no reason.

GeneralUltra758 commented 3 years ago

Correction regarding the requirements update: pulling the latest requirements.txt seems to have fixed it. No idea how I wasn't getting that error before with the previous version of requirements.txt plus the locally added newer webrequest version as per issue #101.

fake-name commented 3 years ago

You shouldn't need to bother with https://github.com/fake-name/xA-Scraper/issues/101 at this point, as I now depend on up-to-date (0.0.78) webrequest directly.
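
A rough way to confirm that an installed library meets a minimum version is sketched below. It is an illustration, not part of xA-Scraper: the helper names are hypothetical, the distribution name "webrequest" is assumed to match what pip reports, the parser only handles plain dotted integers such as 0.0.78, and importlib.metadata needs Python 3.8 or newer (the traceback above shows 3.7).

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.0.78' into (0, 0, 78). Only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed, minimum):
    """True when the installed version sorts strictly below the minimum."""
    return parse_version(installed) < parse_version(minimum)


def check_minimums(minimums, lookup=metadata.version):
    """Return {package: installed_version_or_None} for every package that is
    missing or below its minimum. `lookup` is injectable so the check can be
    exercised without touching the real environment."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems[name] = None  # not installed at all
            continue
        if is_outdated(installed, minimum):
            problems[name] = installed
    return problems


if __name__ == "__main__":
    # An empty dict means every listed package meets its minimum.
    print(check_minimums({"webrequest": "0.0.78"}))
```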

The errors in ChromeController are weird. In general, piloting remote chrome is kind of brittle, and can be sensitive to a bunch of other system crap (did you apt-get/yum/w-e update? Or run chrome from google, and it updated itself?).

The web is actively becoming a shittier place. It's depressing.

GeneralUltra758 commented 3 years ago

I did install Chrome via a .deb file, so it could be that it updated itself. Reasons why I prefer JS and Puppeteer for web scraper stuff.

fake-name commented 3 years ago

I'd prefer people to not use JS (and google to not update multiple times a day), but you do what you can.

Basically, I depend on a number of components that need to be updated in lockstep, and chrome's fixation on self-updating is problematic. If you leave it alone it can get out of date and explode messily.

reasons why i prefer js and puppeteer for web scraper stuff

Unfortunately, I find JS itself to be a thoroughly unpleasant language to actually write stuff in.