fake-name / xA-Scraper


TwitGet Bad Request 400 #100

Open God-damnit-all opened 4 years ago

God-damnit-all commented 4 years ago

I'm getting this problem using the old method, the one that doesn't involve a headless browser. It started just a few hours ago. At first I thought it was IP-based, like I'd hit some sort of request limit, but not only did it not go away when I threw up a VPN, it also seems to be erroring out on the Twitter profile page itself, before it even gets to the step for the search API JSON.

This strikes me as very odd and makes me wonder if the error that is being thrown by xA-Scraper is even accurate.

Main.TwitGet.StatusMgr - INFO - GetArtist - veyopixel (ID: 437)
Main.WebRequest - INFO - Fetching content at URL: https://twitter.com/veyopixel
Main.WebRequest - INFO - Have additional GET parameters!
Main.WebRequest - INFO -        Item: 'Accept' -> 'application/json, text/javascript, */*; q=0.01'
Main.WebRequest - INFO -        Item: 'Referer' -> 'https://twitter.com/veyopixel'
Main.WebRequest - INFO -        Item: 'User-Agent' -> 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
Main.WebRequest - INFO -        Item: 'X-Twitter-Active-User' -> 'yes'
Main.WebRequest - INFO -        Item: 'X-Requested-With' -> 'XMLHttpRequest'
Main.WebRequest - INFO -        Item: 'Accept-Language' -> 'en-US'
Main.WebRequest - WARNING - Error opening page: https://twitter.com/veyopixel at Tue Aug 11 18:30:39 2020 On Attempt 1.
Main.WebRequest - WARNING - Error Code: HTTP Error 400: Bad Request
Main.WebRequest - WARNING - Original URL: https://twitter.com/veyopixel
Main.WebRequest - INFO - Have additional GET parameters!
Main.WebRequest - INFO -        Item: 'Accept' -> 'application/json, text/javascript, */*; q=0.01'
Main.WebRequest - INFO -        Item: 'Referer' -> 'https://twitter.com/veyopixel'
Main.WebRequest - INFO -        Item: 'User-Agent' -> 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
Main.WebRequest - INFO -        Item: 'X-Twitter-Active-User' -> 'yes'
Main.WebRequest - INFO -        Item: 'X-Requested-With' -> 'XMLHttpRequest'
Main.WebRequest - INFO -        Item: 'Accept-Language' -> 'en-US'
Main.WebRequest - ERROR - Failed to retrieve Website : https://twitter.com/veyopixel at Tue Aug 11 18:30:53 2020 All Attempts Exhausted
Main.WebRequest - CRITICAL - Critical Failure to retrieve page! https://twitter.com/veyopixel at Tue Aug 11 18:30:53 2020, attempt 2
Main.WebRequest - CRITICAL - Error:
Main.WebRequest - CRITICAL - Exiting
Main.TwitGet.StatusMgr - ERROR - Traceback (most recent call last):
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\twitScrape.py", line 256, in go
Main.TwitGet.StatusMgr - ERROR -     errored |= self.getArtist(aid=aid, artist=name, ctrlNamespace=ctrlNamespace)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\twitScrape.py", line 206, in getArtist
Main.TwitGet.StatusMgr - ERROR -     for tweet in intf.get_all_tweets(artist, min_date):
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 281, in get_all_tweets
Main.TwitGet.StatusMgr - ERROR -     interval_start = self.get_joined_date(username, twit_headers)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 149, in get_joined_date
Main.TwitGet.StatusMgr - ERROR -     ctnt = self.stateful_get("https://twitter.com/{user}".format(user=user), headers=twit_headers)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 22, in stateful_get
Main.TwitGet.StatusMgr - ERROR -     return self.__stateful_get("getpage", url, headers, params)
Main.TwitGet.StatusMgr - ERROR -   File "D:\xA-Scraper\xascraper\modules\twit\vendored_twitter_scrape.py", line 54, in __stateful_get
Main.TwitGet.StatusMgr - ERROR -     page = func(url, addlHeaders=headers)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 195, in getpage
Main.TwitGet.StatusMgr - ERROR -     return self._unwaf_func("_getpage", requestedUrl, *args, **kwargs)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 160, in _unwaf_func
Main.TwitGet.StatusMgr - ERROR -     return target_func(requestedUrl, *args, **kwargs)
Main.TwitGet.StatusMgr - ERROR -   File "C:\Python38\lib\site-packages\WebRequest\WebRequestClass.py", line 658, in _getpage
Main.TwitGet.StatusMgr - ERROR -     raise Exceptions.FetchFailureError("Failed to retreive page", requestedUrl,
Main.TwitGet.StatusMgr - ERROR - WebRequest.Exceptions.FetchFailureError: <FetchFailureError 400 -> 'Bad Request' for url: https://twitter.com/veyopixel ({b''})>
Main.TwitGet.StatusMgr - ERROR -

God-damnit-all commented 4 years ago

Shit. I just realized what the problem is... or at least I think this is what's wrong. They fucked with the joined date, again.

image

The relevant code, which is now broken...

    def get_joined_date(self, user, twit_headers):

        ctnt = self.stateful_get("https://twitter.com/{user}".format(user=user), headers=twit_headers)
        html = HTML(html=ctnt)
        joined_items = html.find(".ProfileHeaderCard-joinDateText")
        if not joined_items:
            raise exceptions.AccountDisabledException("Could not retreive artist joined date. "
                "This usually means the account has been disabled!")

        assert len(joined_items) == 1, "Too many joined items?"
        joined = joined_items[0]

        posttime = dateparser.parse(joined.attrs['title'])

        self.log.info("User %s joined twitter at %s", user, posttime)

        return posttime

.ProfileHeaderCard-joinDateText no longer exists, and now one would have to look up the text within div[data-testid="UserProfileHeader_Items"] > span, but I'm not entirely sure how to look up attributes other than class and id with this Python library.
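
For what it's worth, `div[data-testid="..."] > span` is ordinary CSS attribute-selector syntax, so if the `HTML` class here is the one from requests-html (an assumption, the thread doesn't say), passing that string to `find()` may simply work; that is untested here. As a dependency-free illustration of matching on an attribute other than class or id, here is a minimal standard-library sketch. The class name `AttrSpanCollector`, the helper `find_joined_text`, and the sample markup are all hypothetical, and the real page may be structured differently.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; ignored so the depth counter stays right.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AttrSpanCollector(HTMLParser):
    """Collect the text of <span> elements that are direct children of the
    first-level <div> whose data-testid attribute equals `testid`."""

    def __init__(self, testid):
        super().__init__()
        self.testid = testid
        self.depth = 0          # 0 = outside the target div, 1 = the div itself
        self.span_depth = None  # depth of the currently open direct-child span
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Attribute match: compare data-testid instead of class or id.
            if tag == "div" and dict(attrs).get("data-testid") == self.testid:
                self.depth = 1
            return
        self.depth += 1
        if tag == "span" and self.depth == 2 and self.span_depth is None:
            self.span_depth = self.depth
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if self.span_depth == self.depth:
            self.span_depth = None
        self.depth -= 1

    def handle_data(self, data):
        # Text nested inside the span (e.g. in <b>) is appended as well.
        if self.span_depth is not None:
            self.spans[-1] += data


def find_joined_text(page_html, testid="UserProfileHeader_Items"):
    """Return the first collected span text starting with 'Joined', else None."""
    parser = AttrSpanCollector(testid)
    parser.feed(page_html)
    parser.close()
    for text in parser.spans:
        if text.strip().startswith("Joined"):
            return text.strip()
    return None


# Hypothetical markup, only for illustration.
SAMPLE = (
    '<div data-testid="UserProfileHeader_Items">'
    "<span>Somewhere</span><span>Joined <b>August</b> 2020</span>"
    "</div><span>Joined elsewhere</span>"
)
```

With the sample above, `find_joined_text(SAMPLE)` picks the span inside the matching div and ignores the look-alike span outside it. The sketch assumes reasonably well-formed markup.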

I don't understand why this is throwing '400 Bad Request' instead of 'Could not retreive artist joined date.', however. Either more than one thing is wrong, or it's just not tripping if not joined_items for some reason.
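
One reading of the traceback above: the exception is raised inside the `stateful_get` call (line 149 of `get_joined_date`), so execution never reaches the `if not joined_items` check at all. A minimal sketch of that control flow, using hypothetical stand-ins (`FetchFailure`, `AccountDisabled`, `fetch`) rather than the real WebRequest and xA-Scraper classes:

```python
class FetchFailure(Exception):
    """Hypothetical stand-in for WebRequest's FetchFailureError."""


class AccountDisabled(Exception):
    """Hypothetical stand-in for exceptions.AccountDisabledException."""


def fetch(url):
    # Stand-in for stateful_get(): the server answers 400, so the fetch
    # itself raises and no HTML is ever returned.
    raise FetchFailure("400 Bad Request for " + url)


def get_joined_date(user):
    ctnt = fetch("https://twitter.com/" + user)  # the exception leaves from here
    # Nothing below runs when fetch() raises.
    joined_items = [ctnt] if "joinDateText" in ctnt else []
    if not joined_items:
        raise AccountDisabled("Could not retrieve artist joined date.")
    return joined_items[0]


def which_error(user):
    """Report which of the two exceptions actually surfaces."""
    try:
        get_joined_date(user)
    except (FetchFailure, AccountDisabled) as exc:
        return type(exc).__name__
    return None
```

Here `which_error("veyopixel")` reports the fetch failure, not the account-disabled error, which matches the log: the selector check only matters once a page has actually been downloaded.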

fake-name commented 4 years ago

Dammit, I hate minified/obfuscated CSS.

fake-name commented 4 years ago

The reason you're seeing the 400 error is probably because they added more UA/header sniffing, which is catching that WebRequest isn't acting exactly like a browser.

More and more I'm considering trying to create a library around either the Firefox or Chromium HTTP(S) client code.

God-damnit-all commented 4 years ago

Not sniffing as it turns out, but now it's actually directly checking to see if JavaScript was loaded and blocking you if not.

<script nonce="ZmY4Y2NjZGUtNjZkMi00ZTY4LWIyZWEtMWE0ZDM1YmE2MDg4">
  if (!window.__SCRIPTS_LOADED__['main']) {
    document.getElementById('ScriptLoadFailure').style.display = 'block';
  }
</script>

God-damnit-all commented 4 years ago

This does look like something I'd have to switch to your headless browser method for, though I have a feeling that even if I did, updates would be necessary, since the way to obtain the join date changed so drastically.

God-damnit-all commented 4 years ago

Temporary workaround.

    def get_joined_date(self, user, twit_headers):

        ctnt = self.stateful_get("https://nitter.net/{user}".format(user=user), headers=twit_headers)
        html = HTML(html=ctnt)
        joined_items = html.find(".profile-joindate > span > div")
        if not joined_items:
            raise exceptions.AccountDisabledException("Could not retreive artist joined date. "
                "This usually means the account has been disabled!")

        assert len(joined_items) == 1, "Too many joined items?"
        joined = joined_items[0]

        posttime = dateparser.parse(joined.text.replace("Joined ",""))

        self.log.info("User %s joined twitter at %s", user, posttime)

        return posttime
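
Two caveats on the workaround as written: `dateparser.parse` returns `None` rather than raising when it cannot parse its input, and `assert` statements are skipped when Python runs with `-O`. A stricter variant of the parsing step could look like the sketch below. It assumes the element text looks like "Joined August 2020", which is a guess about the markup and not something the thread confirms; `parse_joined_text` and its `fmt` parameter are hypothetical names, and `%B` expects English month names under the default locale.

```python
from datetime import datetime


def parse_joined_text(text, fmt="%B %Y"):
    """Strip a leading 'Joined ' and parse the rest with an explicit format.

    Raises ValueError if the remaining text does not match `fmt`, so a
    markup change fails loudly instead of passing None downstream.
    """
    cleaned = text.strip()
    prefix = "Joined "
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):]
    return datetime.strptime(cleaned, fmt)
```

If the site shows a different date format, only `fmt` needs to change; a format mismatch surfaces immediately as a `ValueError`.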

God bless nitter.net