Open DrSocket opened 1 year ago
> Used a private browser window to connect to twitter.com and this worked normally.
That's not relevant. Does the search work? That is, do you get results on https://twitter.com/search?q=ukraine&f=live? The search has different restrictions than the site in general.
> 3.11
Not a possible output of python3 --version
> snscrape 0.6.1.20230314
Not the current snscrape version.
> adding the full output report:
This is not a complete debug log. It's just the exception traceback, formatted in a strange way. See the README for instructions on obtaining a complete log.
Update: snscrape is now updated to the latest version, the Python version output is in the correct format, and the debug log was regenerated according to the README instructions. Tested https://twitter.com/search?q=ukraine&f=live through the proxy in a private browser window, not logged in to any Twitter account: it works as expected, with results appearing just as without the proxy.
That log was not generated with the latest snscrape version.
But I just found a proxy where the search works with curl but not with latest dev snscrape, so yeah, there is indeed some kind of issue here.
I just generated the log again, using snscrape 0.6.2.20230320 with `import logging; logging.basicConfig(level=logging.DEBUG)`.
Updated the original bug report.
Do you have control over the proxy server? If so, what's the proxy software and can you run snscrape from there directly for a test?
I don't have control over the proxy server, and it doesn't have many options; I pretty much just have the address, port, username, and password.
I ran tests like snscrape does, using https://google.com, https://api.ipify.org, and https://httpbin.org/ip; all work:

```python
import requests

session = requests.Session()
# `proxies` is the dict for the proxy under test

req = session.prepare_request(requests.Request('GET', <url>))
envset = session.merge_environment_settings(req.url, proxies, None, None, None)
resp = session.send(req, **envset).content.decode('utf-8')
```
I also tried running a Tor server locally; this is more easily reproducible. With Tor running, you can pass

```python
{'https': 'socks5h://localhost:9050'}
```

as `proxies`.
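For reference, a runnable sketch of wiring that Tor proxy into a scraper; the scraper call is commented out since it needs snscrape, a running Tor, and requests' SOCKS support (the `requests[socks]` extra), all of which are assumptions here:

```python
# socks5h:// (as opposed to socks5://) makes the proxy resolve DNS,
# which is how Tor is normally used.
proxies = {"https": "socks5h://localhost:9050"}

# Sketch of passing it to a scraper (needs snscrape and a running Tor):
# import snscrape.modules.twitter as sntwitter
# scraper = sntwitter.TwitterSearchScraper("ukraine", proxies=proxies)
# for tweet in scraper.get_items():
#     print(tweet)
```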
Yeah, I expected as much. I'll see if I can figure out a way to reproduce it with a simple self-hosted proxy server. Else debugging is going to be tricky. Tor is useless for this purpose because of its inherent variability of circuits etc. It also doesn't give any insight into the relevant details anyway.
Do you have recommendations for a self-hosted proxy for debugging? I'm trying mitmproxy now.
I don't have much experience with the type of proxy at play here, so no. mitmproxy is nice for other things, but I'm not sure it applies here. It establishes its own TLS connection rather than just passing through the client's (snscrape's) connection attempt, which could distort the results.
I'm getting the same error (`Errors: blocked (404), blocked (404), blocked (404), blocked (404)`) while running snscrape with a proxy. Example code:
```python
from snscrape.modules.twitter import TwitterUserScraper

proxies = {
    "http": "http://192.168.1.5:20102",
    "https": "http://192.168.1.5:20102",
}
u = TwitterUserScraper("ubuntu", proxies=proxies)
for i in u.get_items():
    print(i, type(i))
```
The proxy works fine and can be confirmed by:
```python
import requests

proxies = {
    "http": "http://192.168.1.5:20102",
    "https": "http://192.168.1.5:20102",
}
resp = requests.get("https://twitter.com", proxies=proxies)
print(resp.status_code)
# 200
```
Loading the Twitter homepage isn't the same as loading the HTTP endpoints, I believe. If you can load whatever snscrape can't in a web browser (with no cookies - try an incognito or private window), that's different.
I logged into the proxy server and ran the same code:
```python
from snscrape.modules.twitter import TwitterUserScraper

u = TwitterUserScraper("ubuntu")
for i in u.get_items():
    print(i, type(i))
```
The error didn't occur, so I don't think this is a problem with the IP.
@davuses Thank you, that's helpful. What proxy server software are you using?
I set up the proxy server with this tool: https://github.com/v2fly/v2ray-core
Can confirm it is not an IP issue: I used the same proxy for testing both TWINT and snscrape; snscrape gives the 404 errors, TWINT does not.
Sorry for posting this here, but how do you pass the Proxy-Authorization header on the command line?
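Not authoritative, but one common route: the snscrape CLI has no proxy flag, and since it uses requests underneath, credentials can be embedded in the proxy URL itself (requests derives the Proxy-Authorization header from them) and exported via the standard environment variables. A sketch with made-up credentials and address:

```python
from requests.utils import get_auth_from_url

# Hypothetical proxy URL with credentials embedded; for the CLI the same
# URL would go into the HTTPS_PROXY / HTTP_PROXY environment variables.
proxy_url = "http://user:password@192.168.1.5:20102"
proxies = {"http": proxy_url, "https": proxy_url}

# requests extracts the credentials from the URL and sends them as a
# basic-auth Proxy-Authorization header when connecting through the proxy.
print(get_auth_from_url(proxy_url))  # ('user', 'password')
```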
I had the same problem using TwitterProfileScraper, so I made the following changes to the code in the base.py and twitter.py files. Here they are in case they help anyone. I tested usage with and without a proxy on TwitterProfileScraper in Wireshark and can confirm requests are going through the proxy.
**Update the base class**

In the base.py file, update the `__init__` constructor to take a `proxy` argument and assign it to `self._proxy`:
```python
def __init__(self, *, retries=3, proxies=None, proxy=None):
    self._retries = retries
    self._proxies = proxies
    self._proxy = proxy
    self._session = requests.Session()
    self._session.mount("https://", _HTTPSAdapter())
```
Update the `_request` method so the merged environment settings use the proxy you provide:

```python
environmentSettings = self._session.merge_environment_settings(req.url, proxies, None, None, None)
# ADD THE NEXT IF CONDITION
if self._proxy:
    environmentSettings["proxies"] = {"http": self._proxy, "https": self._proxy}
```
**Update the _TwitterAPIScraper class**

Pass the `proxy` to the `super().__init__` call:
```python
class _TwitterAPIScraper(snscrape.base.Scraper):
    def __init__(self, baseUrl, *, guestTokenManager=None, maxEmptyPages=0, proxy=None, **kwargs):
        super().__init__(retries=3, proxies=None, proxy=proxy, **kwargs)
        self._baseUrl = baseUrl
        if guestTokenManager is None:
            global _globalGuestTokenManager
            if _globalGuestTokenManager is None:
                _globalGuestTokenManager = GuestTokenManager()
            guestTokenManager = _globalGuestTokenManager
        self._guestTokenManager = guestTokenManager
        self._maxEmptyPages = maxEmptyPages
        self._apiHeaders = {
            "User-Agent": None,
            "Authorization": _API_AUTHORIZATION_HEADER,
            "Referer": self._baseUrl,
            "Accept-Language": "en-US,en;q=0.5",
        }
        adapter = _TwitterTLSAdapter()
        self._session.mount("https://twitter.com", adapter)
        self._session.mount("https://api.twitter.com", adapter)
        self._set_random_user_agent()
```
**Update the TwitterSearchScraper**

Add `proxy` as an optional argument in the `__init__` method and pass it to the `super().__init__()` call:
```python
class TwitterSearchScraper(_TwitterAPIScraper):
    name = "twitter-search"

    def __init__(self, query, *, cursor=None, top=False, maxEmptyPages=20, proxy=None, **kwargs):
        if not query.strip():
            raise ValueError("empty query")
        # kwargs["maxEmptyPages"] = maxEmptyPages
        super().__init__(
            baseUrl="https://twitter.com/search?"
            + urllib.parse.urlencode({"f": "live", "lang": "en", "q": query, "src": "spelling_expansion_revert_click"}),
            maxEmptyPages=maxEmptyPages,
            proxy=proxy,
            **kwargs,
        )
        self._query = query  # Note: may get replaced by subclasses when using user ID resolution
        self._cursor = cursor
        self._top = top
```
**Update the TwitterUserScraper**

Add `proxy` as an optional argument in the `__init__` method and pass it to the `super().__init__()` call:
```python
class TwitterUserScraper(TwitterSearchScraper):
    name = "twitter-user"

    def __init__(self, user, *, proxy=None, **kwargs):
        self._isUserId = isinstance(user, int)
        if not self._isUserId and not self.is_valid_username(user):
            raise ValueError("Invalid username")
        super().__init__(f"from:{user}", proxy=proxy, **kwargs)
        self._user = user
        self._baseUrl = (
            f"https://twitter.com/{self._user}" if not self._isUserId else f"https://twitter.com/i/user/{self._user}"
        )
```
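With those changes applied, a hypothetical call site would look like this (sketch only; the scraper calls are commented out because they need the patched snscrape):

```python
# Note: unlike the existing `proxies` dict kwarg, the `proxy` argument in
# the patch above is a single URL string applied to both http and https.
proxy = "http://192.168.1.5:20102"

# from snscrape.modules.twitter import TwitterUserScraper
# scraper = TwitterUserScraper("ubuntu", proxy=proxy)
# for item in scraper.get_items():
#     print(item)
```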
@nadSTC None of those changes should be necessary. The `proxies` kwarg is already available on all scrapers and ends up in the `environmentSettings`. You just added a second, more complicated way of specifying the proxy.
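The mechanism behind the existing `proxies` kwarg can be seen with plain requests: an explicit dict passed to `merge_environment_settings` takes precedence over the environment, with no subclass changes needed. A minimal sketch (the proxy address is the one from the example earlier in the thread; the request itself is never sent):

```python
import requests

session = requests.Session()
proxies = {
    "http": "http://192.168.1.5:20102",
    "https": "http://192.168.1.5:20102",
}

req = session.prepare_request(requests.Request("GET", "https://twitter.com"))
# The same call snscrape's base Scraper makes with its `proxies` kwarg;
# the explicit dict wins over any proxy environment variables.
settings = session.merge_environment_settings(req.url, proxies, None, None, None)
print(settings["proxies"]["https"])  # http://192.168.1.5:20102
```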
Describe the bug
I'm getting a ScraperException when trying to use a proxy with snscrape on Twitter. I've tested that the proxy IP is not banned by Twitter by adding it to the network configuration as well as just the browser, and checked that the browser is using the proxy IP on https://httpbin.org/ip. I used a private browser window to connect to twitter.com, and this worked normally. So it doesn't seem to be a proxy issue or a Twitter blocking issue. I think it's something to do with how the Python requests library configures the cipher when sending the request through the proxy, but I don't really know.
How to reproduce
Example of code reproducing the error. The proxy should work on twitter.com in a private browser window without being logged in.
Expected behaviour
The same code without a proxy outputs the correct response.
Operating system
macOS 13.2.1
Python version: output of `python3 --version`
Python 3.11.0
snscrape version: output of `snscrape --version`
snscrape 0.6.2.20230320
Scraper
TwitterSearchScraper
How are you using snscrape?
`import snscrape.modules.twitter as sntwitter`
Log output
Adding the full output report (proxy address changed for privacy reasons):