Open thohug opened 1 year ago
That's the expected behavior when mining too fast. The message "Tor end node blocked. Last response: <Response [404]>" indicates that the respective node got blocked, which is likely to happen after a while. Make sure to work with a higher --wait_between_requests.
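The idea behind a higher wait time can be sketched as follows. This is a minimal illustration, not the code of fast-instagram-scraper: the function `fetch_throttled`, its `fetch` callable (returning a status code and a body), and the injectable `sleep` are all hypothetical names introduced here. It pauses between requests and stops at the first non-200 response, which the caller could treat as a blocked end node.

```python
import time


def fetch_throttled(urls, fetch, wait_between_requests=5.0, sleep=time.sleep):
    """Yield (url, body) pairs, pausing between requests.

    `fetch` is a hypothetical callable returning (status_code, body).
    Iteration stops at the first non-200 status, which the caller can
    interpret as "end node blocked, start a new session".
    """
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(wait_between_requests)
        status, body = fetch(url)
        if status != 200:
            return
        yield url, body
```

Passing `sleep` as a parameter only serves to make the pacing easy to check without actually waiting.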
Thanks for your quick reply. I understand that and tried different numbers. But if a circuit is built, there seems to be a problem with torpy? Or would you suggest also increasing Tor-Timeouts?
Initiating tor session 4
0it [00:00, ?it/s]Circuit built.
Start iteration 0: 2022-10-07 11:06:04.996308
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
WARNING:torpy.circuit:#8000000b circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Exception in thread RecvLoop_103.251:
Traceback (most recent call last):
File "C:\Users\...\anaconda3\envs\scrape\lib\threading.py", line 980, in _bootstrap_inner
self.run()
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 233, in run
callback(key.fileobj, mask)
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 220, in _do_recv
for cell in self._tor_socket.recv_cell_async():
File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 104, in recv_cell_async
more_data = self._socket.recv(TorCellSocket.RECV_BUFF_SIZE)
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1227, in recv
return self.read(buflen)
File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1102, in read
return self._sslobj.read(len)
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
Torsession terminated after 600 seconds tor_timeout.
I have seen this error before and until now only on Windows. This is indeed rather a problem related to torpy/SSL than fast-instagram-scraper.
If you already made sure to have the latest torpy version installed and used a virtual env, I would recommend switching to Ubuntu or if you're under Windows use WSL as the SSL error might be cumbersome to fix. There might be some conflicting SSL libraries or other hard to identify problems.
Let us know if it worked for you!
Just checked again. Instagram changed its API recently, so the logic needs slight refactoring first! Hence, it cannot work at the moment.
Thanks for checking it out. I didn't get to run it on WSL either, so I guess the API is the problem...
Any updates on this @do-me? Great work btw!
Thanks for asking @fmac2000 (also @thohug), there are indeed.
tl;dr: Mining is getting harder. TOR end points and even residential IPs get blocked fast (without login), and pagination now needs POST requests instead of GET requests.
Let me try to sum up the current status of the active Instagram APIs. Basically, there are two APIs running at the moment: the legacy API that I originally designed fast-instagram-scraper for, and the new one.
Legacy API example: https://instagram.com/graphql/query/?query_hash=ac38b90f0f3981c42092016a37c59bf7&variables={"id":"1020237355","first":50,"after":"2301822988561378864"}
On every page you would receive a cursor for pagination. In this example it's 2301822988561378864, which I retrieved from the previous page and inserted in the following GET request. That's what the very first version of fast-instagram-scraper did.
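A minimal sketch of how such a paginated GET URL can be assembled, using only the values visible in the example above. The function name `legacy_page_url` and its parameters are hypothetical; the endpoint, query hash, and variable names are copied from the example URL. How the cursor is read out of the response JSON is not shown in this thread, so that step is left out.

```python
import json
from urllib.parse import urlencode

# Values copied from the example URL in this thread.
LEGACY_ENDPOINT = "https://instagram.com/graphql/query/"
QUERY_HASH = "ac38b90f0f3981c42092016a37c59bf7"


def legacy_page_url(location_id, first=50, after=None):
    """Build the GET URL for one page; `after` is the cursor from the previous page."""
    variables = {"id": str(location_id), "first": first}
    if after is not None:
        variables["after"] = str(after)
    query = urlencode({
        "query_hash": QUERY_HASH,
        # Compact JSON, percent-encoded by urlencode.
        "variables": json.dumps(variables, separators=(",", ":")),
    })
    return f"{LEGACY_ENDPOINT}?{query}"
```

Calling `legacy_page_url("1020237355", after="2301822988561378864")` yields a percent-encoded form of the example URL; omitting `after` gives the URL for the first page.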
The legacy API is completely unchanged. You can still query stuff if you're lucky, but TOR end nodes are 99% blocked. Even residential IPs get blocked after only a few requests. So the only option here is to use commercial rotating residential IPs. If you google it, you will find tons of more or less shady/working/not working services offering them. If anyone needs a good recommendation, write me an email, as I eventually managed to find a good one.
New API example: https://instagram.com/explore/locations/1020237355/?__a=1&__d=dis&max_id=<cursor>
The good thing is that the new API offers plenty of new interesting nodes in the response JSON; great for research. Also (strangely) it does not block TOR end nodes. But here comes the catch: You can fire a GET request to get the first page but if you want to paginate you cannot do it with a GET request as you must include the respective headers with a bunch of tokens (e.g. XCSRF etc.). You get these tokens only by accessing the page in a browser that can execute JS to generate them (as far as I understood).
So theoretically, if you do so, copy the tokens, and wrap them in a POST request in Python, you're good to go. However, I am not sure at what point they eventually get blocked, but probably fast.
You could also go with a commercial service, as some offer to execute those requests in a real browser (and hence obtain the tokens needed for the POST headers) and afterwards do normal requests (that cost way less).
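The "copy the tokens and wrap them in a POST request" step could look roughly like the sketch below, using only the Python standard library. Everything specific is an assumption: the thread gives neither the pagination endpoint, nor the exact header names, nor the body fields, so the function `build_pagination_request` takes all of them as parameters, to be filled with whatever the browser's network tab shows. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_pagination_request(url, browser_headers, form_fields):
    """Build, but do not send, a POST request carrying browser-issued tokens.

    url             -- pagination endpoint (not given in this thread)
    browser_headers -- dict of headers copied from a real browser session
    form_fields     -- dict of body fields, e.g. the pagination cursor
    """
    body = urlencode(form_fields).encode("ascii")
    return Request(url, data=body, headers=dict(browser_headers), method="POST")
```

The resulting object could then be passed to `urllib.request.urlopen`. Note that this goes out directly rather than through TOR, and the tokens may expire or get blocked quickly, as mentioned above.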
Depending on your needs there are different ways to go:
Doesn't look too bright. Still, in the coming days I will update the script to work at least for every first location page of the new API. If someone already did, PRs are welcome.
Hope that clarifies the current situation. Let me know if you find out anything else!
I'm reopening the issue for everyone to see.
Update 11/2023: As torpy is currently unmaintained and needs refactoring due to TOR changes from V2 to V3, fast-instagram-scraper won't work.
Any ideas on this behaviour? There seem to be problems with torpy. Have you experienced this before, and do you have any idea how to solve it?
[...]