Problems with Torpy? - Githubissues

thohug commented 1 year ago

Any Ideas on this behaviour? There seem to be problems with torpy, have you experienced this before and any idea how to solve it?

[...]

Initiating tor session 233
                  Circuit built.
Start iteration 0: 2022-10-06 14:40:11.079573
Tor end node blocked. Last response: <Response [404]>
0it [01:16, ?it/s]
Initiating tor session 234
                  Circuit built.
Start iteration 0: 2022-10-06 14:41:28.718957
Tor end node blocked. Last response: <Response [404]>
0it [00:07, ?it/s]
Initiating tor session 235
                  Circuit built.
Start iteration 0: 2022-10-06 14:41:37.347591
ERROR:torpy.cell_socket:_ssl.c:1112: The handshake operation timed out
ERROR:root:[ignored]
Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 63, in connect
    self._socket.connect((self._router.ip, self._router.or_port))
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1343, in connect
    self._real_connect(addr, False)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1334, in _real_connect
    self.do_handshake()
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1310, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1112: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\utils.py", line 79, in newfn
    return func(*args, **kwargs)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 183, in newfn
    return func(*args, **kwargs)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 426, in get_descriptor
    with self._get_dir_client() as dir_client:
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 375, in _get_dir_client
    self._dir_guard, self._dir_circuit = self._create_dir_circuit(purpose='Internal dir client')
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 365, in _create_dir_circuit
    guard = TorGuard(router, purpose=purpose)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\guard.py", line 66, in __init__
    self.__tor_socket.connect()
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 69, in connect
    raise TorSocketConnectError(e)
torpy.cell_socket.TorSocketConnectError: _ssl.c:1112: The handshake operation timed out
WARNING:torpy.utils:Retry with another router...
0it [00:31, ?it/s]
'graphql'
Initiating tor session 236
                  Circuit built.
Start iteration 0: 2022-10-06 14:42:09.078514
Tor end node blocked. Last response: <Response [404]>
0it [00:06, ?it/s]
Initiating tor session 237
                  Circuit built.
Start iteration 0: 2022-10-06 14:42:16.572684
WARNING:torpy.circuit:#80000242 circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Tor end node blocked. Last response: <Response [404]>
0it [00:52, ?it/s]
Initiating tor session 238

do-me commented 1 year ago

That's the expected behavior when mining too fast. The message "Tor end node blocked. Last response: <Response [404]>" indicates that the respective node got blocked, which is likely to happen after a while. Make sure to work with a higher --wait_between_requests.
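To illustrate the idea of slowing down, here is a minimal Python sketch of a randomized pause between requests. It is not how fast-instagram-scraper implements --wait_between_requests; the function name `pause_before_next_request` and the `jitter_fraction` parameter are hypothetical and only serve the example.

```python
import random
import time


def pause_before_next_request(base_seconds, jitter_fraction=0.5, sleep=time.sleep):
    """Wait base_seconds plus a random extra of up to jitter_fraction * base_seconds.

    Hypothetical helper, not part of fast-instagram-scraper or torpy.
    Returns the delay that was used so the caller can log it.
    """
    if base_seconds < 0:
        raise ValueError("base_seconds must be non-negative")
    # The random extra keeps requests from arriving at a perfectly regular rhythm.
    delay = base_seconds + random.uniform(0, jitter_fraction * base_seconds)
    sleep(delay)
    return delay
```

Calling `pause_before_next_request(10)` before each request would wait between 10 and 15 seconds. Whether that is slow enough to avoid blocks is not something this sketch can promise.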

thohug commented 1 year ago

Thanks for your quick reply. I understand that and tried different numbers. But if a circuit is built, there seems to be a problem with torpy? Or would you suggest also increasing Tor-Timeouts?

Initiating tor session 4
0it [00:00, ?it/s]Circuit built.
Start iteration 0: 2022-10-07 11:06:04.996308
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
WARNING:torpy.circuit:#8000000b circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Exception in thread RecvLoop_103.251:
Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 233, in run
    callback(key.fileobj, mask)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 220, in _do_recv
    for cell in self._tor_socket.recv_cell_async():
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 104, in recv_cell_async
    more_data = self._socket.recv(TorCellSocket.RECV_BUFF_SIZE)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1227, in recv
    return self.read(buflen)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1102, in read
    return self._sslobj.read(len)
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
Torsession terminated after 600 seconds tor_timeout.

do-me commented 1 year ago

I have seen this error before, and so far only on Windows. This is indeed a problem with torpy/SSL rather than with fast-instagram-scraper.

If you have already made sure that the latest torpy version is installed and that you are using a virtual env, I would recommend switching to Ubuntu or, if you're on Windows, using WSL, as the SSL error might be cumbersome to fix. There might be conflicting SSL libraries or other hard-to-identify problems.
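A quick way to see which torpy version the active environment actually picks up is the standard-library `importlib.metadata` module. This is only a small sketch; the helper name `installed_version` is made up for the example, and comparing the result against the latest release on PyPI is left to you.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run this inside the environment you scrape from (e.g. the conda env).
    print("torpy:", installed_version("torpy"))
```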

Let us know if it worked for you!

do-me commented 1 year ago

Just checked again. Instagram changed its API recently, so the logic needs slight refactoring first! Hence, it cannot work at the moment.

thohug commented 1 year ago

Thanks for checking it out. I didn't get it to run on WSL either, so I guess the API is the problem.


fmac2000 commented 1 year ago

Any updates on this @do-me? Great work btw!

do-me commented 1 year ago

Thanks for asking @fmac2000 (also @thohug), there are indeed.

tl;dr: Mining is getting harder. TOR end points and even residential IPs get blocked fast (without login), and pagination now needs POST requests instead of GET requests.

Let me try to sum up the current status of the active Instagram APIs. Basically there are two APIs running at the moment: the legacy API that I originally designed fast-instagram-scraper for, and the new one.

Legacy API

Example: https://instagram.com/graphql/query/?query_hash=ac38b90f0f3981c42092016a37c59bf7&variables={"id":"1020237355","first":50,"after":"2301822988561378864"}

On every page you receive a cursor for pagination. In this example it's 2301822988561378864, which I retrieved from the previous page and inserted into the following GET request. That's what the very first version of fast-instagram-scraper did.
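As a rough sketch of the cursor mechanism, the following Python builds such a legacy URL from its parts. It only constructs the URL and sends nothing. The function name `build_legacy_url` and its parameter names are hypothetical, and treating "id" as a location ID is an assumption based on the example above. How to read the next cursor out of the response is left out because the response layout is not shown here.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
LEGACY_ENDPOINT = "https://instagram.com/graphql/query/"


def build_legacy_url(query_hash, location_id, first=50, after=None):
    """Build a legacy GET URL; pass the previous page's cursor as `after`.

    Hypothetical helper: it does not perform any request.
    """
    variables = {"id": str(location_id), "first": first}
    if after is not None:
        # The cursor from the previous page selects the next page.
        variables["after"] = str(after)
    query = urlencode({
        "query_hash": query_hash,
        # The variables are sent as a compact JSON string, URL-encoded.
        "variables": json.dumps(variables, separators=(",", ":")),
    })
    return f"{LEGACY_ENDPOINT}?{query}"
```

For example, `build_legacy_url("ac38b90f0f3981c42092016a37c59bf7", "1020237355", after="2301822988561378864")` yields a percent-encoded form of the example URL. Leaving out `after` gives the URL for the first page.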

The legacy API is completely unchanged. You can still query stuff if you're lucky, but TOR end nodes are 99% blocked. Even residential IPs get blocked after only a few requests. So the only option here is to use commercial rotating residential IPs. If you google it you will find tons of more or less shady/working/not working services offering these. If anyone needs a good recommendation, write me an email, as I eventually managed to find a good one.

New API

Example: https://instagram.com/explore/locations/1020237355/?__a=1&__d=dis&max_id=<cursor>

The good thing is that the new API offers plenty of new interesting nodes in the response JSON; great for research. Also (strangely) it does not block TOR end nodes. But here comes the catch: you can fire a GET request to get the first page, but if you want to paginate you cannot do it with a GET request, as you must include the respective headers with a bunch of tokens (e.g. XCSRF etc.). You get these tokens only by accessing the page in a browser that can execute JS to generate them (as far as I understood).

So theoretically, if you do so, copy the tokens and wrap them in a POST request in Python, you're good to go. However, I am not sure at what point they eventually get blocked, but probably fast.
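A minimal, untested sketch of what "wrap them in a POST request" could look like with the standard library follows. Several things here are assumptions, not confirmed behavior: that the POST goes to the same URL as the example above, that the cursor stays in the max_id query parameter, and that the body can be empty. The function name `build_new_api_request` is hypothetical, and the header names and values must be copied from your own browser session; none are hard-coded because the exact set is not listed here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_new_api_request(location_id, cursor, browser_headers):
    """Prepare (but do not send) a POST request carrying browser-copied headers.

    Hypothetical helper. browser_headers is a dict of header name -> value
    copied from the browser's developer tools.
    """
    params = urlencode({"__a": 1, "__d": "dis", "max_id": cursor})
    # Assumption: same path as the GET example, with the cursor in max_id.
    url = f"https://instagram.com/explore/locations/{location_id}/?{params}"
    # Assumption: an empty body is enough; the tokens travel in the headers.
    return Request(url, data=b"", headers=dict(browser_headers), method="POST")
```

The prepared request could then be sent with `urllib.request.urlopen(request)`, or the same URL and headers passed to whatever HTTP client the scraper uses. Expect it to stop working once the tokens expire.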

You could also go with a commercial service, as some offer to execute those requests in a real browser (and hence obtain the tokens needed for the POST headers) and afterwards do normal requests (which cost way less).

Advice

Depending on your needs there are different ways to go:

Quick and simple but costly: commercial rotating residential IPs + legacy API's GET request pagination
Free but only the first page per location: fast-instagram-scraper + new API (good for "broad" mining)
Cumbersome but free: copy tokens from your browser + POST requests to the new API in Python (a modified version of fast-instagram-scraper would do)
Optimized commercial version: first request with JS execution, subsequent requests without it until the tokens expire.

Future of fast-instagram-scraper

Doesn't look too bright. Still, in the coming days I will update the script to work at least for the first page of every location with the new API. If someone already did, PRs are welcome.

Hope that clarifies the current situation. Let me know if you find out anything else!

I'm reopening the issue for everyone to see.

do-me commented 8 months ago

Update 11/2023: as torpy is currently unmaintained and needs refactoring due to TOR changes from V2 to V3, fast-instagram-scraper won't work.

do-me / fast-instagram-scraper

Problems with Torpy? #4
