deepio opened this issue 4 years ago
Hi @deepio , my apologies for the delay in response here.
cc'ing @jacobtomlinson, when you're back from the holidays could you weigh in here?
How are you creating the SOCKS proxy? I believe you should be able to forward the remote port to a somewhat arbitrary local port.
@quasiben I do dynamic application-level port forwarding using ssh (`ssh -D`). I know the tunnel works because I've also been using it to do many other things in private IP address space.
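For example, a quick sanity check of the tunnel from Python looks something like this (a minimal sketch; the local port and the private address are placeholders, not my real ones):

```python
import socks  # PySocks

# Open a socket through the local end of the ssh -D tunnel (placeholder port)
# and connect to a hypothetical host in private address space.
s = socks.socksocket()
s.set_proxy(socks.SOCKS5, "127.0.0.1", 1080)
s.connect(("10.0.0.5", 8786))
s.close()
print("tunnel OK")
```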
I'm not sure I 100% understand your use case.
Where is your scheduler running? How many things are you trying to expose?
I currently have 5 dask projects, with only 1 actually running.
Thanks for the diagram, that's very useful. You might want to check out dask-gateway as a potential solution for this. It handles cluster management and proxying.
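For reference, connecting through a gateway looks roughly like this (a minimal sketch; the gateway address is a placeholder):

```python
from dask_gateway import Gateway

# Connect to the gateway server, then launch a cluster behind it.
# All scheduler/worker traffic is routed through the gateway's proxy,
# so only the gateway itself needs to be reachable.
gateway = Gateway("http://gateway.example.com")
cluster = gateway.new_cluster()
cluster.scale(2)
client = cluster.get_client()
```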
I'm not sure I understand how your Dask workers access your data if they are across the internet. Or do you mean worker as in a person working?
That is the issue I'm dealing with. When the data can be made public, I can just set up an entry in nginx to route traffic from the workers. (No, a worker is not an individual person.)
```python
...
async def f():
    async with Worker(ngrok_address, interface="en1") as w:
        await w.finished()

asyncio.get_event_loop().run_until_complete(f())
...
```
```python
async def run_main():
    async with Client(ngrok_address, asynchronous=True) as client:
        for filename in filelist:
            future = client.submit(main, filename, pure=False)
            result = await future
            with open(f"{filename}.zip", "wb") as f:
                for chunk in result:
                    f.write(chunk)
            print(f"[+] Finished: {filename}")

asyncio.get_event_loop().run_until_complete(run_main())
```
```
dask-scheduler
```
If you still think dask-gateway is the way to go, I'll start reading that documentation.
Ok sure. So your workers are across the internet from your data and scheduler. I assume there are reasons why your workers cannot be in the same place as the data.
I'm still not sure I fully understand how your workers are accessing the data though.
@quasiben @mrocklin do you have any thoughts on this? I've never seen a setup like this, perhaps you have?
How I access the data is nothing special:
```python
def main(filename):
    ...
    r = requests.get(f"{public_nginx_domain_for_project}/{filename}")
    with open(fullpath, "wb") as f:
        for chunk in r:
            f.write(chunk)
    ...
    # Run the process on the file.
    # Return a zip file with all the other files the process has generated,
    # because only the scheduler+client machine has the storage space for
    # the output results.
    # Destroy all files to reset the worker and wait until the next loop starts.
```
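The zip step the comments describe is nothing fancy either; roughly this (a simplified sketch with illustrative names, not my actual code):

```python
import io
import os
import zipfile

def zip_outputs(output_dir, chunk_size=2 ** 20):
    # Bundle everything the process generated into an in-memory zip.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in os.listdir(output_dir):
            zf.write(os.path.join(output_dir, name), arcname=name)
    data = buf.getvalue()
    # Return it as a list of byte chunks so the client-side
    # "for chunk in result" loop can stream them straight to disk.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
```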
When the data isn't public and I have another ngrok tunnel available, I'll use a different ngrok tunnel (you can only get so many of those) to get the files. But if dask could use proxies, I could set up a SOCKS proxy, and then I would only need the one ngrok tunnel for all my connections in all my projects (public or private data included).
```python
# This is currently implemented in requests
# (SOCKS support in requests needs PySocks, i.e. the requests[socks] extra):
import requests

proxies = {"https": f"socks5://{ngrok_tunnel}"}
r = requests.get("http://example.org", proxies=proxies)

# If I could do this, it would solve all my problems:
proxies = {"https": f"socks5://{ngrok_tunnel}"}
async with Client(project_1_local_address, asynchronous=True, proxies=proxies) as c:
    ...
# And
async with Worker(project_1_local_address, interface="en1", proxies=proxies) as w:
    ...
```
Or is there a way to monkeypatch all dask connections to run through a SOCKS proxy?
```python
import socket

import socks  # PySocks

# Route every new socket through the SOCKS5 tunnel.
socks.set_default_proxy(socks.SOCKS5, ngrok_address)
socket.socket = socks.socksocket

# Then do the dask stuff; connections opened after the patch use the proxy.
```
Is there a reason why you couldn't run your scheduler in the same location as your workers? Then access the data via the ngrok tunnel?
TL;DR: The big reason is that I've already reached the maximum number of ngrok tunnels I can have in total, but there are other smaller issues that also make this less ideal.
- I used to use ngrok to poke more holes but now I've reached the limit of how many ngrok tunnels I can have at the same time too.
This is still the case ^^^ (quoted from the first post in this thread; I know it was a while ago now, so it could easily have been forgotten).
The primary network the workers run on is prone to frequent floor power outages/updates (too frequent for a 7-month uninterrupted task). The power issue can be mitigated for the scheduler by resuming from where it left off, but the workers are not necessarily all on the same network either. In my testing, when the scheduler went down but the workers didn't, the workers could not reconnect once the scheduler came back online. So if the scheduler goes offline, I need to take all the workers offline and restart them.
I've packed the workers into Docker images, Vagrant images, VMs, etc., depending on the requirements of each project, to avoid installation and setup issues. This lets other labs help process the files faster if they can spare the compute time, without having to worry about installing or configuring anything. It's even possible that a lab can't connect to the scheduler without getting an ngrok tunnel of its own. Finally, ngrok tunnel addresses change when the process restarts, unless I pay for reserved domains.
Ok sure. Thanks for bearing with me and explaining in depth.
Your use case sounds a little unusual, but I agree that a SOCKS proxy would mitigate some of your issues. This would probably be a reasonable amount of work.
In the meantime, perhaps an alternative to ngrok like serveo may be more suitable. It has fewer limitations and you can set a permanent address.
What a great service (serveo), thank you for bringing it to my attention. It's too bad that it is currently disabled; hopefully I'll see it in action in a few days though.
Dask has implemented many things I would otherwise need to program myself; I'm quite thankful that it exists! I'm not sure where to begin looking in the codebase, but I'd be glad to help with implementing SOCKS support.
My scheduler is not reachable from the public web; I actually use a SOCKS5 proxy to reach it. The reason is that I'm limited in the number of public IPs I can have at one time. To perform my task I'm using `dask.distributed.Client`. Using `socks4` or `socks5` with the `dask.distributed.Client` context manager did not work, or maybe I'm doing something wrong? Perhaps this is a dumb question, but I did not find information about this in the documentation: https://distributed.dask.org/en/latest/local-cluster.html and https://distributed.dask.org/en/latest/client.html

Is there a way to add a SOCKS proxy?