rvencu opened this issue 3 years ago
Hi, I'm currently working on integrating this into server update v2.2.0. From that update, there will be 3 types of workers:
However, the Stage 1 -> Stage 2 transition is something that still needs work, and may prevent this update from working. The Stage 1 workers somehow need to host the images for the Stage 2 workers to use.
> So I am looking at how to pass a job from one crawlingathome client to another one.
This is actually currently achievable, although rather hacky. I will probably add improved support for this to my to-do list.
Take a look here. You'd want to make a duplicate of this class, but change the __init__
function to the following:
```python
def __init__(self, url, token, shard, start_id, end_id, shard_piece):
    self.s = session()  # requests.session instance
    self.url = url
    self.token = token
    self.shard = shard
    self.start_id = start_id  # by default usually class np.int64
    self.end_id = end_id  # ^^^
    self.shard_piece = shard_piece
```
You can find all these variables using your old client, e.g. `client.url` will return the URL required for this `__init__`.
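For example, the hand-off could look roughly like this (an untested sketch; `StageTwoClient` stands for the duplicated class described above and is not part of the library):

```python
# Collect the identifiers from the existing client that holds the job.
job_state = {
    "url": client.url,
    "token": client.token,
    "shard": client.shard,
    "start_id": int(client.start_id),  # np.int64 -> int so it JSON-encodes cleanly
    "end_id": int(client.end_id),
    "shard_piece": client.shard_piece,
}

# ...ship `job_state` to the other machine however you like (JSON over HTTP,
# a shared file, a queue, ...), then rebuild the worker there:
resumed = StageTwoClient(**job_state)
```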
You can then interact using the normal functions, acting as the original worker. Please note, however, that workers have a 90 minute timeout. If this is an issue for you, perhaps you could run `client.log("Waiting for Stage Two")` on all your workers every 30 minutes until the job is picked up on your GPU?
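A minimal sketch of that keep-alive idea, assuming `client.log()` can safely be called repeatedly and using a flag you set yourself once the GPU takes over:

```python
import threading
import time

def keep_alive(client, interval=30 * 60):
    # Ping the server well inside the 90-minute timeout until stage 2 starts.
    # `stage_two_started` is an attribute you set yourself; it is not part of
    # the crawling@home client.
    def _loop():
        while not getattr(client, "stage_two_started", False):
            client.log("Waiting for Stage Two")
            time.sleep(interval)
    threading.Thread(target=_loop, daemon=True).start()
```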
Hopefully this should work for the time being. Please let me know if you encounter any issues!
I am doing a variation of this here https://github.com/rvencu/crawlingathome-worker/tree/dev (this is completely untested)
I start from the GPU side and want to integrate the worker setup into the main script; this way I can control a swarm of workers that start jobs and transfer the workload to the GPU. I transfer the results back, and the workers complete the communication with your server.
I am using a file semaphore mechanism to signal when the images have been downloaded, and the GPU polls the workers for completion.
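Roughly, the semaphore works like this (a sketch only; paths and names are placeholders, not the actual repo layout):

```python
import os
import time

DONE_MARKER = "semaphore.done"  # assumed marker file name

# On the scraper, once the shard's images are downloaded and zipped:
def mark_done(workdir):
    open(os.path.join(workdir, DONE_MARKER), "w").close()

# On the GPU box, polling a synced/mounted copy of each worker directory:
def wait_for_shard(workdir, poll_seconds=30):
    while not os.path.exists(os.path.join(workdir, DONE_MARKER)):
        time.sleep(poll_seconds)
    # the shard is ready: pull it over and run inference
```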
This is for someone who wants to maximise the use of their GPU during a given timeframe and is also willing to pay for the workers.
I estimate that a 500 euro contribution for workers can take me as far as 5% of the entire project in some 40 days. Of course, 40 days of 90% GPU workload at home comes on top of that.
When I am done with that I will use your info to create a version where workers and GPUs cooperate, but unfortunately, for them to communicate we need to disclose their IP addresses. If not, then a centralized hub still needs to be used.
Sounds great! Let me know if you'd like me to help make a server that can create this CPU -> GPU worker communication :)
Yes please.
I envision 3 types of clients (maybe make 3 clients)
there are some considerations to make re. renaming the zip files with random names (save them in job properties) so they cannot be altered by malicious users, etc.
Definitely what I envisioned too. I believe some sort of overlay server should be best for this, as I'd like the main server to be kept as simple as possible.
Here's my idea:
This is partly why I added features that can help with this:
```python
dump_json = cah.dump(client)
client = cah.load(**dump_json)
```
The server can also keep the clients alive using the client.log
function, whilst a client waits for its job to be filled by the GPU worker.
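So a CPU -> GPU hand-off could be sketched like this (the overlay endpoints below are made up for illustration, and `crawlingathome` is assumed to be the import name):

```python
import requests
import crawlingathome as cah  # assumed import name

# -- CPU scraper: finish stage 1, then hand the client state off --
dump_json = cah.dump(client)
requests.post("https://overlay.example/jobs", json=dump_json)  # hypothetical endpoint

# -- GPU worker: pick the state up and resume the very same job --
dump_json = requests.get("https://overlay.example/jobs/next").json()  # hypothetical endpoint
client = cah.load(**dump_json)
```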
I'll try to have a demo server of this, along with a new git repo, up by the weekend. I can't say for certain, however, as we are currently in the process of moving the main crawling@home server, but it seems to be going smoothly so far 🙂
Sorry - misclick! 😅
What I have against this overlay server is that it should not be GPU-bound. We need an independent overlay server, because people will pop in and out with Colab GPUs and Kaggle GPUs at any time; they are not reliable enough to keep scrapers alive when they can suffer their own sudden death.
If you do not want the CAH server to integrate this, then we should write a separate server for it. But it seems more logical for this to be a CAH server function.
Contributors will keep their own nodes alive. We will provide them with a specific cloud-init file that automates everything; they just keep the nodes running as long as they are willing to pay the fee. So the job of managing scrapers is not there anymore, hence my idea that the CAH server already does that: it registers workers and expires them, and the extension would just go further and register GPU workers and expire them as well.
Fair enough. The more I think about it, the more it appears the central server can do everything that's needed, and more, compared to overlay servers.
Here's what I'm thinking now:
What do you think about this idea? :)
Yes, basically this is it. The challenge would be that if we have too many scrapers and too few GPUs, the payload on the temp sharing site can expire. In such cases, a GPU node trying to access it should report back to the CAH server to unmark the shard for stage 1.
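For example (purely hypothetical; the CAH server does not expose such an endpoint yet, so the names below are made up):

```python
import requests

def report_expired_payload(server_url, token, shard):
    # Tell the server the stage-1 payload is gone, so the shard can be
    # handed back out to scrapers.
    requests.post(
        f"{server_url}/api/unmarkShard",  # hypothetical endpoint
        json={"token": token, "shard": shard},
        timeout=30,
    )
```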
A more resilient storage solution would minimise the potential waste of scraper work.
We need a mechanism to communicate what we are missing most. If we are short on scrapers because we have too many GPUs, then we register the GPU as a scraper as well; this is easy. The other way around is complicated...
Yeah. As long as it's a URL that can download a ZIP/TAR, it should be fine - so there's room for customisation there.
No promises, but I'll try to attempt to build a system that can do this :)
I was looking for file transfer sites with good speed and retention. It is hard to reconcile cost, speed, disk space and reliability. WeTransfer has retired its public API. Dropbox has a good API and reliability, and a Plus plan gives 2 TB of storage (or about 4000 shards completed after stage 1).
But looking at the scaling aspects, a lot of potential problems arise, with Dropbox or really anyone else, if we want to use thousands of scrapers at once.
A good read here: https://developers.dropbox.com/dbx-performance-guide
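If we went with Dropbox, a 500 MB shard would have to go through an upload session, since single-call uploads are capped at about 150 MB. A rough, untested sketch with the official Python SDK (token and paths are placeholders):

```python
import os
import dropbox

CHUNK = 8 * 1024 * 1024  # 8 MB per request

def upload_shard(token, local_path, remote_path):
    dbx = dropbox.Dropbox(token)
    size = os.path.getsize(local_path)
    with open(local_path, "rb") as f:
        if size <= CHUNK:
            # small files can go through the simple endpoint
            dbx.files_upload(f.read(), remote_path)
            return
        session = dbx.files_upload_session_start(f.read(CHUNK))
        cursor = dropbox.files.UploadSessionCursor(
            session_id=session.session_id, offset=f.tell()
        )
        commit = dropbox.files.CommitInfo(path=remote_path)
        while size - f.tell() > CHUNK:
            dbx.files_upload_session_append_v2(f.read(CHUNK), cursor)
            cursor.offset = f.tell()
        # the final chunk closes the session and commits the file
        dbx.files_upload_session_finish(f.read(CHUNK), cursor, commit)
```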
- has an API; does not require an account to upload and download, only to delete
- 10,000 zip files in principle (though with no account, every worker can upload up to 10,000 files); files are kept for 30 days
- not sure about speed capping, needs testing
- cons: a captcha to fill in at download, or 49 USD per year for an account
Just a thought, could we set up an SFTP server for the CPU & GPU clients to communicate? And if security isn't needed we could just use FTP.
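For example, with paramiko the hand-off could be as simple as this (host, credentials and paths are placeholders):

```python
import paramiko

def push_shard(host, user, password, local_path, remote_path):
    # CPU worker uploads the finished shard to the shared SFTP box.
    transport = paramiko.Transport((host, 22))
    transport.connect(username=user, password=password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        sftp.put(local_path, remote_path)
    finally:
        sftp.close()
        transport.close()
```

The GPU side would do the mirror image with `sftp.get(remote_path, local_path)`.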
@rvencu https://tmp.ninja/ seems okay? We might want to consider emailing devs first just to ensure they have the resources ready, but theoretically, any file hosting service should work as long as we have a direct download URL the workers can use to pull the data (even gdrive has this).
@ARKseal It might be better just to integrate it directly into the main server though - also helps with monitoring workers etc.
I agree, but beware of the file size: 500 MB per shard as the result of stage 1.
Untested yet, but very simple; the server just needs to store/handle these ids (I just made up this class):
```python
import re
import requests


class Pixeldrain():
    def __init__(self, api_key=""):
        self.session = requests.Session()
        self.key = api_key
        if self.key:
            # basic auth with an empty username and the API key as the password
            self.session.auth = ("", self.key)
        self.url = "https://pixeldrain.com/api/file/"

    def post_file(self, file):
        # upload as multipart/form-data; the API answers with the file id
        with open(file, "rb") as f:
            files = [("file", (file, f, "application/octet-stream"))]
            response = self.session.post(self.url, files=files)
        info = response.json()
        return info["id"]

    def get_file(self, id):
        # stream the download so a 500 MB shard is not held in memory
        response = self.session.get(self.url + id, stream=True)
        d = response.headers["content-disposition"]
        fname = re.findall("filename=(.+)", d)[0].strip('"')
        with open(fname, "wb") as file:
            for chunk in response.iter_content(chunk_size=1 << 20):
                file.write(chunk)
        return fname
```
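Usage would then look roughly like this (file name made up):

```python
pd = Pixeldrain(api_key="")               # anonymous upload
file_id = pd.post_file("shard_00001.zip")
# ...hand file_id to the GPU worker (e.g. via the job properties on the server)...
pd.get_file(file_id)                      # GPU side pulls the shard back down
```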
I tried to upload a 400 MB file and it crashed. I read that requests does not handle the HTTP 100 Continue that perhaps happens for such a big file.
Did you stream your upload?
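e.g. with something like requests-toolbelt's MultipartEncoder, which reads the file in chunks instead of buffering it all in memory (untested against pixeldrain; the URL is the upload endpoint from the class above):

```python
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

def streamed_upload(path, url="https://pixeldrain.com/api/file/"):
    # the encoder streams the multipart body straight from disk
    m = MultipartEncoder(
        fields={"file": (path, open(path, "rb"), "application/octet-stream")}
    )
    r = requests.post(url, data=m, headers={"Content-Type": m.content_type})
    r.raise_for_status()
    return r.json()
```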
Yes, it was streamed. I put a monitor on it and watched the upload; it crashes as soon as the uploaded bytes exceed 250 MB.
At this point I think this website is in trouble; perhaps it cannot sustain the storage costs and it is not properly maintained...
I guess we still need to look at something like Dropbox for reliability.
Except, of course, if the new server also comes with proper storage space...
Yeah, I suppose. As long as we can have a direct download URL, it should be fine.
We could theoretically use the server; however, I think it'd be a much better solution to avoid storing files on the server, so as not to slow down the workers.
As of now we have Dropbox as our only possible storage host, and for the plus plan we need to pay $10 usd, not sure if it's cheaper for you. But then again we could use GDrive.
The thing is, the server only has 100 GB of storage, and I don't really want to congest the server with files if it can't handle the number of requests. I'm going to look into setting up a server built specifically for this crawling@home image hosting, hopefully using a Google Cloud trial again.
The protocol should first attempt peer connectivity between the scraper and the GPU worker; only if that fails should it fall back to saving the files on the transfer server...
A great idea. I wonder how we could pull this off without having to forward ports?
look here https://github.com/magic-wormhole/magic-wormhole
Some of the workers will be behind NAT, but the scrapers are not, so maybe it will work easily.
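At its simplest we could just shell out to the wormhole CLI from both sides (a sketch; the one-time code still has to be exchanged through the server or some other channel):

```python
import subprocess

# -- on the scraper: prints a one-time code such as "7-crossover-clockwork",
#    which has to be passed to the GPU side (e.g. via the job record) --
subprocess.run(["wormhole", "send", "shard_00001.zip"], check=True)

# -- on the GPU box, once it has the code --
subprocess.run(["wormhole", "receive", "7-crossover-clockwork"], check=True)  # example code
```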
I tested this out: we just need the inference worker's IP.
Receiver Script:

```python
import os
import socket
import tqdm

SERVER_HOST = "0.0.0.0"
SERVER_PORT = 5001
BUFFER_SIZE = 4096
SEPARATOR = "<SEPARATOR>"  # any unique marker works; must match the sender

s = socket.socket()
s.bind((SERVER_HOST, SERVER_PORT))
s.listen(5)
print(f'[*] Listening as {SERVER_HOST}:{SERVER_PORT}')

client_socket, address = s.accept()
print(f'[+] {address} is connected')

# the first message carries "<filename><SEPARATOR><filesize>"
received = client_socket.recv(BUFFER_SIZE).decode()
filename, filesize = received.split(SEPARATOR)
filename = os.path.basename(filename)
filesize = int(filesize)

progress = tqdm.tqdm(range(filesize), f'Receiving {filename}', unit='B', unit_scale=True, unit_divisor=1024)
with open(filename, 'wb') as f:
    while True:
        bytes_read = client_socket.recv(BUFFER_SIZE)
        if not bytes_read:
            break
        f.write(bytes_read)
        progress.update(len(bytes_read))

client_socket.close()
s.close()
```
Sender Script:

```python
import os
import socket
import tqdm

SEPARATOR = "<SEPARATOR>"  # must match the receiver
BUFFER_SIZE = 4096

host = ""  # inference worker's IP
port = 5001
filename = "test.zip"
filesize = os.path.getsize(filename)

s = socket.socket()
print(f'[+] Connecting to {host}:{port}')
s.connect((host, port))
print('[+] Connected')

# send the file metadata first, then the raw bytes
s.send(f'{filename}{SEPARATOR}{filesize}'.encode())

progress = tqdm.tqdm(range(filesize), f'Sending {filename}', unit='B', unit_scale=True, unit_divisor=1024)
with open(filename, 'rb') as f:
    while True:
        bytes_read = f.read(BUFFER_SIZE)
        if not bytes_read:
            break
        s.sendall(bytes_read)
        progress.update(len(bytes_read))

s.close()
```
That looks great! Just a couple questions:
I do not want to expose my GPU box to the internet.
@rvencu I believe that magic-wormhole also stores your IP, allowing for the transfer. Maybe we can use Tor to provide security?
Seems that it supports Tor. Or not?
https://magic-wormhole.readthedocs.io/en/latest/tor.html#onion-servers
Hi, can you explain briefly how I would start a job on one computer and then upload the results from another? I mean transferring the downloaded dataset to a central GPU unit for filtering, and I would like to upload the final result to your server directly from the GPU unit instead of going back to the original computer.
While a more generic approach to the 2 steps would require your server to manage the half-finished jobs and distribute the workload based on node type, I first want to try deploying my own droplets for scraping and moving inference to my home GPU PC. Later the code can be adapted for mass usage.
So I am looking at how to pass a job from one crawlingathome client to another one.