#include sorry_for_slow_response.h
Hi!
Can you please explain how you specify the node's address?
Unless there's some special networking wizardry on that cluster, you should be able to specify 0.0.0.0 instead of your IP address in --host_maddrs and it should work normally.
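For example, here's a minimal sketch of what that looks like when launching a server; the model name and port are placeholders, and the exact run_server invocation may differ between Petals versions:

```bash
# Bind the server to all network interfaces instead of a specific IP.
# Model name and port are example values only.
python -m petals.cli.run_server bigscience/bloom-petals \
    --host_maddrs /ip4/0.0.0.0/tcp/31337
```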
While we figure this out, here's a quick workaround that should work on most machines:
export IPV4=$(dig -4 TXT +short o-o.myaddr.l.google.com @ns1.google.com | tr -d '"')
# or: export IPV6=$(dig -6 TXT +short o-o.myaddr.l.google.com @ns1.google.com | tr -d '"')
# if you do not have an IPv4 / IPv6 address, the corresponding variable will be empty
echo "run_stuff --host_maddrs /ip4/$IPV4/tcp/1337"
@Vahe1994 is also working on an automatic relaying script to make this even easier to set up, will keep you updated in this issue.
Hello, thanks for the reply! I was wondering, what would be the advantage of running a private Petals network instead of a torch.distributed or Hugging Face Accelerate run? Sorry if the question seems very basic to you.
Hi! If you have a swarm where all nodes have the same GPU / network specs and are 100% reliable, you should prefer torch.distributed, or even deepspeed.inference.
If your GPUs are preemptible, e.g. other people sometimes want to use them and you need to shut down some of the nodes, Petals can handle that, while torch.distributed would require a lot of extra effort.
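To make the "private Petals network" part concrete, here's a rough sketch of how a small private swarm can be started. The model name, port, and peer multiaddress below are placeholders, and the exact flags (--host_maddrs, --initial_peers) should be checked against your Petals version:

```bash
# On the first machine: start a server bound to all interfaces.
# It prints its full multiaddresses (including the peer ID) on startup.
python -m petals.cli.run_server bigscience/bloom-petals \
    --host_maddrs /ip4/0.0.0.0/tcp/31337

# On every other machine: join the same private swarm by pointing
# --initial_peers at a multiaddress printed by the first server
# (the address and peer ID below are made up).
python -m petals.cli.run_server bigscience/bloom-petals \
    --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/QmExamplePeerID
```

If one of these servers is preempted and shuts down, the remaining ones keep serving their blocks, which is the fault tolerance advantage described above.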
One small addition to @justheuristic's response: as far as I know, neither torch.distributed nor DS-Inference provide you with a full-fledged setup for running a model inference server, only the building blocks for parallelism and various inference optimizations. That's fine if you want to implement the actual server yourself, but if you need a complete solution for exposing models to external requests, you'd be better off with something like Triton (or Petals!)
Having to hard-code IP addresses makes it very hard to run Petals on a SLURM cluster. There I submit batch jobs that are then run on some node of the partition I specified, so I do not know beforehand the IP of the node (or nodes) that a Petals server instance will run on.
So one thing that would be helpful is "self-discovery" of Petals server instances inside a specified network.
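Until something like that exists, one possible workaround (just a sketch, not an official Petals feature) is to resolve the node's IP at runtime inside the SLURM job script instead of hard-coding it. The SLURM directives, model name, and port below are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=petals-server
#SBATCH --gres=gpu:1

# Resolve this node's primary IP at runtime instead of hard-coding it:
# `hostname -I` prints all addresses of the allocated node; take the first.
NODE_IP=$(hostname -I | awk '{print $1}')

# Placeholder model name and port; substitute your actual launch command.
python -m petals.cli.run_server bigscience/bloom-petals \
    --host_maddrs /ip4/${NODE_IP}/tcp/31337
```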