bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

add 5 GPUs to production network #899

Closed · lukemarsden closed this issue 1 year ago

lukemarsden commented 1 year ago

splitting out from #715

lukemarsden commented 1 year ago

GCP have so far denied our requests

lukemarsden commented 1 year ago

Where will we get GPUs? Check OVH and Hetzner, or failing that, Luke's desk

lukemarsden commented 1 year ago

Phil is trying different regions

philwinder commented 1 year ago

GCP have actually REDUCED our limit down to 2 now. I have altered the Terraform code so that we have a second GPU node in Europe. That works. https://github.com/filecoin-project/bacalhau/commit/fd576f28cdfb26d3be33e774efd11de9e0714574

However, when submitting GPU jobs, I can't get two of them scheduled at the same time.

For example, run two of these:

bacalhau docker run --gpu 1 jsacex/stable-diffusion-keras -- python stable-diffusion.py --o ./outputs

And only one will run at a time. Alternatively, run multiple of these:

docker run --gpu=1 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

And note that the node hash at the end of the output is always the same (the one ending in QQ).

So there's an issue scheduling two GPU jobs. I don't know whether that node hasn't joined the cluster or whether something else is going on, but I don't have time to debug right now.

binocarlos commented 1 year ago

OK, I think there is a bug with the capacity manager. Having dug into the production nodes, I am sometimes seeing GPU jobs running in parallel and sometimes sequentially...

When they run sequentially, bids are being made and accepted by the node that is already running the job, which indicates the capacity manager thinks it has capacity when it clearly does not. It's not actually starting the second job, but it is bidding on it, preventing the other node (which actually has the capacity) from winning the bid. I am writing a test now.
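
To make the suspected failure mode concrete, here is a minimal Go sketch of what a capacity-aware bid check should look like. All names here are made up for illustration; this is not Bacalhau's actual capacity manager API.

```go
package main

import "fmt"

// Resources is a minimal stand-in for the resource profile a compute node
// tracks (illustrative only, not Bacalhau's actual capacity manager types).
type Resources struct {
	GPUs int
}

// CapacityTracker records total capacity plus the resources already reserved
// by outstanding bids, including jobs that have been bid on but not started.
type CapacityTracker struct {
	Total    Resources
	Reserved Resources
}

// CanBid reports whether the node has enough *unreserved* GPUs for the job.
// The suspected bug above is equivalent to checking against Total instead of
// Total minus Reserved, so a busy node keeps winning bids it cannot serve.
func (c *CapacityTracker) CanBid(job Resources) bool {
	return c.Total.GPUs-c.Reserved.GPUs >= job.GPUs
}

// Reserve marks the job's resources as in use; Release must run when the job
// finishes (or the bid is rejected), otherwise the node looks full forever.
func (c *CapacityTracker) Reserve(job Resources) { c.Reserved.GPUs += job.GPUs }
func (c *CapacityTracker) Release(job Resources) { c.Reserved.GPUs -= job.GPUs }

func main() {
	node := &CapacityTracker{Total: Resources{GPUs: 1}}
	job := Resources{GPUs: 1}

	fmt.Println(node.CanBid(job)) // true: the node is idle
	node.Reserve(job)
	fmt.Println(node.CanBid(job)) // false: its single GPU is already reserved
	node.Release(job)
	fmt.Println(node.CanBid(job)) // true again once the job is released
}
```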

binocarlos commented 1 year ago

Yay! I've just had a breakthrough with the bug I've been banging my head against: why the second GPU node was not getting jobs scheduled to it. I was fairly sure there was a bug in the bidding & event system (because the second node should absolutely be taking that job on).

Having already broken production trying to debug this, I was like, "OK, surely I can replicate this with a test".

So the local test suite is testing exactly the same scenario (i.e. 2 GPU nodes and a job that takes a while, so it should be one job per node), and for the life of me I couldn't reproduce it... Then it struck me: network latency! The second GPU node is in a different zone (because of the adventure that Phil went on to get us more GPUs). So I added a config to the noop transport that introduces a network delay based on the node index, so that I can make every message coming from the second node be delayed, just like in real life. Nope, nothing: exactly the same results.
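
For anyone curious what "a network delay based on the node index" means in practice, here is a hedged Go sketch of the idea. The types and config shape are assumptions for illustration, not the actual noop transport code.

```go
package main

import (
	"fmt"
	"time"
)

// Message is a stand-in for an event flowing over an in-memory test transport
// (names are assumptions for illustration, not the real noop transport types).
type Message struct {
	FromNodeIndex int
	Payload       string
}

// DelayConfig simulates network latency: each node's messages are delayed in
// proportion to the node index, mimicking a node that lives in a farther zone.
type DelayConfig struct {
	PerNodeDelay time.Duration
}

// Deliver applies the artificial delay before handing the message on. Note the
// failure mode described below: if this config is never actually wired into
// the transport, the delay silently never triggers and the test proves nothing.
func (d DelayConfig) Deliver(msg Message, handle func(Message)) {
	time.Sleep(time.Duration(msg.FromNodeIndex) * d.PerNodeDelay)
	handle(msg)
}

func main() {
	cfg := DelayConfig{PerNodeDelay: 200 * time.Millisecond}
	start := time.Now()
	show := func(m Message) { fmt.Println(m.Payload, "arrived after", time.Since(start)) }

	// Node 0 is "local"; node 1 stands in for the GPU node in the other zone,
	// so its bid arrives noticeably later.
	cfg.Deliver(Message{FromNodeIndex: 0, Payload: "bid from node 0"}, show)
	cfg.Deliver(Message{FromNodeIndex: 1, Payload: "bid from node 1"}, show)
}
```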

I went to bed and think I was actually dreaming about this bug; I woke up with a strong sense of "the only way I can debug this is in production again", which was depressing (see my earlier post where I broke production). OMG, I've only just found out that my artificial network delay was not even triggering!!! Now that I've made it actually work: boom, I have a reproducer. Both jobs are getting scheduled in series to the first node :party-cat:

I am now very happy and will get the test to pass :smile:

binocarlos commented 1 year ago

> When they run sequentially, bids are being made and accepted by the node that is already running the job, which indicates the capacity manager thinks it has capacity when it clearly does not. It's not actually starting the second job, but it is bidding on it, preventing the other node (which actually has the capacity) from winning the bid. I am writing a test now.

This is not actually true. I think the issue here is that jobs were getting randomly dropped by CalculateJobNodeDistanceDelay. We now have an explicit test for network latency and running GPU jobs in parallel, so we know there isn't a bug in the bidding process in this scenario.

I've added a logging statement for the case where the CalculateJobNodeDistanceDelay function decides to drop a job, so we should at least be able to identify when this is causing the above problem.
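
Purely to illustrate the shape of the thing, here is a hedged Go sketch of a distance-based delay that drops jobs past a threshold and logs when it does so. This is not the real CalculateJobNodeDistanceDelay; the signature, thresholds, and log message are assumptions.

```go
package main

import (
	"log"
	"time"
)

// calculateJobNodeDistanceDelay sketches the idea referred to above (it is not
// the real Bacalhau function; the signature and thresholds are assumptions):
// the "farther" a node is from a job, the longer it waits before bidding, and
// beyond some maximum distance it does not bid at all.
func calculateJobNodeDistanceDelay(distance, maxDistance int) (time.Duration, bool) {
	if distance > maxDistance {
		// This is the branch the logging statement targets: without a log line,
		// a silently dropped job just looks like a node that never bid.
		log.Printf("dropping job: node distance %d exceeds max %d", distance, maxDistance)
		return 0, false
	}
	return time.Duration(distance) * 100 * time.Millisecond, true
}

func main() {
	for _, d := range []int{0, 1, 5} {
		delay, shouldBid := calculateJobNodeDistanceDelay(d, 3)
		log.Printf("distance=%d shouldBid=%v delay=%s", d, shouldBid, delay)
	}
}
```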

binocarlos commented 1 year ago

(screenshot: GPU quota limits for the bacalhau-production project in the GCP console)

https://console.cloud.google.com/iam-admin/quotas/qirs?project=bacalhau-production

^ So adding 5 GPUs is now blocked on getting a quota increase from Google Cloud. Not sure what we need to do to pay them more money :-)

binocarlos commented 1 year ago

Update on the above: it was nothing to do with CalculateJobNodeDistanceDelay after all, I was totally wrong. It was compute nodes bidding out of phase and then getting stuck in an "I have reserved resources that will never be unlocked" state.

https://github.com/filecoin-project/bacalhau/pull/921 is a potential fix for this
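
As a hedged illustration of the "reserved but never unlocked" failure mode (made-up types; this is not the actual change in #921):

```go
package main

import "fmt"

// node is an illustrative stand-in for a compute node's view of its own
// capacity (made-up names; this is not the code that #921 actually changes).
type node struct {
	totalGPUs    int
	reservedGPUs int
}

// bid reserves the job's GPUs at bid time if there is spare capacity.
func (n *node) bid(jobGPUs int) bool {
	if n.totalGPUs-n.reservedGPUs < jobGPUs {
		return false
	}
	n.reservedGPUs += jobGPUs
	return true
}

// release must run on *every* outcome of a bid: accepted-and-finished,
// rejected, or errored. The failure mode described above is the rejection
// path being missed when events arrive out of phase, so reservedGPUs stays
// stuck high and the node refuses all further bids.
func (n *node) release(jobGPUs int) {
	n.reservedGPUs -= jobGPUs
}

func main() {
	n := &node{totalGPUs: 1}

	fmt.Println("bid #1:", n.bid(1)) // true: GPU reserved
	// Suppose this bid is rejected but the rejection event is never handled,
	// so n.release(1) is skipped...
	fmt.Println("bid #2:", n.bid(1)) // false: the node looks permanently full

	// With the release wired up to the rejection event, capacity recovers.
	n.release(1)
	fmt.Println("bid #3:", n.bid(1)) // true again
}
```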

binocarlos commented 1 year ago

We've just done a bunch of testing and #921 has fixed the multiple-scheduling problem. So now we just need to increase our gcloud GPU quotas and we can close out this issue.

scenaristeur commented 1 year ago

Hi, I've just discovered Bacalhau and think it's a good project. I'm interested in Stable Diffusion, and I've found this "Horde" project with crowdsourced GPUs. Could it help? https://stablehorde.net/