Adding TorchBench SDL / Container Example for Gpu Benchmarking

rakataprime commented 1 year ago

Hello

The purpose of this issue is to add an sdl/container for running pytorch benchmarks with torchbench on gpu providers.

I have written an sdl and dockerfile. Pushing the docker image is taking forever (>3 hrs to dockerhub/ecr) due to its size. I may change our approach slightly to decrease the size of the image, at the cost of a slight delay at runtime start. I wanted to get some feedback from the gpu team/community before proceeding further.

We need to set some requirements for the provider benchmarks:

What is our timeout for large docker container pulls on Akash?
What is our desired time budget for running the benchmark, i.e. how long should it run?
What is our desired computational budget for running the benchmark / smallest provider resources to assume that we have?
What models are most important? The older models, e.g. ResNet, will be more comparable across platforms, but newer application-focused models like a dalle2 or llama inference run are more relevant to the end user.

Some added context: The container image is around 20 GB uncompressed; 6 GB of this is the pytorch runtime and the other 14 GB are the models and code used for benchmarking. We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add about a 5-10 minute start delay.

The actual benchmark itself would take about 5-8 hours if run sequentially on a MacBook Pro, skipping the gpu benchmarks. Meta currently runs the benchmark on a gpu cluster.

We don’t have to run every benchmark, though. The fastest approach would be to run a small subset of relevant benchmarks, with the torchbench install deferred until run time to decrease the container size.

anilmurty commented 1 year ago

Thanks for this @rakataprime -- here are some thoughts (will look to @chainzero @andy108369 @troian for additional inputs as well):

What is our timeout for large docker container pulls on Akash? [am] I am not sure if we have a limit right now. @troian or @andy108369 do you guys know?

What is our desired time budget for running the benchmark, i.e. how long should it run? [am] I was thinking we can leave it at the default (5 min).

What is our desired computational budget for running the benchmark / smallest provider resources to assume that we have? [am] This one is tricky. Our intention is to figure out the performance we get on different GPU models and we intend to build a decent number of providers in the first phase of the testnet (with the benchmarking happening in the second phase). So the computational budget would be a range of GPU+CPU combos and the goal would be to see which ones fail, which pass and of the ones that pass, what the relative performance is. Context: https://github.com/akash-network/community/blob/main/wg-gpu/GPU-AI-Incentivized-Testnet.md

What models are most important? The older models, e.g. ResNet, will be more comparable across platforms, but newer application-focused models like a dalle2 or llama inference run are more relevant to the end user. [am] Agreed on older models being fine for the benchmark. The general thinking is that we don't care about the latest and greatest for the benchmarking exercise (reliability and consistency are more important) and we'll have a separate set of tasks for "just deploying" (not benchmarking) models, for which we will attempt to deploy the "latest and greatest". In terms of models, the list below would be great to hit (we're hoping to produce results similar to https://lambdalabs.com/gpu-benchmarks):

[Image: Screenshot 2023-06-06 at 5 55 32 PM]

We could make the container a lot smaller by running the torchbench install script and downloading the models at run time, which would add about a 5-10 minute start delay. [am] 100% agree on doing this.

rakataprime commented 1 year ago

So if we look at the two tiers of gpus, we still have a lot of variation within those tiers.

Tier 1: H100, A100, V100, P100, A40, A10, P4, K80, T4, 4090, 4080, 3090Ti, 3090, 3080Ti, 3080, 3060Ti.

Tier 2: RTX 2060, 2070, 2080, 2080Ti, GTX 1030, 1050, 1050Ti, 1060, 1070, 1070Ti, 1080, 1080Ti, 1630, 1650, 1660, 1660Ti.

For instance, the latest cuda 11 releases deprecate support for the k80 generation of cards and earlier.

The lowest VRAM in tier 1 is the 3060Ti with 8 GB; the lowest VRAM in tier 2 is the 1630 with 4 GB.

We probably wouldn't want to run benchmarks like bert large on cards that don't have enough VRAM to actually run them. Right now, of the models shared between the lambda list and torchbench, the only ones that we couldn't run on all of tier 1 would be bert or other llms.
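The VRAM gating described above could be sketched roughly as follows. This is only an illustration: the two card sizes are the ones quoted in this comment, while the per-model VRAM floors are made-up placeholders, not measurements from torchbench.

```python
# VRAM per card in GB; both figures are the ones quoted in the thread.
GPU_VRAM_GB = {"3060Ti": 8, "GTX 1630": 4}

# Hypothetical minimum VRAM per model in GB (placeholders, not measured).
MODEL_MIN_VRAM_GB = {
    "resnet50": 4,
    "tacotron2": 6,
    "hf_Bert": 8,
    "hf_Bert_large": 16,
}


def runnable_models(vram_gb, requirements=MODEL_MIN_VRAM_GB):
    """Return the models whose assumed VRAM floor fits on the card, sorted by name."""
    return sorted(name for name, need in requirements.items() if need <= vram_gb)


if __name__ == "__main__":
    for card, vram in GPU_VRAM_GB.items():
        print(card, runnable_models(vram))
```

With real measurements in place of the placeholder table, the entrypoint could skip models up front instead of failing with an out-of-memory error partway through a run.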

The other kind of thorny issue is which cuda/cudnn version to install on nodes. I think k8s is still limited to one driver version and one cuda version on the nodes. Even if you could do multiple cuda versions, it would hurt distributed training if the pool was highly fragmented, because a deployment would only be able to work with the fraction of nodes that are compatible with its cuda version. If you have to keep the newly deprecated cards, it may be better to move just that generation of cards to the last supported cuda version and bump the others to the latest.
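To put a rough number on the fragmentation concern, here is a minimal sketch. It follows the premise of this comment and assumes each node exposes exactly one cuda version and that a job can only use nodes matching its image exactly; the node pool is invented for illustration.

```python
from collections import Counter


def usable_fraction(node_cuda_versions, required):
    """Fraction of nodes whose cuda version equals the one the image needs.

    Exact-match compatibility is a simplifying assumption for this sketch.
    """
    if not node_cuda_versions:
        return 0.0
    counts = Counter(node_cuda_versions)
    return counts[required] / len(node_cuda_versions)


if __name__ == "__main__":
    # Hypothetical pool: 6 nodes on 11.7, 3 on 11.4, 1 on 12.1.
    pool = ["11.7"] * 6 + ["11.4"] * 3 + ["12.1"]
    print(usable_fraction(pool, "11.7"))
```

Under these assumptions, a training job built against cuda 11.7 could only be scheduled on 6 of the 10 nodes in the example pool.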

Currently the torchbench container is using pytorch 2.0.1-cuda11.7-cudnn8-runtime with python 3.10

There are some major performance improvements for generative ai with the latest cuda 11 and pytorch 2, especially for stable diffusion, compared with pytorch 1 and cuda versions from before Jan 2022.

The relevant torchbench models currently supported in that list from lambda labs are: resnet50, hf_Bert, hf_Bert_large, tacotron2.

The models not currently included are ssd, gnmt, transformerxlbase, transformerxllarge, and baseglow. We could substitute the transformerxl with longformer, and ssd with yolov3. I'm not sure what would be a similar model for gnmt that is already in torchbench.

If that subset of the shared models is sufficient, then I can refactor the container to install at run time and update the entrypoint to only benchmark those shared models. Once we have a list of core models and a smaller container, we will have a better sense of where we stand relative to the 5 min gpu benchmark goal.
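One way an entrypoint could keep a run near the 5 min goal is sketched below. It is not the actual torchbench entrypoint: the function name and the `(name, callable)` interface are assumptions, and each callable stands in for whatever actually launches a model benchmark.

```python
import time


def run_within_budget(benchmarks, budget_s=300.0, clock=time.monotonic):
    """Run (name, callable) pairs in order until the time budget is used up.

    Returns (results, skipped): results maps name -> elapsed seconds, and
    skipped lists the names that were never started. A benchmark that is
    already running is not interrupted, so the total can overshoot.
    """
    start = clock()
    results, skipped = {}, []
    for name, fn in benchmarks:
        if clock() - start >= budget_s:
            skipped.append(name)
            continue
        t0 = clock()
        fn()
        results[name] = clock() - t0
    return results, skipped


if __name__ == "__main__":
    # Placeholder workloads; real ones would invoke the torchbench models.
    demo = [("resnet50", lambda: time.sleep(0.01)),
            ("tacotron2", lambda: time.sleep(0.01))]
    print(run_within_budget(demo, budget_s=300.0))
```

Ordering the list by priority means the most comparable models (for example resnet50) always run, and the skipped list records what a slower card did not get to.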

anilmurty commented 1 year ago

Thanks for the details @rakataprime - the substitutions of the models you mentioned sound fine.

The Cuda version issue should only arise in the case of a heterogeneous provider (more than one GPU type in the same cluster) where the GPU models in the cluster require different Cuda versions, right? I think that may be a relatively uncommon case for the testnet (but could be a problem).

Thinking of the logistics of all this, is it better if we just built an SDL (or more than one SDL) that deployed a jupyter notebook with the correct python kernel and pytorch included? At least for the tensorflow models, the approach I was thinking we could take would be to have people run https://github.com/akash-network/awesome-akash/tree/master/tensorflow-jupyter-mnist and then use that instance to run the models from the list in https://github.com/tensorflow/models/tree/master/official

rakataprime commented 1 year ago

@anilmurty, if you don't actively try to corral the providers into standardized cuda versions, it would prevent people from running training jobs like foundation models across multiple providers, because the sdl includes 1 docker container for the training job with a cuda version dependency. My startup wants to train a foundation model with akash (lmk if you want to discuss a formal partnership on this more), but would want to train across a huge cluster of gpus, not just 1 provider. I think you can have gpu heterogeneity, but you want them to be on the same cuda/cudnn version and preferably to have a known minimum vram. I think in k8s you can set gpu requirements for vram with a helm plugin. Are VRAM resource requirements a setting in sdl right now? I'm not sure if I saw that in the docs.

I don't like the notebooks because they are prone to people executing cells out of order and ending up with non-functional code. You could do a notebook and then have people export it as a pdf after the benchmark runs, but the formatting of console-like output usually isn't that great. I think we would probably be better off writing json output somewhere else, like an s3 compatible bucket, ipfs, or an internal database, for aggregation. It might be a lot of data to write on chain, but you could certainly write some of the summary data on chain easily. I don't know if there is an easy cosmos python client, though, and you may have to use rust through a python rust bridge to do that easily.
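A minimal sketch of what such a json record could look like is below. The field names and the `schema` tag are hypothetical, not an agreed format, and uploading the result to a bucket or ipfs is left out.

```python
import json
import platform
from datetime import datetime, timezone


def build_report(gpu_model, results):
    """Bundle per-model timings into one JSON-serializable record.

    All field names here are illustrative assumptions.
    """
    return {
        "schema": 1,  # hypothetical version tag for the record layout
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "host_python": platform.python_version(),
        "gpu_model": gpu_model,
        "results": [
            {"model": name, "seconds": round(seconds, 3)}
            for name, seconds in sorted(results.items())
        ],
    }


def dump_report(report):
    """Serialize with sorted keys so submissions diff cleanly."""
    return json.dumps(report, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(dump_report(build_report("rtx3090", {"resnet50": 1.23456})))
```

A small fixed record like this is easy to aggregate across many submissions, and a short summary of it is the kind of data that could plausibly go on chain.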

anilmurty commented 1 year ago

hey @rakataprime - sorry for the late reply - somehow missed the notification of this. Would definitely be interested in discussing a partnership with you. I've reached out via discord DM to coordinate.

Re. notebooks - I was looking at them purely for the benchmarking exercise for the testnet and not really for use in production for training or inference.

Do you feel like the Pytorch SDL is usable now? Asking because I was planning to update the instructions to tell people to use either pytorch or tensorflow for the testnet exercise with a preference towards pytorch. Thanks!

rakataprime commented 1 year ago

@anilmurty, I think someone should test the torchbench sdl on the gpu testnet before we say it's usable. I believe it is currently usable, but we should test that assumption since the gpu testnet is up now. If we want jupyter notebook usage, we should package jupyter notebook in the docker container or add a second container / sdl, with clear instructions, to make it as easy as possible for those who may not have used jupyter before. I would also clarify in those instructions how you want them to export the notebook if you want to look at 20+ submissions.

anilmurty commented 1 year ago

Thanks @rakataprime - I'll test this out and confirm https://github.com/akash-network/awesome-akash/blob/e115932a1b8e0536649a2d88f3a614f097ad2c43/torchbench/torchbench_gpu_sdl.yaml (@chainzero - would be great if you did too).

Is this usable for the jupyter notebook? https://github.com/akash-network/awesome-akash/tree/master/jupyter

anilmurty commented 1 year ago

hey @rakataprime - I just tested it and unfortunately it doesn't work because we have since added support for specifying some GPU attributes (vendor and model). Here are 3 examples of what the structure is like https://docs.akash.network/testnet/example-gpu-sdls

At a minimum, the SDL needs to be updated to include the "vendor" key as shown at https://docs.akash.network/testnet/example-gpu-sdls/specific-gpu-vendor by adding:

          attributes:
            vendor:
              nvidia:

It still doesn't return bids (probably because there are no GPU providers on the network that meet the requirements yet), but at least the SDL is valid.
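A quick pre-deploy check for the missing "vendor" key could look like the sketch below. It works on an SDL that has already been parsed into a dict, and it assumes the GPU block lives at `profiles.compute.<name>.resources.gpu`; that layout and the profile name `torchbench` are assumptions for illustration, so verify them against the linked examples.

```python
def profiles_missing_gpu_vendor(sdl):
    """Return names of compute profiles that request GPUs without a vendor attribute.

    Assumes the layout profiles.compute.<name>.resources.gpu; check the
    linked SDL examples before relying on this.
    """
    missing = []
    compute = sdl.get("profiles", {}).get("compute", {})
    for name, profile in compute.items():
        gpu = profile.get("resources", {}).get("gpu")
        if not gpu or not gpu.get("units"):
            continue  # profile does not request any GPU
        vendor = (gpu.get("attributes") or {}).get("vendor")
        if not vendor:
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    # "nvidia:" with no value in YAML parses to {"nvidia": None}.
    fixed = {"profiles": {"compute": {"torchbench": {"resources": {
        "gpu": {"units": 1, "attributes": {"vendor": {"nvidia": None}}}}}}}}
    print(profiles_missing_gpu_vendor(fixed))
```

An empty list means every GPU profile names a vendor; it does not guarantee that any provider will bid.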

rakataprime commented 1 year ago

@anilmurty the latest commit adds jupyter and an example notebook. It still needs to be tested on testnet. Also, the jupyter notebook implementation requires users to paste in the auth token from the logs to get access.

akash-network / awesome-akash

Adding TorchBench SDL / Container Example for Gpu Benchmarking #387