GoogleCloudPlatform / pubsec-declarative-toolkit

The GCP PubSec Declarative Toolkit is a collection of declarative solutions to help you on your Journey to Google Cloud. Solutions are designed using Config Connector and deployed using Config Controller.
Apache License 2.0

GCP TPU (Tensor Processing Unit) project template for TensorFlow 2.x LLM workload training/inference enablement via the landing zone - specifically TPUv5 #742

Open obriensystems opened 7 months ago

obriensystems commented 7 months ago

Bootstrap TPU project

# Baseline image is the GPU build; a TPU-capable base image will be needed for TPUv5
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

Modify for tf.distribute - MirroredStrategy first (multi-GPU), then TPUStrategy

https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#used-in-the-notebooks

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
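Swapping MirroredStrategy for TPUStrategy is roughly the following sketch (it can only run where a TPU is actually attached, e.g. a TPU VM; the `tpu=""` resolver argument assumes a locally attached TPU, otherwise pass the TPU name or grpc:// address):

```python
import tensorflow as tf

# Resolve and initialize the TPU system, then build the distribution strategy.
# tpu="" picks up a locally attached TPU; pass a name/address for a TPU node.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas:", strategy.num_replicas_in_sync)
```

The rest of the training code (model construction under `strategy.scope()`, compile, fit) stays the same, which is the point of the strategy abstraction.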

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50

https://keras.io/api/models/model/

with strategy.scope():
    parallel_model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100,
    )

https://saturncloud.io/blog/how-to-do-multigpu-training-with-keras/

Note: the parallel_model = multi_gpu_model(model, gpus=2) pattern from the article above is the legacy Keras API; tf.keras.utils.multi_gpu_model was removed in TensorFlow 2.x, so the MirroredStrategy scope above replaces it.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
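A note on from_logits=False: ResNet50 with include_top=True ends in a softmax layer, so the model emits probabilities and the loss must not re-apply softmax. A minimal pure-Python illustration of what the sparse cross-entropy computes (no TensorFlow; names are illustrative):

```python
import math

def softmax(logits):
    # Convert raw scores to probabilities that sum to 1
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sparse_cce(probs, label):
    # Cross-entropy for an integer class label against probabilities;
    # this is what from_logits=False expects as input
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
# from_logits=True would take `logits` directly; from_logits=False takes `probs`
print(sparse_cce(probs, 0))
```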

https://keras.io/api/models/model_training_apis/

parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)  # also tried 5120, 7168
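Note that the batch_size passed to Model.fit is the global batch under a tf.distribute strategy: each replica receives global_batch // num_replicas examples per step, which is why the batch size was scaled up (5120, 7168) as more accelerators were added. A quick sketch of the split (function name is illustrative):

```python
def per_replica_batch(global_batch: int, num_replicas: int) -> int:
    """Examples each replica sees per step under a tf.distribute strategy."""
    if global_batch % num_replicas != 0:
        raise ValueError("global batch should divide evenly across replicas")
    return global_batch // num_replicas

print(per_replica_batch(256, 2))   # 2 GPUs -> 128 each
print(per_replica_batch(7168, 8))  # 8 accelerators -> 896 each
```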



<img width="904" alt="Screenshot 2023-11-28 at 09 38 37" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/e3a26b3d-b7b8-44ee-b7df-a78b0cc617e9">

## Reference
- detailed comparison of NVIDIA Tensor Cores in the latest on-prem cards: https://resources.nvidia.com/en-us-design-viz-stories-ep/rtx-5000-ada-datasheet?lx=CCKW39&contentType=data-sheet
- detailed comparison of GCP L4 VMs with NVIDIA Tensor Cores: https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus
obriensystems commented 7 months ago

Add TPUv5 capability

obriensystems commented 7 months ago

P.256 of Generative Deep Learning, 2nd Edition - David Foster
- https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9
- https://github.com/allenai/allennlp/discussions/5056
- https://support.terra.bio/hc/en-us/community/posts/4787320149915-Requester-Pays-Google-buckets-not-asking-for-project-to-bill

C4 = Colossal Clean Crawled Corpus. Download started 20231203:0021; estimated $100 US for GCS egress. An average of 300 Mbps with peaks of 900 Mbps from the GCP bucket means 800 GB x 8 = 6400 Gbits, which at 0.3 Gbps is ~6 hours. Observed: 36 GB in 26 min = ~25 MB/sec = 200 Mbps, for an ETA of ~11 h (possibly limited by the HDD - go directly to NVMe next time).
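The transfer-time arithmetic above can be checked with a small helper (figures from the comment; the function name is illustrative):

```python
def transfer_hours(size_gb: float, rate_gbps: float) -> float:
    """Hours to move size_gb gigabytes at rate_gbps gigabits/second."""
    gigabits = size_gb * 8  # bytes to bits
    return gigabits / rate_gbps / 3600

# 800 GB at the nominal 0.3 Gbps (300 Mbps) average:
print(round(transfer_hours(800, 0.3), 1))  # ~5.9 hours
# Same 800 GB at the observed 200 Mbps (25 MB/s):
print(round(transfer_hours(800, 0.2), 1))  # ~8.9 hours
```

The observed ~11 h ETA is somewhat above the ~8.9 h this predicts at a sustained 200 Mbps, consistent with throughput dropping below that rate (e.g. the suspected HDD bottleneck).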

$93 US for GCS egress

Screenshot 2023-12-04 at 09 39 39