GoogleCloudPlatform / pubsec-declarative-toolkit

The GCP PubSec Declarative Toolkit is a collection of declarative solutions to help you on your Journey to Google Cloud. Solutions are designed using Config Connector and deployed using Config Controller.
Apache License 2.0

GCP TPU (Tensor Processing Unit) project template for TensorFlow 2.x LLM workload training/inference enablement via the landing zone - specifically TPUv5 #742

Open obriensystems opened 7 months ago

obriensystems commented 7 months ago

Bootstrap TPU project

# Baseline image is the GPU build; a TPU-capable base image will be needed for TPUv5
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

Modify for tf.distribute - MirroredStrategy first (multi-GPU), then TPUStrategy

https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#used-in-the-notebooks

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
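Swapping MirroredStrategy for TPUStrategy is roughly the following sketch (it can only run where a TPU is actually attached, e.g. a TPU VM; the `tpu=""` resolver argument assumes a locally attached TPU, otherwise pass the TPU name or grpc:// address):

```python
import tensorflow as tf

# Resolve and initialize the TPU system, then build the distribution strategy.
# tpu="" picks up a locally attached TPU; pass a name/address for a TPU node.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas:", strategy.num_replicas_in_sync)
```

The rest of the training code (model construction under `strategy.scope()`, compile, fit) stays the same, which is the point of the strategy abstraction.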

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50

https://keras.io/api/models/model/

with strategy.scope():
    parallel_model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100,
    )

https://saturncloud.io/blog/how-to-do-multigpu-training-with-keras/

Note: the parallel_model = multi_gpu_model(model, gpus=2) pattern from the article above is the legacy Keras API; tf.keras.utils.multi_gpu_model was removed in TensorFlow 2.x, so the MirroredStrategy scope above replaces it.

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
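A note on from_logits=False: ResNet50 with include_top=True ends in a softmax layer, so the model emits probabilities and the loss must not re-apply softmax. A minimal pure-Python illustration of what the sparse cross-entropy computes (no TensorFlow; names are illustrative):

```python
import math

def softmax(logits):
    # Convert raw scores to probabilities that sum to 1
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sparse_cce(probs, label):
    # Cross-entropy for an integer class label against probabilities;
    # this is what from_logits=False expects as input
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
# from_logits=True would take `logits` directly; from_logits=False takes `probs`
print(sparse_cce(probs, 0))
```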

https://keras.io/api/models/model_training_apis/

parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)  # also tried 5120, 7168
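Note that the batch_size passed to Model.fit is the global batch under a tf.distribute strategy: each replica receives global_batch // num_replicas examples per step, which is why the batch size was scaled up (5120, 7168) as more accelerators were added. A quick sketch of the split (function name is illustrative):

```python
def per_replica_batch(global_batch: int, num_replicas: int) -> int:
    """Examples each replica sees per step under a tf.distribute strategy."""
    if global_batch % num_replicas != 0:
        raise ValueError("global batch should divide evenly across replicas")
    return global_batch // num_replicas

print(per_replica_batch(256, 2))   # 2 GPUs -> 128 each
print(per_replica_batch(7168, 8))  # 8 accelerators -> 896 each
```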



<img width="904" alt="Screenshot 2023-11-28 at 09 38 37" src="https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/assets/24765473/e3a26b3d-b7b8-44ee-b7df-a78b0cc617e9">

## Reference
- detailed comparison of NVIDIA Tensor Cores in the latest on-prem cards: https://resources.nvidia.com/en-us-design-viz-stories-ep/rtx-5000-ada-datasheet?lx=CCKW39&contentType=data-sheet
- detailed comparison of GCP L4 VMs with NVIDIA Tensor Cores: https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus
obriensystems commented 7 months ago

Add TPUv5 capability

obriensystems commented 7 months ago

P.256 of Generative Deep Learning, 2nd Edition - David Foster
- https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9
- https://github.com/allenai/allennlp/discussions/5056
- https://support.terra.bio/hc/en-us/community/posts/4787320149915-Requester-Pays-Google-buckets-not-asking-for-project-to-bill

C4 = Colossal Clean Crawled Corpus. Download started 20231203:0021; estimated $100 US for GCS egress. An average of 300 Mbps with peaks of 900 Mbps from the GCP bucket means 800 GB x 8 = 6400 Gbits, which at 0.3 Gbps is ~6 hours. Observed: 36 GB in 26 min = ~25 MB/sec = 200 Mbps, for an ETA of ~11 h (possibly limited by the HDD - go directly to NVMe next time).
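The transfer-time arithmetic above can be checked with a small helper (figures from the comment; the function name is illustrative):

```python
def transfer_hours(size_gb: float, rate_gbps: float) -> float:
    """Hours to move size_gb gigabytes at rate_gbps gigabits/second."""
    gigabits = size_gb * 8  # bytes to bits
    return gigabits / rate_gbps / 3600

# 800 GB at the nominal 0.3 Gbps (300 Mbps) average:
print(round(transfer_hours(800, 0.3), 1))  # ~5.9 hours
# Same 800 GB at the observed 200 Mbps (25 MB/s):
print(round(transfer_hours(800, 0.2), 1))  # ~8.9 hours
```

The observed ~11 h ETA is somewhat above the ~8.9 h this predicts at a sustained 200 Mbps, consistent with throughput dropping below that rate (e.g. the suspected HDD bottleneck).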

$93 US for GCS egress

Screenshot 2023-12-04 at 09 39 39