ObrienlabsDev / machine-learning

Machine Learning - AI - Tensorflow - Keras - NVidia - Google
MIT License

Work with Google C4 dataset of common crawl #5

Open obriensystems opened 7 months ago

obriensystems commented 7 months ago

P.256 of Generative Deep Learning 2nd Edition - David Foster
https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9
https://github.com/allenai/allennlp/discussions/5056
https://support.terra.bio/hc/en-us/community/posts/4787320149915-Requester-Pays-Google-buckets-not-asking-for-project-to-bill

C4 = Colossal Clean Crawled Corpus

Start 20231203:0021 - estimate $100 US for GCS egress.

An average of 300 Mbps with peaks of 900 Mbps from the GCP bucket means 800 GB x 8 bits = 6400 Gbits at 0.3 Gbps = ~6 hours.

Observed so far: 36 GB in 26 min = ~25 MB/sec = ~200 Mbps, ETA ~11h (possibly limited by the HDD - go directly to NVMe next time).
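The estimate above can be rechecked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes decimal units (1 GB = 1000 MB, 1 byte = 8 bits), whereas gsutil reports GiB/MiB, so its own ETA can differ somewhat.

```python
def transfer_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at rate_mbps (megabits/s)."""
    gigabits = size_gb * 8                  # 1 byte = 8 bits
    seconds = gigabits * 1000 / rate_mbps   # Gbit -> Mbit, then divide by Mbit/s
    return seconds / 3600


def observed_rate_mbps(size_gb: float, minutes: float) -> float:
    """Average rate in megabits/s when size_gb was moved in the given minutes."""
    megabytes_per_sec = size_gb * 1000 / (minutes * 60)
    return megabytes_per_sec * 8


if __name__ == "__main__":
    # Planned case from the note: 800 GB at an average of 300 Mbps.
    print(f"800 GB at 300 Mbps: {transfer_hours(800, 300):.1f} h")

    # Observed case from the note: 36 GB copied in the first 26 minutes.
    rate = observed_rate_mbps(36, 26)
    print(f"36 GB in 26 min: {rate:.0f} Mbps")
    print(f"800 GB at that rate: {transfer_hours(800, rate):.1f} h")
```

Feeding in the full 812.4 GiB reported by gsutil (about 872 GB) instead of 800 GB pushes the observed-rate estimate closer to the 11h figure.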

$93 US for GCS egress

Screenshot 2023-12-04 at 09 39 39

E:\c4\c4\en>gsutil -m -u your-project-id cp "gs://allennlp-tensorflow-datasets/c4/en/3.0.1/*" .
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00002-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00013-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00018-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00006-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00015-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00001-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00008-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00017-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00020-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00003-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00009-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00000-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00016-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00021-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00019-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00004-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00010-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00023-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00022-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00007-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00014-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00011-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00005-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-00012-of-01024...
/ [0/1.0k files][    0.0 B/812.4 GiB]   0% Done

0845
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-train.tfrecord-01023-of-01024...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00000-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00001-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00002-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00003-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00004-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00005-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00006-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/c4-validation.tfrecord-00007-of-00008...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/dataset_info.json...
Copying gs://allennlp-tensorflow-datasets/c4/en/3.0.1/features.json...
\ [1.0k/1.0k files][812.4 GiB/812.4 GiB] 100% Done  97.1 MiB/s ETA 00:00:00
Operation completed over 1.0k objects/812.4 GiB.
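
After a copy this long it may be worth confirming that no shard was skipped. A minimal sketch, assuming the file set is exactly what the log shows (1024 train shards, 8 validation shards, dataset_info.json and features.json); the bucket may hold other objects that do not appear in the excerpt, and the helper names here are hypothetical.

```python
from pathlib import Path


def expected_files(train_shards: int = 1024, val_shards: int = 8) -> list[str]:
    """File names inferred from the gsutil log (an assumption, not an official manifest)."""
    names = [
        f"c4-train.tfrecord-{i:05d}-of-{train_shards:05d}"
        for i in range(train_shards)
    ]
    names += [
        f"c4-validation.tfrecord-{i:05d}-of-{val_shards:05d}"
        for i in range(val_shards)
    ]
    names += ["dataset_info.json", "features.json"]
    return names


def missing_files(directory: str) -> list[str]:
    """Expected names that are not present as regular files in directory."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in expected_files() if name not in present]


if __name__ == "__main__":
    # "." assumes the script is run from the download directory used in the log.
    missing = missing_files(".")
    print(f"{len(missing)} of {len(expected_files())} expected files missing")
    for name in missing[:10]:
        print("  ", name)
```

This only checks names, not sizes or checksums, so a truncated shard would still pass.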