@matthayes I wonder if you've seen that? The code looks right; I'm wondering whether the colons in the filename are somehow making it look like a URL, and/or whether this is an HF bug. But maybe you can confirm or deny that code with those types of paths should be working.
I am having this issue as well. I will try changing the timestamp definition to remove the colons:
timestamp = datetime.now().strftime("%Y-%m-%dT%H%M%S")
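For reference, the two formats produce directory names like these (a standalone sketch; the original format is assumed to be "%Y-%m-%dT%H:%M:%S", which matches the directory names in the logs below):

from datetime import datetime

now = datetime(2023, 3, 30, 22, 9, 8)  # fixed value purely for illustration

# Assumed original format: produces names like dolly__2023-03-30T22:09:08 (contains colons)
print(now.strftime("%Y-%m-%dT%H:%M:%S"))  # 2023-03-30T22:09:08
# Colon-free variant from the comment above
print(now.strftime("%Y-%m-%dT%H%M%S"))    # 2023-03-30T220908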
Removing colons from the path does not work. I also tried it with a local path instead of the absolute path.
Using the stack trace, I believe the Hugging Face transformers code that determines whether it is a local path is here: https://github.com/huggingface/transformers/blob/v4.27.4/src/transformers/utils/hub.py#L376
Specifically, it checks whether the path is a directory using os.path.isdir(path_or_repo_id).
From that, and based on the docstring for AutoTokenizer.from_pretrained(), it should accept these types of paths:
Params:
pretrained_model_name_or_path (`str` or `os.PathLike`):
Can be either:
- A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co.
Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
user or organization name, like `dbmdz/bert-base-german-cased`.
- A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
- A path or url to a single saved vocabulary file if and only if the tokenizer only requires a
single vocabulary file (like Bert or XLNet), e.g.: `./my_model_directory/vocab.txt`. (Not
applicable to all derived classes)
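To make the failure mode concrete, here is a minimal standalone sketch (not the repo's code) of the decision transformers makes with the path from the stack trace; if the directory does not exist, the string falls through to the Hub repo-id validation, which rejects the extra slashes and colons with HFValidationError:

import os
from transformers import AutoTokenizer  # assumes transformers is installed

local_output_dir = "/root/dolly_training/dolly__2023-03-30T01:11:56"  # path from this issue

if os.path.isdir(local_output_dir):
    # Existing directory: loaded as local files; colons in the name are fine
    tokenizer = AutoTokenizer.from_pretrained(local_output_dir)
else:
    # Missing directory: from_pretrained would treat the string as a Hub repo id
    # and raise HFValidationError, so fail with a clearer message instead
    raise FileNotFoundError(f"{local_output_dir} does not exist; did training complete?")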
Have you checked whether the path /root/dolly_training/dolly__2023-03-30T01:11:56 was created successfully? Looking at the code linked by @zcking, it appears that the directory may not exist.
Also, can you confirm that training succeeded? That could be another reason why the path doesn't exist.
I think Matt has a point; please see the training trace:
2023-03-30 22:09:13 INFO [root] Exception while sending command.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
[2023-03-30 22:09:18,392] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-30 22:09:18,402] [INFO] [runner.py:548:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-75c87d05-950d-4b4f-afed-90d6fb141b40/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --deepspeed /Workspace/Repos/paulo.borges@databricks.com/dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir /root/dolly_training/dolly__2023-03-30T22:09:08 --dbfs-output-dir /dbfs/dolly_training/dolly__2023-03-30T22:09:08 --per-device-train-batch-size 8 --per-device-eval-batch-size 8 --lr 1e-5
[2023-03-30 22:09:22,283] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-30 22:09:22,284] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-30 22:09:22,284] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-30 22:09:22,284] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-30 22:09:22,284] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
2023-03-30 22:09:24.104552: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-30 22:09:24.241381: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-30 22:09:33 INFO [__main__] Loading tokenizer for EleutherAI/gpt-j-6B
2023-03-30 22:09:33 INFO [__main__] Loading model for EleutherAI/gpt-j-6B
[2023-03-30 22:11:39,404] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 4322
[2023-03-30 22:11:39,434] [ERROR] [launch.py:324:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-75c87d05-950d-4b4f-afed-90d6fb141b40/bin/python', '-u', '-m', 'training.trainer', '--local_rank=0', '--deepspeed', '/Workspace/Repos/paulo.borges@databricks.com/dolly/config/ds_z3_bf16_config.json', '--epochs', '1', '--local-output-dir', '/root/dolly_training/dolly__2023-03-30T22:09:08', '--dbfs-output-dir', '/dbfs/dolly_training/dolly__2023-03-30T22:09:08', '--per-device-train-batch-size', '8', '--per-device-eval-batch-size', '8', '--lr', '1e-5'] exits with return code = -9
When I run
%ls /dbfs/dolly_training
I don't see the dolly__2023-03-30T22:09:08 directory.
It appears to be crashing while loading the model. Maybe OOM? What machine type are you using? It'd help if we checked that the path exists after training and provided a more user-friendly message. I can make that update.
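Something along these lines, for example (a hypothetical sketch of a check in the notebook after the deepspeed run, not the actual change; the directory name is copied from the logs above):

import os

local_output_dir = "/root/dolly_training/dolly__2023-03-30T22:09:08"  # hypothetical placeholder

if not os.path.isdir(local_output_dir):
    raise RuntimeError(
        f"Training output not found at {local_output_dir}. Training likely failed before "
        "saving the model (for example, the process was killed after running out of memory); "
        "check the training logs above instead of loading the model."
    )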
It was an OOM issue; I upgraded to the following cluster config and it's now training:
{
"num_workers": 0,
"cluster_name": "LLM Cluster",
"spark_version": "12.2.x-gpu-ml-scala2.12",
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*, 4]"
},
"aws_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK",
"zone_id": "auto",
"spot_bid_price_percent": 100,
"ebs_volume_count": 0
},
"node_type_id": "g5.48xlarge",
"driver_node_type_id": "g5.48xlarge",
"ssh_public_keys": [],
"custom_tags": {
"ResourceClass": "SingleNode"
},
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3"
},
"autotermination_minutes": 120,
"enable_elastic_disk": true,
"cluster_source": "UI",
"init_scripts": [],
"single_user_name": "paulo.borges@databricks.com",
"enable_local_disk_encryption": false,
"data_security_mode": "SINGLE_USER",
"runtime_engine": "STANDARD",
"cluster_id": "0331-003509-44f5i1om"
}
I'm trying to train:
Model Type: EleutherAI/pythia-2.8b
Error Type: [HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name']
I believe this cluster configuration is capable enough to train this model.
Compute Detail:
{
"num_workers": 0,
"cluster_name": "DollyPOCCluster",
"spark_version": "12.2.x-gpu-ml-scala2.12",
"spark_conf": {
"spark.master": "local[*, 4]",
"spark.databricks.cluster.profile": "singleNode"
},
"aws_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK",
"zone_id": "auto",
"spot_bid_price_percent": 100,
"ebs_volume_count": 0
},
"node_type_id": "g4dn.2xlarge",
"driver_node_type_id": "g4dn.2xlarge",
"ssh_public_keys": [],
"custom_tags": {
"ResourceClass": "SingleNode"
},
"spark_env_vars": {},
"autotermination_minutes": 20,
"enable_elastic_disk": true,
"cluster_source": "UI",
"init_scripts": [],
"enable_local_disk_encryption": false,
"data_security_mode": "NONE",
"runtime_engine": "STANDARD",
"cluster_id": "0517-050920-dmg5higv"
}
RUN LOGS:
[2023-05-17 10:15:28,465] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-17 10:15:28,473] [INFO] [runner.py:550:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-07b069a9-fc74-46da-b629-0e873bf200ec/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --input-model EleutherAI/pythia-2.8b --deepspeed /Workspace/Repos/dolly/config/ds_z3_bf16_config.json --epochs 2 --local-output-dir /local_disk0/dolly_training/dolly2023-05-17T10-15-18 --dbfs-output-dir /dbfs/dolly_training/dolly__2023-05-17T10-15-18 --per-device-train-batch-size 6 --per-device-eval-batch-size 6 --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 200 --lr 5e-6
[2023-05-17 10:15:32,322] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-17 10:15:32,323] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-17 10:15:32,323] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-17 10:15:32,323] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-17 10:15:32,323] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
2023-05-17 10:15:34.898239: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-17 10:15:35.037391: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-17 10:15:43 INFO [__main__] Loading tokenizer for EleutherAI/pythia-2.8b
Downloading (…)okenizer_config.json: 100%|█████| 396/396 [00:00<00:00, 59.8kB/s]
Downloading (…)/main/tokenizer.json: 100%|█| 2.11M/2.11M [00:00<00:00, 6.89MB/s]
Downloading (…)cial_tokens_map.json: 100%|███| 99.0/99.0 [00:00<00:00, 59.9kB/s]
2023-05-17 10:15:45 INFO [__main__] Loading model for EleutherAI/pythia-2.8b
Downloading (…)lve/main/config.json: 100%|██████| 571/571 [00:00<00:00, 339kB/s]
Downloading pytorch_model.bin: 100%|████████| 5.68G/5.68G [00:31<00:00, 182MB/s]
2023-05-17 10:16:48 INFO [__main__] Found max lenth: 2048
2023-05-17 10:16:48 INFO [__main__] Loading dataset from /dbfs/FileStore/tables/clinical_dolly.jsonl
2023-05-17 10:16:49 WARNING [datasets.builder] Found cached dataset json (/root/.cache/huggingface/datasets/json/clinical_dolly.jsonl-764c9aeb7bf7fac0/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 543.51it/s]
2023-05-17 10:16:49 INFO [__main__] Found 26 rows
2023-05-17 10:16:49 INFO [__main__] Preprocessing dataset
2023-05-17 10:16:49 INFO [__main__] Processed dataset has 26 rows
2023-05-17 10:16:49 INFO [__main__] Processed dataset has 26 rows after filtering for truncated records
2023-05-17 10:16:49 INFO [__main__] Shuffling dataset
2023-05-17 10:16:49 INFO [__main__] Done preprocessing
2023-05-17 10:16:49 ERROR [__main__] main failed
Traceback (most recent call last):
File "/Workspace/Repos/dolly/training/trainer.py", line 329, in
Looks like you did not finish training or ran out of memory (OOM). Search for similar issues here.
I am getting a validation error on CMD 11:
Here's the traceback:
The error is occurring for both:
Cluster config: