CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0

Unable to train with custom data #77

Closed DineshReddyK closed 2 years ago

DineshReddyK commented 2 years ago

Hi, when I try to train a model from scratch I am facing the following error. The data_dir contains only a small amount of data, so I think the CPU should be sufficient in my case. What exactly could be causing this? @ncoop57 can you please take a look and help?

./run_clm_streaming_flax.py \
    --output_dir $HOME/fhgw-gpt-neo-125M-code-clippy \
    --dataset_name /home/fedora/explore/clippy/gpt-code-clippy/data_processing/code_clippy.py \
    --data_dir /mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM \
    --text_column_name="text" \
    --do_train --do_eval \
    --block_size="2048" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="16" \
    --preprocessing_num_workers="8" \
    --learning_rate="1e-4" \
    --max_steps 100000 \
    --warmup_steps 2500 \
    --decay_steps 25000 \
    --adam_beta1="0.9" \
    --adam_beta2="0.95" \
    --weight_decay="0.1" \
    --overwrite_output_dir \
    --logging_steps="100" \
    --eval_steps="500" \
    --push_to_hub="False" \
    --report_to="all" \
    --dtype="bfloat16" \
    --skip_memory_metrics="True" \
    --save_steps="500" \
    --save_total_limit 10 \
    --gradient_accumulation_steps 16 \
    --report_to="wandb" \
    --run_name="125m_1e-4lr_1024bs" \
    --max_eval_samples 2000 \
    --save_optimizer true

2022-01-06 08:27:11.271076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy/runs/Jan06_08-27-13_fedora.novalocal,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=100000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=125m_1e-4lr_1024bs,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=2500,
weight_decay=0.1,
xpu_backend=None,
)
WARNING:datasets.builder:Using custom data configuration default-01c596fb6133304a
Traceback (most recent call last):
  File "/usr/lib64/python3.7/pathlib.py", line 713, in __str__
    return self._str
AttributeError: _str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./run_clm_streaming_flax.py", line 774, in <module>
    main()
  File "./run_clm_streaming_flax.py", line 392, in main
    split="train"
  File "/usr/local/lib/python3.7/site-packages/datasets/load.py", line 1686, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/site-packages/datasets/builder.py", line 897, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in _split_generators
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in <listcomp>
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  [Previous line repeated 984 more times]
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 381, in xpathglob
    posix_path = _as_posix(path)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 172, in _as_posix
    path_as_posix = path.as_posix()
  File "/usr/lib64/python3.7/pathlib.py", line 726, in as_posix
    return str(self).replace(f.sep, '/')
  File "/usr/lib64/python3.7/pathlib.py", line 716, in __str__
    self._parts) or '.'
  File "/usr/lib64/python3.7/pathlib.py", line 695, in _format_parsed_parts
    return drv + root + cls._flavour.join(parts[1:])
RecursionError: maximum recursion depth exceeded while calling a Python object
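
For reference, the recursion starts at the glob in code_clippy.py's _split_generators: Path(f"{data_dir}/train").glob("*.jsonl.zst") runs under datasets' streaming download manager, whose patched xpathglob keeps re-entering itself (see the 984 repeated frames above) until Python's recursion limit is hit. A minimal sketch of a workaround that sidesteps the patched glob entirely, assuming the shards are plain local *.jsonl.zst files and that the generic "json" builder plus the zstandard package can stream them (both assumptions, not verified against this exact datasets version):

import glob
import os

from datasets import load_dataset

data_dir = "/mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM"

# Resolve the shard list with the standard-library glob, which returns plain
# strings and never touches datasets' patched pathlib.Path.glob (xpathglob).
train_files = sorted(glob.glob(os.path.join(data_dir, "train", "*.jsonl.zst")))
assert train_files, f"no .jsonl.zst shards found under {data_dir}/train"

# Assumption: the generic 'json' builder can stream zstd-compressed JSON Lines
# shards when the 'zstandard' package is installed.
dataset = load_dataset(
    "json",
    data_files={"train": train_files},
    split="train",
    streaming=True,
)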
reshinthadithyan commented 2 years ago

Hello, Dinesh. Please refer to https://github.com/CodedotAl/gpt-code-clippy/issues/74, which discusses this issue. Moreover, my suggestion would be to use the updated script(s) in the HF repository; they are more stable and heavily tested. Let me know if you need further assistance.
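
A quick sanity check before re-running, sketched under the assumption that the xpathglob recursion is version-dependent (issue #74 has the details): confirm which datasets release is installed and that zstandard is importable for the .jsonl.zst shards.

# Hypothetical pre-flight check; the package names are real, but no version
# threshold is asserted because the exact fix release isn't confirmed here.
import datasets
import zstandard  # noqa: F401  needed to decompress .jsonl.zst shards

print(datasets.__version__)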

DineshReddyK commented 2 years ago

Thank you so much for the quick response. Let me have a look at those!