Open pankajkumar229 opened 2 years ago
What OS are you trying to run this on? Also, it looks like you do not have CUDA installed properly, which will make it difficult to train quickly:
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
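As a quick sanity check (not from the original thread), you can ask the dynamic loader whether the CUDA runtime is discoverable at all. A minimal stdlib-only sketch:

```python
import ctypes.util

# Ask the dynamic loader whether the CUDA runtime library can be found.
# Returns a library name if discoverable, or None otherwise -- None matches
# the "Could not load dynamic library 'libcudart.so.11.0'" warning above.
found = ctypes.util.find_library("cudart")
print("libcudart discoverable:", found is not None)
```

If this prints `False`, the CUDA toolkit is missing or not on the loader path (e.g. `LD_LIBRARY_PATH`), so JAX/TensorFlow will fall back to CPU.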
Hi ncoop57, can you help me a little more? I am trying to load the data. I downloaded the dataset from the-eye.eu, but I am not able to correctly pass it to training. Please help.
pankaj@lc-tower1:~/source/gpt-code-clippy/training$ python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data
11/15/2021 11:34:58 - INFO - absl - Starting the local TPU driver.
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
11/15/2021 11:34:58 - WARNING - absl - No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
11/15/2021 11:34:58 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/data/opengpt/output/runs/Nov15_11-34-58_lc-tower1,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/data/opengpt/output/,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=output,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/data/opengpt/output/,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 811, in load_dataset
    module_path, hash, resolved_file_path = prepare_module(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 365, in prepare_module
    raise FileNotFoundError(
FileNotFoundError: Couldn't find file locally at /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py. Please provide a valid dataset name
Do you have this file stored here?
/data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py
If not, I believe this file is the same one: https://github.com/CodedotAl/gpt-code-clippy/blob/camera-ready/data_processing/code_clippy_filter.py If you copy that one and put it at the above path, I believe you should be good to go.
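For reference, when `load_dataset` is given a local directory, it looks for a loader script named after the directory itself. A quick check that the script is where `datasets` expects it (using the paths from the error above) might look like:

```python
from pathlib import Path

# Directory passed to --dataset_name in the failing command above
data_dir = Path("/data/opengpt/the-eye.eu/public/AI/training_data/"
                "code_clippy_data/code_clippy_dedup_data")

# datasets looks for a loader script named after the directory
script = data_dir / (data_dir.name + ".py")
print(script.name)                 # code_clippy_dedup_data.py
print("present:", script.exists())
```

If the second line prints `present: False`, that is exactly the `FileNotFoundError` shown in the traceback.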
I tried the other way. Is it possible that the huggingface method is not working anymore because the data page's format has changed? I will try the download method now.
Error details. Command used:
python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name CodedotAI/code_clippy
Output
eDatetime);\n\n\t\t\tvar getUrlParameter = function getUrlParameter(sParam) {\n\t\t\t\tvar sPageURL = window.location.search.substring(1),\n\t\t\t\tsURLVariables = sPageURL.split(\'&\'),\n\t\t\t\tsParameterName,\n\t\t\t\ti;\n\n\t\t\t\tfor (i = 0; i < sURLVariables.length; i++) {\n\t\t\t\t\tsParameterName = sURLVariables[i].split(\'=\');\n\n\t\t\t\t\tif (sParameterName[0] === sParam) {\n\t\t\t\t\t\treturn sParameterName[1] === undefined ? true : decodeURIComponent(sParameterName[1]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t};\n\t\t\tfunction toggle(className){\n\t\t\t\tvar order = getUrlParameter(\'order\');\n\t\t\t\tvar elements = document.getElementsByClassName(className);\n\t\t\t\tfor(var i = 0, length = elements.length; i < length; i++) {\n\t\t\t\t\tvar currHref = elements[i].href;\n\t\t\t\t\tif(order==\'desc\'){ \n\t\t\t\t\t\tvar chg = currHref.replace(\'desc\', \'asc\');\n\t\t\t\t\t\telements[i].href = chg;\n\t\t\t\t\t}\n\t\t\t\t\tif(order==\'asc\'){ \n\t\t\t\t\t\tvar chg = currHref.replace(\'asc\', \'desc\');\n\t\t\t\t\t\telements[i].href = chg;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t};\n\t\t\tfunction readableFileSize(size) {\n\t\t\t\tvar units = [\'B\', \'KB\', \'MB\', \'GB\', \'TB\', \'PB\', \'EB\', \'ZB\', \'YB\'];\n\t\t\t\tvar i = 0;\n\t\t\t\twhile(size >= 1024) {\n\t\t\t\t\tsize /= 1024;\n\t\t\t\t\t++i;\n\t\t\t\t}\n\t\t\t\treturn parseFloat(size).toFixed(2) + \' \' + units[i];\n\t\t\t}\n\n\t\t\tfunction changeSize() {\n\t\t\t\tvar sizes = document.getElementsByTagName("size");\n\n\t\t\t\tfor (var i = 0; i < sizes.length; i++) {\n\t\t\t\t\thumanSize = readableFileSize(sizes[i].innerHTML);\n\t\t\t\t\tsizes[i].innerHTML = humanSize\n\t\t\t\t}\n\t\t\t}\n\t\t</script>\n\t</body>\n</html>\n'
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 827, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 687, in load_dataset_builder
    builder_cls = import_main_class(module_path, dataset=True)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 91, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d/code_clippy.py", line 68, in <module>
    url_elements = results.find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
If there is a command you use, could you tell me? @ncoop57
After refusing to give up (while watching the last season of TBBT), I am finally able to at least get the data download to start.
Here is the command that I ran: """python3 run_clm_apps.py --output_dir /data/opengpt/output/ --cache_dir /data/opengpt/cache --dataset_name CodedotAI/code_clippy"""
Hello Pankaj. Are you trying to fine-tune a model with the dataset? If so, my suggestion would be the following:
1. Load the dataset as a datasets.Dataset object via datasets.load_dataset("PATH_TO_DATASETS_SCRIPT", split="train").
2. Pass the datasets.Dataset object to the standard, occasionally updated, HF-provided run_clm.py script, since this would provide you with a much quicker hack.
Let me know if I am understanding your problem right and whether the solution caters to your need. Feel free to make a PR if you find a fix for the existing bugs. Thanks.
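If it helps, the hand-off to the stock run_clm.py can also go through its `--train_file` option by flattening decompressed JSONL shards into plain text first. A minimal sketch (the helper name and the `"text"` field name are assumptions about the shard format, not something from this repo):

```python
import json

def jsonl_to_text(jsonl_path, out_path, field="text"):
    """Flatten one decompressed .jsonl shard into a plain-text training file."""
    with open(jsonl_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Each JSONL line is one example; keep only the code/text field.
            dst.write(json.loads(line)[field] + "\n")
```

The resulting file could then, in principle, be passed to run_clm.py as `--train_file train.txt` instead of going through a dataset script.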
Hi Reshinth,
I am trying to run it on my new GPU and see how good it can get, if possible. I am new to using transformers. So you are suggesting that I download the run_clm.py file, run it, and pass the code_clippy Python file as a parameter. Let me try that.
Hi @reshinthadithyan and @ncoop57, the data download often breaks with read timeouts. Is there a way to handle it?
I think I could download the data, but the command gives an error at the end, before loading the data (would 64 GB of memory suffice?):
/usr/bin/python3.8 /home/pankaj/tools/pycharm-community-2020.2/plugins/python-ce/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 45211 --file /home/pankaj/PycharmProjects/CodeClippy/CodeClippyMain.py
pydev debugger: process 3990648 is connecting
Connected to pydev debugger (build 202.6397.98)
Using the latest cached version of the module from /home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d (last modified on Sun Nov 21 18:46:34 2021) since it couldn't be found locally at ~/source/datasets/datasets/code_clippy/code_clippy.py/code_clippy.py or remotely (FileNotFoundError).
No config specified, defaulting to: code_clippy/code_clippy_dedup_data
Downloading and preparing dataset code_clippy/code_clippy_dedup_data (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /data/opengpt/code_clippy/code_clippy_dedup_data/0.1.0/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d...
Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1075, in _prepare_split
    writer.write(example, key)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 347, in write
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 661, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1077, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 417, in finalize
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
python-BaseException
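For what it's worth, this `ArrowTypeError` is the shape of failure Arrow raises when a column's type is inferred as string from early rows and a later row carries an int in the same field. A pure-Python illustration of that mismatch (the rows below are made up for illustration, not taken from the real shards):

```python
import json

# Hypothetical shard lines: the same field holds a str in one row, an int in the next.
lines = ['{"stars": "12"}', '{"stars": 7}']
rows = [json.loads(line) for line in lines]

# Collect the distinct Python types seen in the column
kinds = sorted({type(r["stars"]).__name__ for r in rows})
print(kinds)  # ['int', 'str'] -> Arrow cannot build a single typed column from this
```

If one of the downloaded shards mixes types like this, the Arrow writer fails exactly as in the traceback above.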
Hey @pankajkumar229, that should be enough as long as you read it with streaming mode enabled; otherwise it will not work. However, the error you are showing seems to be a different one, and I'm unsure why it is happening. Could you share the CodeClippyMain.py file you are running?
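In HF `datasets`, streaming mode (`load_dataset(..., streaming=True)`) yields examples one at a time instead of materializing the whole corpus in RAM. The same idea in a stdlib-only sketch over a tiny in-memory stand-in for a shard:

```python
import io
import json

def iter_examples(fileobj):
    # Yield one JSONL example at a time instead of loading everything into memory.
    for line in fileobj:
        yield json.loads(line)

# Tiny in-memory stand-in for a dataset shard:
shard = io.StringIO('{"text": "print(1)"}\n{"text": "print(2)"}\n')
first = next(iter_examples(shard))
print(first["text"])  # print(1)
```

This is why 64 GB can suffice for a corpus much larger than RAM: only the current example is held in memory at any time.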
Is there an easier way to get started?
I tried to set up a machine and install all the requirements. I will try to go further tomorrow, but maybe I am doing something wrong:
The error I am at currently is:
"""
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 342, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 14, in __init__
  File "run_clm_apps.py", line 174, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
"""
Also, getting the requirements to work was quite difficult on my machine. I wonder if I am doing something wrong.