CodedotAl / gpt-code-clippy

Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
Apache License 2.0

How to get started? #76

Open · pankajkumar229 opened this issue 2 years ago

pankajkumar229 commented 2 years ago

Is there an easier way to get started?

I tried to set up a machine and install all the requirements. I will try to go further tomorrow, but maybe I am doing something wrong.

The error I am currently stuck at is:

2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 342, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 14, in __init__
  File "run_clm_apps.py", line 174, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.

Also, getting the requirements to work was quite difficult on my machine. Wondering if I am doing something wrong.

ncoop57 commented 2 years ago

What OS are you trying to run this on? Also, it looks like you do not have CUDA installed properly, which will make it difficult to train quickly:

2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
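
A quick way to sanity-check this (assuming you are on the Flax/JAX training path this repo uses; a minimal sketch) is to ask JAX which devices it can see:

import jax

# With CUDA/cuDNN installed correctly, this should list GPU devices rather
# than falling back to something like [CpuDevice(id=0)].
print(jax.devices())
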
pankajkumar229 commented 2 years ago

Hi ncoop57, can you help me a little more? I am trying to load the data. I downloaded the dataset from the-eye.eu, but I am not able to pass it to training correctly. Please help.

pankaj@lc-tower1:~/source/gpt-code-clippy/training$ python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data
11/15/2021 11:34:58 - INFO - absl - Starting the local TPU driver.
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
11/15/2021 11:34:58 - WARNING - absl - No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
11/15/2021 11:34:58 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/data/opengpt/output/runs/Nov15_11-34-58_lc-tower1,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/data/opengpt/output/,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=output,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/data/opengpt/output/,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 811, in load_dataset
    module_path, hash, resolved_file_path = prepare_module(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 365, in prepare_module
    raise FileNotFoundError(
FileNotFoundError: Couldn't find file locally at /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py. Please provide a valid dataset name
ncoop57 commented 2 years ago

Do you have this file stored here?

/data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py

If not, I believe this file is the same one: https://github.com/CodedotAl/gpt-code-clippy/blob/camera-ready/data_processing/code_clippy_filter.py. If you copy that one and put it at the above path, I believe you should be good to go.
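
For example, fetching it with the standard library (the raw URL is inferred from the blob link above, so double-check the branch and path before running):

import os
import urllib.request

# Raw-file URL inferred from the GitHub blob link above; verify before use.
url = ("https://raw.githubusercontent.com/CodedotAl/gpt-code-clippy/"
       "camera-ready/data_processing/code_clippy_filter.py")
# Destination is where the datasets loader expects the script, per the error.
dest = ("/data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/"
        "code_clippy_dedup_data/code_clippy_dedup_data.py")
os.makedirs(os.path.dirname(dest), exist_ok=True)
urllib.request.urlretrieve(url, dest)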

pankajkumar229 commented 2 years ago

I tried the other way. Is it possible that the Hugging Face method is not working anymore because the data page's format has changed? I will try the download method now.

Error details. Command used:

python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name CodedotAI/code_clippy

Output

eDatetime);\n\n\t\t\tvar getUrlParameter = function getUrlParameter(sParam) {\n\t\t\t\tvar sPageURL = window.location.search.substring(1),\n\t\t\t\tsURLVariables = sPageURL.split(\'&\'),\n\t\t\t\tsParameterName,\n\t\t\t\ti;\n\n\t\t\t\tfor (i = 0; i < sURLVariables.length; i++) {\n\t\t\t\t\tsParameterName = sURLVariables[i].split(\'=\');\n\n\t\t\t\t\tif (sParameterName[0] === sParam) {\n\t\t\t\t\t\treturn sParameterName[1] === undefined ? true : decodeURIComponent(sParameterName[1]);\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t};\n\t\t\tfunction toggle(className){\n\t\t\t\tvar order = getUrlParameter(\'order\');\n\t\t\t\tvar elements = document.getElementsByClassName(className);\n\t\t\t\tfor(var i = 0, length = elements.length; i < length; i++) {\n\t\t\t\t\tvar currHref = elements[i].href;\n\t\t\t\t\tif(order==\'desc\'){ \n\t\t\t\t\t\tvar chg = currHref.replace(\'desc\', \'asc\');\n\t\t\t\t\t\telements[i].href = chg;\n\t\t\t\t\t}\n\t\t\t\t\tif(order==\'asc\'){ \n\t\t\t\t\t\tvar chg = currHref.replace(\'asc\', \'desc\');\n\t\t\t\t\t\telements[i].href = chg;\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t};\n\t\t\tfunction readableFileSize(size) {\n\t\t\t\tvar units = [\'B\', \'KB\', \'MB\', \'GB\', \'TB\', \'PB\', \'EB\', \'ZB\', \'YB\'];\n\t\t\t\tvar i = 0;\n\t\t\t\twhile(size >= 1024) {\n\t\t\t\t\tsize /= 1024;\n\t\t\t\t\t++i;\n\t\t\t\t}\n\t\t\t\treturn parseFloat(size).toFixed(2) + \' \' + units[i];\n\t\t\t}\n\n\t\t\tfunction changeSize() {\n\t\t\t\tvar sizes = document.getElementsByTagName("size");\n\n\t\t\t\tfor (var i = 0; i < sizes.length; i++) {\n\t\t\t\t\thumanSize = readableFileSize(sizes[i].innerHTML);\n\t\t\t\t\tsizes[i].innerHTML = humanSize\n\t\t\t\t}\n\t\t\t}\n\t\t</script>\n\t</body>\n</html>\n'
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 827, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 687, in load_dataset_builder
    builder_cls = import_main_class(module_path, dataset=True)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 91, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d/code_clippy.py", line 68, in <module>
    url_elements = results.find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'
pankajkumar229 commented 2 years ago

If there is a specific command you use, could you share it? @ncoop57

pankajkumar229 commented 2 years ago

After giving myself some not-giving-up talk while watching the last season of TBBT, I am finally able to at least get the data download to start.

Here are the steps:

  1. Make sure to install requirements.txt properly, and also install any packages that warnings ask for during runtime.
  2. While running, it will download a file, code_clippy.py, into the home directory and run it. That file has a few issues that need to be fixed (line numbers refer to code_clippy.py; see the sketch after this list):
    • Line 62: needs another slash at the end of the URL.
    • Line 65: needs tbody instead of pre.
    • Lines 66, 69, 70, 71: need to skip the first element of the table.
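
Roughly, the patched part of code_clippy.py then looks like this (reconstructed from memory, so the variable names and surrounding code are approximate):

import requests
from bs4 import BeautifulSoup

# Line 62: the listing URL needs a trailing slash.
url = "https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# Line 65: the page now lists the files in a <tbody>, not a <pre>.
results = soup.find("tbody")
# Lines 66, 69-71: skip the first table row (the parent-directory link).
url_elements = results.find_all("a")[1:]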

Here is the command that I ran:

python3 run_clm_apps.py --output_dir /data/opengpt/output/ --cache_dir /data/opengpt/cache --dataset_name CodedotAI/code_clippy

reshinthadithyan commented 2 years ago

Hello Pankaj. Are you trying to fine-tune a model with the dataset? If so, my suggestion would be the following:

pankajkumar229 commented 2 years ago

Hi Reshinth

I am trying to run it on my new GPU and see how good it can get, if possible. I am new to using transformers. So you are suggesting to download the run_clm.py file, run it, and pass the code_clippy Python file as a parameter. Let me try that.
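
If I follow correctly, run_clm.py forwards --dataset_name to datasets.load_dataset, which also accepts a path to a local loading script, so it would amount to something like this (my sketch; the path is illustrative):

from datasets import load_dataset

# Point load_dataset at the locally saved loading script instead of a hub
# name; the path below is hypothetical.
dataset = load_dataset("/data/opengpt/code_clippy.py")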

pankajkumar229 commented 2 years ago

Hi @reshinthadithyan and @ncoop57, the data download often breaks with read timeouts. Is there a way to handle it?
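
For now I am considering a blunt retry wrapper around load_dataset, something like this (an untested sketch; datasets may wrap the underlying network errors in its own exception types):

import time

import requests
from datasets import load_dataset

def load_with_retries(name, retries=5, wait=30, **kwargs):
    # Retry transient network failures. Files that finished downloading stay
    # in the cache, so a retry does not start completely from scratch.
    for attempt in range(1, retries + 1):
        try:
            return load_dataset(name, **kwargs)
        except requests.exceptions.RequestException as err:
            print(f"attempt {attempt}/{retries} failed: {err}; retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"giving up on {name} after {retries} attempts")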

pankajkumar229 commented 2 years ago

I think I was able to download the data, but the command gives an error at the end, before loading the data (would 64 GB of memory suffice?):

/usr/bin/python3.8 /home/pankaj/tools/pycharm-community-2020.2/plugins/python-ce/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 45211 --file /home/pankaj/PycharmProjects/CodeClippy/CodeClippyMain.py
pydev debugger: process 3990648 is connecting
Connected to pydev debugger (build 202.6397.98)
Using the latest cached version of the module from /home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d (last modified on Sun Nov 21 18:46:34 2021) since it couldn't be found locally at ~/source/datasets/datasets/code_clippy/code_clippy.py/code_clippy.py or remotely (FileNotFoundError).
Using the latest cached version of the module from /home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d (last modified on Sun Nov 21 18:46:34 2021) since it couldn't be found locally at ~/source/datasets/datasets/code_clippy/code_clippy.py/code_clippy.py or remotely (FileNotFoundError).
No config specified, defaulting to: code_clippy/code_clippy_dedup_data
Downloading and preparing dataset code_clippy/code_clippy_dedup_data (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /data/opengpt/code_clippy/code_clippy_dedup_data/0.1.0/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d...
Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1075, in _prepare_split
    writer.write(example, key)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 347, in write
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 661, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1077, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 417, in finalize
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
python-BaseException
ncoop57 commented 2 years ago

Hey @pankajkumar229, that should be enough as long as you read it with streaming mode enabled; otherwise it will not work. However, the error you are showing seems to be a different one, and I'm unsure why it is happening. Could you share the CodeClippyMain.py file you are running?
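
Something along these lines (a sketch; check which split names the dataset actually exposes):

from datasets import load_dataset

# streaming=True iterates over examples as they are downloaded instead of
# materializing the whole dataset on disk/in RAM first.
dataset = load_dataset("CodedotAI/code_clippy", streaming=True, split="train")
for example in dataset:
    ...  # tokenize / train on the fly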