NTDXYG / ComFormer

Code and data for the paper "ComFormer: Code Comment Generation via Transformer and Fusion Method-based Hybrid Code Representation", accepted at DSA 2021

python train.py → ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers. #3

Closed Youngmi-Park closed 2 years ago

Youngmi-Park commented 2 years ago

Hi! I get an error when I run python train.py. How can I fix this?

$ python train.py

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
 10%|███▍                               | 43979/445813 [02:17<19:16, 347.51it/s]Traceback (most recent call last):
  File "train.py", line 73, in <module>
    model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 176, in train_model
    train_dataset = self.load_and_cache_examples(train_data, verbose=verbose)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 868, in load_and_cache_examples
    dataset = SimpleSummarizationDataset(encoder_tokenizer, self.args, data, mode)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in __init__
    preprocess_fn(d) for d in tqdm(data, disable=args.silent)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 425, in <listcomp>
    preprocess_fn(d) for d in tqdm(data, disable=args.silent)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/simpletransformers/seq2seq/seq2seq_utils.py", line 333, in preprocess_data_bart
    truncation=True,
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2651, in batch_encode_plus
    **kwargs,
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 731, in _batch_encode_plus
    first_ids = get_input_ids(ids)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 712, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Thanks!

NTDXYG commented 2 years ago

It may be a problem with your dataset; make sure every input in your dataset is a string. You could try str() to force the conversion, but I don't recommend that. I will also check again whether there is a problem with my code. If you can, please email me the dataset and I'll figure out what's wrong with it.

Youngmi-Park commented 2 years ago

Thank you for the prompt reply. I'm using a CSV dataset with two columns, 'input_text' (code) and 'target_text' (comments). It comes from the dataset-RQ1 provided by EMSE-DeepCom (https://github.com/xing-hu/EMSE-DeepCom). Please see the Google Drive link below; you can check the train, valid, and test files as well as the original files. https://drive.google.com/drive/folders/1y5AWGpmdN8KILsuncaKezpMUul1PSXMg?usp=sharing

I'm looking forward to your reply.

NTDXYG commented 2 years ago

Maybe you need to modify this code in train.py:

train_df = pd.read_csv('data/train.csv')
eval_df = pd.read_csv('data/valid.csv')
test_df = pd.read_csv('data/test.csv')

to

train_df = pd.read_csv('data/train.csv').dropna()
eval_df = pd.read_csv('data/valid.csv').dropna()
test_df = pd.read_csv('data/test.csv').dropna()

I downloaded the dataset and found that there is one NaN in train.csv. The code I used to check is below:

import pandas as pd

# Load train.csv without dropna() so the problematic row is still present.
df = pd.read_csv("train.csv")
input_text, target_text = df['input_text'].tolist(), df['target_text'].tolist()

# Print any entry that is not a string (the NaN row shows up here).
for i, text in enumerate(input_text):
    if not isinstance(text, str):
        print(i, text)

NTDXYG commented 2 years ago

I suggest you directly fine-tune my pre-trained model, which will significantly reduce your training time. If you get an OOM error, you can freeze some of the model's parameters by adding the following code at line 116 in bart_model.py.

# Layers whose parameters will be frozen (requires_grad = False) to save memory.
freeze_layers = ['layers.0', 'layers.1', 'layers.2', 'layers.3', 'layers.4', 'layers.5', 'layers.6',
                 'layers.7', 'layers.8']

for name, param in self.model.named_parameters():
    for ele in freeze_layers:
        if ele in name:
            param.requires_grad = False
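
If it helps, a small sanity check (hypothetical, not part of the repo) can be run right after this loop to confirm how many parameters remain trainable:

# Hypothetical check (not in bart_model.py): count trainable vs. frozen parameters
# after the freezing loop above.
trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in self.model.parameters() if not p.requires_grad)
print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")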

Youngmi-Park commented 2 years ago

Thanks for your help! It can create the features now, but another error occurs 😥

$ python train.py

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|██████████████████████████████████| 445782/445782 [20:00<00:00, 371.47it/s]
/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
INFO:bart_model: Training started
Epoch 1 of 30:   0%|                                     | 0/30 [00:00<?, ?it/sINFO:bart_model:Saving model into result/checkpoint-200082 [14:53<49:54:35,  2.47
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/
100%|████████████████████████████████████| 19999/19999 [00:49<00:00, 401.96it/s]
Epochs 0/30. Running Loss:    8.9371:   0%| | 1999/445782 [19:07<70:46:25,  1.74
Epoch 1 of 30:   0%|                                     | 0/30 [19:07<?, ?it/s]
Traceback (most recent call last):
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'input_text_a'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 77, in <module>
    model.train_model(train_df, eval_data=eval_df, Rouge=getListRouge)
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 186, in train_model
    **kwargs,
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 493, in train
    **kwargs,
  File "/home/gpuadmin/home/ComFormer/bart_model.py", line 697, in eval_model
    to_predict_a = eval_data["input_text_a"].tolist()
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/gpuadmin/anaconda3/envs/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'input_text_a'
NTDXYG commented 2 years ago

It's fixed; just re-clone bart_model.py.

NTDXYG commented 2 years ago

You need to modify model_args in train.py first... I forgot to mention this tip in the README...
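
For reference, a minimal sketch of the kind of settings usually adjusted there (the exact keys and defaults in the repo's train.py may differ; the values below are assumptions, not the repo's settings):

# Hypothetical example of a simpletransformers-style args dict; adjust to match train.py.
model_args = {
    "output_dir": "result/",          # where checkpoints are saved (assumed path)
    "num_train_epochs": 30,
    "train_batch_size": 8,            # lower this if you run into OOM
    "eval_batch_size": 8,
    "max_seq_length": 256,            # source (code) length
    "max_length": 64,                 # target (comment) length
    "evaluate_during_training": True,
    "overwrite_output_dir": True,
}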

Youngmi-Park commented 2 years ago

It works now! Thanks :)