ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Seq2seq: failure when evaluating while training #1522

Open Futyn-Maker opened 1 year ago

Futyn-Maker commented 1 year ago

Describe the bug

During training of Seq2Seq-type models with evaluation during training enabled, a Pandas error (ValueError: All arrays must be of the same length) occurs at evaluation time in Google Colab (free plan).

To Reproduce

Steps to reproduce the behavior:

import torch
from simpletransformers.seq2seq import Seq2SeqModel

# train_df / eval_df are pandas DataFrames with input_text and target_text columns (defined elsewhere)

def main(args):
    model_args = {
        "do_lower_case": True,
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        # length of the longest target string (in characters)
        "max_seq_length": max(len(text) for text in train_df["target_text"].tolist()),
        "train_batch_size": 256,
        "num_train_epochs": 5,
        "save_eval_checkpoints": False,
        "save_model_every_epoch": False,
        "evaluate_during_training": True,
        "evaluate_during_training_verbose": True,
        "use_multiprocessing": False,
        "save_best_model": False,
        # length of the longest input string (in characters)
        "max_length": max(len(text) for text in train_df["input_text"].tolist()),
        "save_steps": -1,
    }
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name="facebook/bart-base",
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(
        train_df,
        eval_data=eval_df,
        matches=count_matches,
        accuracy=accuracy_score,
        f1=f1_score,
    )

Expected behavior

Training and evaluation complete without failures.

Screenshots

Not applicable.

Desktop (please complete the following information): Google Colab (free plan).

Here are the reduced logs:

2023-05-06 17:56:43.337391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading (…)lve/main/config.json: 100% 1.72k/1.72k [00:00<00:00, 8.86MB/s]
Downloading pytorch_model.bin: 100% 558M/558M [00:25<00:00, 21.6MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 1.29MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 875kB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 1.56MB/s]
100% 79032/79032 [00:21<00:00, 3710.01it/s]
Epoch 1 of 5:   0% 0/5 [00:00<?, ?it/s]
Running Epoch 0 of 5:   0% 0/309 [00:00<?, ?it/s]
Epochs 1/5. Running Loss:   10.1010:   0% 0/309 [00:03<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

Epochs 1/5. Running Loss:   10.1010:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   1% 2/309 [00:03<08:38,  1.69s/it]
...
Epochs 1/5. Running Loss:    0.0396: 100% 309/309 [02:17<00:00,  2.26it/s]
  0% 0/10011 [00:00<?, ?it/s]
  0% 1/10011 [00:31<86:57:41, 31.27s/it] (some strange deadlock here)
100% 10011/10011 [01:14<00:00, 135.21it/s]
Epoch 1 of 5:   0% 0/5 [03:59<?, ?it/s]
Traceback (most recent call last):
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 56, in <module>
    main(args)
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 45, in main
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 450, in train_model
    global_step, training_details = self.train(
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 1005, in train
    report = pd.DataFrame(training_progress_scores)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
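
For context, my reading of the traceback (not a confirmed diagnosis): train() collects metrics into the training_progress_scores dict and rebuilds a DataFrame from it after each evaluation, so if one evaluation pass fails or is cut short, the per-metric lists end up with different lengths and pandas raises exactly this error. A minimal illustration (the numbers are made up):

import pandas as pd

training_progress_scores = {
    "global_step": [309, 618],
    "train_loss": [0.0396, 0.0251],
    "matches": [812],  # one evaluation pass produced no entry here
}
pd.DataFrame(training_progress_scores)  # ValueError: All arrays must be of the same length
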
The-One-Who-Speaks-and-Depicts commented 1 year ago

I have the same problem; it reproduces almost everywhere under the same conditions, on Ubuntu and Windows 11.

Moustafa-Banbouk commented 1 year ago

Hey @The-One-Who-Speaks-and-Depicts and @Futyn-Maker, any luck in solving this issue? I am facing the same error during model training.

The-One-Who-Speaks-and-Depicts commented 1 year ago

@Moustafa-Banbouk I have been experiencing this for a year or so, with no idea why. I just switched evaluation during training off in the args and called it a day.

Futyn-Maker commented 1 year ago

I just switched evaluation during training off in the args and called it a day.

Same here for now. It didn't really interfere with the project I was working on at the time, but I consider this an extremely critical bug.

DamithDR commented 1 year ago

@The-One-Who-Speaks-and-Depicts @Futyn-Maker @Moustafa-Banbouk Can you try disabling multiprocessing using:

use_multiprocessing = False
use_multiprocessing_for_evaluation = False
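
A minimal sketch of where these flags would go, reusing the model_args dict from the original report (assuming the rest of the setup stays the same):

model_args = {
    # ... keep the other args from the original snippet ...
    "evaluate_during_training": True,
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
}
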
The-One-Who-Speaks-and-Depicts commented 1 year ago

@DamithDR At least in my case this works; I have created a PR.

@Futyn-Maker @Moustafa-Banbouk /fyi

DamithDR commented 1 year ago

@The-One-Who-Speaks-and-Depicts Glad that it helped :) About the PR: I think this issue only reproduces on servers that have multiple GPUs. The real issue is in the Seq2SeqDataset class, where it starts a pool of processes to build the sample list. A proper fix will have to look into that area.
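
For reference, roughly the pattern being described (a generic sketch, not the library's actual code): building the sample list with a multiprocessing pool, where a stalled or dying worker can make preprocessing hang or come back incomplete on some setups:

from multiprocessing import Pool

def preprocess_example(example):
    # tokenize/encode a single (input_text, target_text) pair
    return example

def build_samples(examples, n_procs):
    # With n_procs > 1, samples come back from worker processes;
    # this is the kind of step where the evaluation pass can stall.
    with Pool(n_procs) as pool:
        return pool.map(preprocess_example, examples)
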

The-One-Who-Speaks-and-Depicts commented 1 year ago

@DamithDR I had this issue on my laptop, and also on a server where I used only one GPU.