forrestdavis / NLPScholar

Tools for training an NLP Scholar
GNU General Public License v3.0

[BUG/ERROR] Parse Error when Tokenizing Training Dataset #13

Open yimeng-blake opened 18 hours ago

yimeng-blake commented 18 hours ago

Describe the bug

I'm having trouble training the masked language model distilbert/distilbert_base_cased. Training fails with the following error: pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 1034, saw 7

To Reproduce

Steps to reproduce the behavior:

  1. Download all relevant files and create two directories, one called 'hw2_configs' and the other 'hw2_files'.
  2. Move the files into their respective folders: .txt files go into hw2_files and .yaml files go into hw2_configs. The .txt files are neg_train_train.txt, neg_train_val.txt, pos_train_train.txt, and pos_train_val.txt.

It seems that I can't upload .yaml files, so I'll just paste the contents here.

For training the positive model (titled train_pos.yaml):

exp: TextClassification
mode:
models:
  hf_masked_model:
trainfpath: hw2_files/pos_train_train.txt
validfpath: hw2_files/pos_train_val.txt
modelfpath: imdb_pos_model
epochs: 1

For training the negative model (titled train_neg.yaml):

exp: TextClassification
mode:
models:
  hf_masked_model:
trainfpath: hw2_files/neg_train_train.txt
validfpath: hw2_files/neg_train_val.txt
modelfpath: imdb_neg_model
epochs: 1

  3. Run conda activate nlp, then run the config files:

python main.py hw2_configs/train_neg.yaml
python main.py hw2_configs/train_pos.yaml

  4. See the error.

Expected behavior

The data should be read as a single field per line, and training should run without errors.

Observed behavior

The run fails while generating the test split, reporting an error tokenizing data: the parser expected fewer fields than it actually found.
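In case it helps with triage: the message suggests the .txt files are being parsed as comma-delimited data, so a review line containing commas would look like several fields. This is only a guess at the cause. The sketch below uses Python's standard csv module to approximate that splitting and list the lines that have more fields than the first line. The helper name find_ragged_lines and the sample text are made up for illustration, and pandas' own parser may differ in details such as quoting.

```python
import csv
import io


def find_ragged_lines(text, delimiter=","):
    """Return (line_number, field_count) for rows with more fields than the first row.

    Hypothetical helper: approximates how a comma-delimited parser would
    split each line; it is not the code path NLPScholar or pandas uses.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    expected = None
    ragged = []
    for row in reader:
        if not row:
            # Skip blank lines.
            continue
        if expected is None:
            # The first non-blank row sets the expected field count.
            expected = len(row)
            continue
        if len(row) > expected:
            ragged.append((reader.line_num, len(row)))
    return ragged


# Made-up sample: the third line contains two commas, so it splits into 3 fields.
sample = (
    "A short review with no commas\n"
    "Another plain line\n"
    "Great cast, sharp script, and a strong ending\n"
)
print(find_ragged_lines(sample))
```

To check a real file, one could pass in its contents, for example find_ragged_lines(open("hw2_files/pos_train_train.txt", encoding="utf-8").read()), and see whether line 1034 shows up in the result.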

Screenshots

Here are the screenshots with errors on both config files:

Screenshot 2024-10-24 at 18 30 26 Screenshot 2024-10-24 at 18 30 42

Setup (please complete the following information)