[BUG/ERROR] Parse Error when Tokenizing Training Dataset

Describe the bug

I'm having trouble training the masked language model distilbert/distilbert-base-cased. Training fails with this error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 1034, saw 7
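For context, this message comes from pandas' CSV tokenizer. Below is a minimal sketch that produces the same kind of error on made-up data. It assumes the training text is being parsed as comma-separated data, which has not been confirmed against NLPScholar's loading code. The sample text and the `try_parse` helper are hypothetical.

```python
import io

import pandas as pd


def try_parse(text: str) -> str:
    """Parse text as comma-separated data; return 'ok' or the parser error message."""
    try:
        pd.read_csv(io.StringIO(text), header=None)
        return "ok"
    except pd.errors.ParserError as err:
        return str(err)


# Hypothetical data: the first line fixes the column count at 1,
# so a later line containing six commas (7 fields) cannot be parsed.
sample = (
    "a review without commas\n"
    "another plain line\n"
    "well, this, one, has, several, commas, inside\n"
)
message = try_parse(sample)
print(message)
```

If the same thing is happening here, line 1034 of one of the .txt files would contain six commas while the first line contains none.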

To Reproduce

Steps to reproduce the behavior:

Download all relevant files and create two directories, one called 'hw2_configs' and the other 'hw2_files'.
Move the files into their respective folders: .txt files go into hw2_files, .yaml files go into hw2_configs. The .txt files are:

neg_train_train.txt
neg_train_val.txt
pos_train_train.txt
pos_train_val.txt

It seems that I can't upload .yaml files, so I'll just paste the contents here.

For training the positive model (titled train_pos.yaml):

    exp: TextClassification
    mode:
      - train
    models:
      hf_masked_model:
        - distilbert/distilbert-base-cased
    trainfpath: hw2_files/pos_train_train.txt
    validfpath: hw2_files/pos_train_val.txt
    modelfpath: imdb_pos_model
    epochs: 1

For training the negative model (titled train_neg.yaml):

    exp: TextClassification
    mode:
      - train
    models:
      hf_masked_model:
        - distilbert/distilbert-base-cased
    trainfpath: hw2_files/neg_train_train.txt
    validfpath: hw2_files/neg_train_val.txt
    modelfpath: imdb_neg_model
    epochs: 1

Run conda activate nlp, then run each config file:

python main.py hw2_configs/train_neg.yaml
python main.py hw2_configs/train_pos.yaml

See error

Expected behavior

Each line of the training data should be read as a single field, and training should run to completion.

Observed behavior

The run fails while generating the test split. The error says the data could not be tokenized because a line contained more fields than expected (7 instead of 1).
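To narrow down which lines are involved, here is a rough sketch of a check that lists lines splitting into more than one field on commas. It uses only the standard library, and `find_wide_lines` is a hypothetical helper, not part of NLPScholar. It treats each line on its own, so it only approximates what a real CSV parser does with quoted fields that span several lines.

```python
import csv


def find_wide_lines(lines, expected_fields=1, delimiter=","):
    """Return (line_number, field_count) for lines with more than expected_fields fields."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Parse one line at a time so quoted commas are not counted as separators.
        fields = next(csv.reader([line], delimiter=delimiter), [])
        if len(fields) > expected_fields:
            hits.append((number, len(fields)))
    return hits


# Hypothetical sample; for the real data, pass an open file instead, e.g.
#   with open("hw2_files/neg_train_train.txt", encoding="utf-8") as handle:
#       print(find_wide_lines(handle))
sample = ["plain review text\n", "good, but long\n", "no commas here\n"]
print(find_wide_lines(sample))  # [(2, 2)]
```

If line 1034 shows up in the output for one of the training files with a field count of 7, that would be consistent with the file being read as comma-separated data.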

Screenshots

Here are the screenshots with errors on both config files:

Setup (please complete the following information)

OS: MacOS
