Describe the bug
I'm having trouble training the masked language model distilbert/distilbert_base_cased. The error message looks like this:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 1034, saw 7
To Reproduce
Steps to reproduce the behavior:
Download all relevant files and create two directories, one called 'hw2_configs' and the other 'hw2_files'.
It seems that I can't upload .yaml files, so I'll just paste the contents here:
For training the positive model (titled train_pos.yaml):
exp: TextClassification
mode:
models:
hf_masked_model:
trainfpath: hw2_files/pos_train_train.txt
validfpath: hw2_files/pos_train_val.txt
modelfpath: imdb_pos_model
epochs: 1
For training the negative model (titled train_neg.yaml):
exp: TextClassification
mode:
models:
hf_masked_model:
trainfpath: hw2_files/neg_train_train.txt
validfpath: hw2_files/neg_train_val.txt
modelfpath: imdb_neg_model
epochs: 1
python main.py hw2_configs/train_neg.yaml
python main.py hw2_configs/train_pos.yaml
Expected behavior
The data should be read as a single field per line, and training should run without errors.
Observed behavior
The run fails while generating the test split: pandas reports an error tokenizing the data because it expected fewer fields than it found (expected 1 field in line 1034, saw 7).
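For anyone trying to narrow this down, here is a minimal diagnostic sketch. It assumes the data files are plain text with one example per line and that the loader splits on commas (the pandas default); neither assumption is confirmed by the repository. The helper name find_wide_lines is hypothetical, and it uses Python's csv module rather than pandas' C parser, so quoting edge cases may differ slightly.

```python
import csv
import io


def find_wide_lines(lines, expected_fields=1, delimiter=","):
    """Return (line_number, field_count) for rows with more fields than expected.

    `lines` is any iterable of text lines, such as an open file object.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    wide = []
    for row in reader:
        if len(row) > expected_fields:
            # reader.line_num is the physical line where this row ended.
            wide.append((reader.line_num, len(row)))
    return wide


# Tiny in-memory stand-in for a data file: the third line contains six commas.
sample = io.StringIO(
    "plain first line\n"
    "second line\n"
    "third, with, six, commas, in, it, here\n"
)
print(find_wide_lines(sample))  # → [(3, 7)]

# Usage against one of the files named in the configs above:
# with open("hw2_files/pos_train_train.txt", newline="") as f:
#     print(find_wide_lines(f)[:10])
```

If this flags line 1034 with 7 fields, the likely cause is commas inside the text being treated as column delimiters.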
Screenshots
Here are the screenshots with errors on both config files:
Setup (please complete the following information)