Closed leonardtang closed 2 years ago
Hi @leonardtang! Good on ya for getting into Harvard (sorry, couldn't resist). To answer your question: you chose your dataset to be nlp/multiple_choice/race, so that's what it's going to use. To use your own dataset, you need:
pl-transformers-train \
task=nlp/multiple_choice \
dataset.cfg.train_file=/data/leonardtang/MAUD/data/RACE_data.json \
dataset.cfg.validation_file=/data/leonardtang/MAUD/data/RACE_valid.json
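A quick way to confirm an override actually took effect is to count the examples in the files you passed in and compare that against the batch count the trainer reports. A minimal stdlib-only sketch (the helper name and the assumption that the file is either a top-level JSON array or JSON-lines are mine, not part of lightning-transformers):

```python
import json

def count_examples(path: str) -> int:
    """Count records in a dataset file that is either a top-level
    JSON array or JSON-lines (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        # top-level JSON array: length of the parsed list
        return len(json.loads(text))
    # otherwise treat as JSON-lines: one record per non-empty line
    return sum(1 for line in text.splitlines() if line.strip())

# e.g. count_examples("/data/leonardtang/MAUD/data/RACE_data.json")
# should report your toy dataset's size, not RACE's.
```

If the number printed here is tiny but the trainer still reports thousands of steps per epoch, the override was not applied.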
I checked the docs (where your code came from), and I think they need to be updated. :[
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Multiple Choice Dataset Files not Overwritten
Even after overriding the default dataset files for the multiple-choice (in particular, RACE) task, the training script still uses the original RACE dataset rather than my custom dataset files. As a dummy example, I'm just using the JSON file from the docs:
To Reproduce
Steps to reproduce the behavior:
pl-transformers-train \
task=nlp/multiple_choice \
dataset=nlp/multiple_choice/race \
dataset.cfg.train_file=/data/leonardtang/MAUD/data/RACE_data.json \
dataset.cfg.validation_file=/data/leonardtang/MAUD/data/RACE_valid.json
Resulting output:
Epoch 0: 0%| | 13/5798 [01:01<7:04:19, 4.40s/it, loss=1.39, train_loss=1.350]
As you can see, there are 5798 batches per epoch (the size of the original RACE dataset, not the 1-example toy dataset I am testing on).
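The arithmetic makes the mismatch obvious: with a 1-example train file, an epoch should be a single batch, so thousands of steps per epoch means the default dataset is still being loaded. A hedged sketch of that sanity check (the batch sizes are illustrative, not this run's actual config):

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Batches the trainer should report per epoch
    (assuming no gradient accumulation)."""
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch(1, 8))  # a 1-example toy set -> 1 step per epoch
```

Any reported step count far above this estimate is a strong signal the file override was silently ignored.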
Environment
Install method (conda, pip, source): pip