Closed: yechoi7 closed this issue 1 year ago.
Hi @yechoi7,
Thank you for raising this issue! Indeed, this is a typo in the code. I have just pushed up a fix (f7ceb4ecf7e78b5ac0e9da478322bc10511c8ca9). This issue shouldn't impact any of the results in the paper. Sorry, I thought I had corrected this when cleaning up the code.
Hi @ncmeade,
I tried running the code with the fix, but I encountered an error that was not present with the previous code. Additionally, could you confirm whether the text-grouping step is skipped in the fixed code?
Thanks!
The error messages are as follows:
0%| | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/document/bias-bench/experiments/run_clm.py", line 791, in <module>
main()
File "/home/document/bias-bench/experiments/run_clm.py", line 730, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
return inner_training_loop(
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/trainer.py", line 1765, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/data/data_collator.py", line 70, in default_data_collator
return torch_default_data_collator(features)
File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/data/data_collator.py", line 136, in torch_default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 398 at dim 1 (got 449)
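For context on why this ValueError appears: the default data collator calls torch.tensor over a list of token sequences, which only works if every sequence has the same length, and that is exactly what the text-grouping step guarantees. Below is a hypothetical pure-Python sketch (naive_collate is not part of transformers) of the length check that fails here:

```python
def naive_collate(features, key="input_ids"):
    # Sketch of why torch_default_data_collator fails on ungrouped data:
    # stacking [f[key] for f in features] into one tensor requires every
    # sequence to have the same length, which only holds after grouping.
    lengths = {len(f[key]) for f in features}
    if len(lengths) > 1:
        raise ValueError(
            f"expected sequences of equal length, got lengths {sorted(lengths)}"
        )
    return [f[key] for f in features]
```

Mixing, say, a 398-token and a 449-token example reproduces the same failure mode as the traceback above.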
I think this is because the grouped dataset is assigned to "lm_datasets", but the subsequent steps still operate on "tokenized_datasets". The code runs for me after renaming the grouped dataset to "tokenized_datasets".
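To illustrate the grouping step in question, here is a minimal sketch (hypothetical names, tiny block_size for illustration) of the chunking that run_clm.py-style scripts perform; whichever variable this result is assigned to (lm_datasets vs. tokenized_datasets) must be the one passed on to training:

```python
def group_texts(examples, block_size=4):
    # Sketch of the usual causal-LM grouping step: concatenate all tokenized
    # sequences, drop the remainder, then split into fixed-size blocks so
    # every example has equal length and the default collator can stack them.
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [
        concatenated[i : i + block_size] for i in range(0, total_length, block_size)
    ]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}
```

For example, two sequences of lengths 3 and 5 concatenate to 8 tokens and yield two blocks of 4, each with identical labels for language modeling.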
Hi @Ewanwong, thank you for the comment. Unfortunately, it didn't work for me. I cleared the cached data and renamed the grouped dataset from 'lm_datasets' to 'tokenized_datasets'. Now I get this error:
Traceback (most recent call last):
File "/home/user/bias-bench/experiments/run_clm.py", line 791, in <module>
main()
File "/home/user/bias-bench/experiments/run_clm.py", line 690, in main
tokenized_datasets = tokenized_datasets.map(
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/dataset_dict.py", line 852, in map
{
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/dataset_dict.py", line 853, in <dictcomp>
k: dataset.map(
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3046, in map
for rank, done, content in iflatmap_unordered(
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
[async_result.get() for async_result in async_results]
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
[async_result.get() for async_result in async_results]
File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
pyarrow.lib.ArrowInvalid: Column 2 named labels expected length 1620 but got length 1000
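For context on this ArrowInvalid: Arrow tables require every column to have the same number of rows, and a batched map() whose function changes the row count (as grouping does) while the untouched input columns are carried over violates that invariant; this is why the grouping map is typically called with remove_columns=... . A hypothetical pure-Python sketch of the invariant (check_columns is illustrative, not Arrow's actual code):

```python
def check_columns(columns):
    # Sketch of Arrow's table invariant: all columns must share one row
    # count. The first column sets the expected length; any column that
    # differs triggers an error like the one in the traceback above.
    items = list(columns.items())
    expected = len(items[0][1])
    for name, col in items[1:]:
        if len(col) != expected:
            raise ValueError(
                f"Column named {name} expected length {expected} but got length {len(col)}"
            )
    return columns
```

Dropping the original columns before the grouped output is written (or aligning row counts) avoids the mismatch.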
Hi @yechoi7, it worked for me when I was running the dropout models, but now I get the same error as yours when training the CDA gpt2 model.
Hi both, very sorry for the delay. These issues should be corrected now in the latest code I pushed (e2b901150c5eb893e2b2f4398d52fe4f22c4e351). Let me know if you're still having issues.
Closing this for now. Feel free to reopen this issue again in the future.
Hello! I have been reading your paper with great interest.
I have a question about the train_dataset parameter in run_clm.py.
After applying gender augmentation, the dataset is saved under the name 'tokenized_dataset'.
However, in the training arguments, train_dataset is given as 'lm_dataset'.
I don't see how the augmented data is included in the training process. What am I missing here?
https://github.com/McGill-NLP/bias-bench/blob/b71fa277cf1c416f058dbc20e836d4808a3300b5/experiments/run_clm.py#LL690C1-L703C85