McGill-NLP / bias-bench

ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models.
https://mcgill-nlp.github.io/bias-bench/

Question about train_dataset parameter in run_clm.py #14

Closed · yechoi7 closed this 1 year ago

yechoi7 commented 1 year ago

Hello! I have been reading your paper with great interest. I have a question about the train_dataset argument in run_clm.py. After gender augmentation is applied, the dataset is saved under the name 'tokenized_datasets'; however, in the training arguments, train_dataset is given 'lm_datasets'. I don't see how the augmented data is included in the training process. What am I missing here?

https://github.com/McGill-NLP/bias-bench/blob/b71fa277cf1c416f058dbc20e836d4808a3300b5/experiments/run_clm.py#LL690C1-L703C85
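
For readers following along, here is a schematic of the dataset pipeline in a run_clm.py-style script, using the variable names from this thread. This is an illustrative paraphrase of where the two names can diverge, not the repository's exact code (model, training_args, raw_datasets, tokenize_function, and group_texts are assumed to be defined as in the script):

from transformers import Trainer

# Stage 1: tokenize the raw text (bias-bench applies the gender/CDA
# augmentation around this point).
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Stage 2: concatenate token sequences and re-chunk them into fixed-length
# blocks for causal LM training.
lm_datasets = tokenized_datasets.map(group_texts, batched=True)

# Stage 3: train. If stage 2 was never run on the augmented data, passing
# lm_datasets here would silently train on the un-augmented corpus, which is
# the mismatch described above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
)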

ncmeade commented 1 year ago

Hi @yechoi7,

Thank you for raising this issue! Indeed, this is a typo in the code. I have just pushed a fix (f7ceb4ecf7e78b5ac0e9da478322bc10511c8ca9). This issue shouldn't impact any of the results in the paper. Sorry, I thought I had corrected this when cleaning up the code.

yechoi7 commented 1 year ago

Hi @ncmeade,

I tried running the code with the fix, but I now encounter an error that was not present before. I would also like to know whether the text grouping step is skipped in the fixed code.

Thanks!

The error message is as follows:

 0%|          | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/document/bias-bench/experiments/run_clm.py", line 791, in <module>
    main()
  File "/home/document/bias-bench/experiments/run_clm.py", line 730, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/trainer.py", line 1765, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/data/data_collator.py", line 70, in default_data_collator
    return torch_default_data_collator(features)
  File "/home/document/anaconda/envs/bench/lib/python3.9/site-packages/transformers/data/data_collator.py", line 136, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 398 at dim 1 (got 449)
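
For context, the error itself comes from the stock collator: default_data_collator builds each batch by calling torch.tensor on per-example feature lists, which requires every example in a batch to have the same length. The grouping step exists precisely to guarantee that, since it chops everything into equal block_size chunks. A standalone reproduction of the failure (not repo code):

import torch

# Two examples with different sequence lengths, mirroring the lengths in the
# traceback above.
features = [
    {"input_ids": [0] * 398},
    {"input_ids": [0] * 449},
]

# This is effectively what torch_default_data_collator does per key; it fails
# because the nested lists are ragged:
batch = torch.tensor([f["input_ids"] for f in features])
# ValueError: expected sequence of length 398 at dim 1 (got 449)
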
Ewanwong commented 1 year ago

I think this is because the grouped dataset is assigned to "lm_datasets", while the subsequent processing is still performed on "tokenized_datasets". I was able to run the code after changing it to "tokenized_datasets".

https://github.com/McGill-NLP/bias-bench/blob/b71fa277cf1c416f058dbc20e836d4808a3300b5/experiments/run_clm.py#LL526C1-L533C10
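
For reference, the grouping step in the upstream Hugging Face run_clm.py example looks roughly like the following (paraphrased from the example script; block_size is the target sequence length, typically the model's context size):

def group_texts(examples):
    # Concatenate all token sequences in the batch into one long list per key.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

Whichever variable name survives the fix, the dataset handed to the Trainer must be the output of this step; otherwise the ragged-length ValueError above is the result.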

yechoi7 commented 1 year ago

Hi @Ewanwong, thank you for the comment. Unfortunately, it didn't work for me. I cleared the cached data and changed the name of the grouped dataset from 'lm_datasets' to 'tokenized_datasets'. Now I get this error:

Traceback (most recent call last):
  File "/home/user/bias-bench/experiments/run_clm.py", line 791, in <module>
    main()
  File "/home/user/bias-bench/experiments/run_clm.py", line 690, in main
    tokenized_datasets = tokenized_datasets.map(
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/dataset_dict.py", line 852, in map
    {
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/dataset_dict.py", line 853, in <dictcomp>
    k: dataset.map(
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3046, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in iflatmap_unordered
    [async_result.get() for async_result in async_results]
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1373, in <listcomp>
    [async_result.get() for async_result in async_results]
  File "/home/user/anaconda/envs/bench/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
pyarrow.lib.ArrowInvalid: Column 2 named labels expected length 1620 but got length 1000
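
This ArrowInvalid is characteristic of a batched map whose function changes the number of rows: grouping turned 1000 tokenized examples into 1620 fixed-length blocks, while a labels column already present in the input, which the function's output did not replace, was carried through at its old length of 1000. Two things worth trying, assuming standard datasets.map semantics (both parameters are part of the documented datasets API):

# Force the map to re-run instead of reusing a possibly stale cache file:
grouped = tokenized_datasets.map(
    group_texts,
    batched=True,
    load_from_cache_file=False,
)

# Or drop the leftover input column so that only what group_texts returns
# survives (columns the function re-emits, such as input_ids, are kept):
grouped = tokenized_datasets.map(
    group_texts,
    batched=True,
    remove_columns=["labels"],
)
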
Ewanwong commented 1 year ago

Hi @yechoi7, it worked for me when I was running the dropout models, but now I get the same error as yours when training the CDA GPT-2 model.

ncmeade commented 1 year ago

Hi both, very sorry for the delay. These issues should now be corrected in the latest code I pushed (e2b901150c5eb893e2b2f4398d52fe4f22c4e351). Let me know if you're still having issues.

ncmeade commented 1 year ago

Closing this for now. Feel free to reopen this issue again in the future.