LLLiHaotian opened 3 months ago
You need to set `--train_group_size 1`.
I noticed that `train_group_size` is defined as the number of positives and negatives for a query during training. There is always one positive, so this argument controls the number of negatives (#negatives = train_group_size - 1). Note that the number of negatives should not be larger than the number of negatives available in the data's `"neg": List[str]`. Besides the negatives in this group, in-batch negatives are also used during fine-tuning. Since there is always exactly one positive, how should this parameter be set in the following situations?

1. query: 1, pos: 1, neg: 8
2. query: 1, pos: 9, neg: 0
3. query: 1, pos: 9, neg: 8
Sorry, there is one more situation: 4. query: 1, pos: 1
If none of the samples contain negatives, set `train_group_size=1`. If only some samples lack negatives, randomly select some negatives for them.
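For the second case, a minimal sketch of filling in random negatives, assuming the BGE-style training format with `"query"`/`"pos"`/`"neg"` fields (`fill_missing_negatives` is a hypothetical helper, not part of FlagEmbedding):

```python
import random

def fill_missing_negatives(records, num_neg=8, seed=42):
    """For records whose 'neg' list is empty, draw random passages from other
    records' positives to serve as negatives. A simple illustration only;
    mined hard negatives usually work better than random ones."""
    rng = random.Random(seed)
    # Pool of all positive passages across the dataset.
    pool = [p for r in records for p in r["pos"]]
    for r in records:
        if not r.get("neg"):
            # Exclude this record's own positives from the candidates.
            candidates = [p for p in pool if p not in r["pos"]]
            r["neg"] = rng.sample(candidates, min(num_neg, len(candidates)))
    return records
```

In practice the hard-negative mining script shipped with the repo is the better option; random negatives are just the quickest way to make every record satisfy `len(neg) >= train_group_size - 1`.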
How should I modify the code if I fine-tune with only positive examples and no negatives?
This is the error reported during fine-tuning when I set `neg` in the fine-tuning data to an empty list `[]`:
```
Traceback (most recent call last):
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jncbs/lht/paper_experiment/FlagEmbedding-master/FlagEmbedding/baai_general_embedding/finetune/run.py", line 111, in <module>
    main()
  File "/home/jncbs/lht/paper_experiment/FlagEmbedding-master/FlagEmbedding/baai_general_embedding/finetune/run.py", line 102, in main
    trainer.train()
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/accelerate/data_loader.py", line 384, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jncbs/lht/paper_experiment/FlagEmbedding-master/FlagEmbedding/baai_general_embedding/finetune/data.py", line 52, in __getitem__
    num = math.ceil((self.args.train_group_size - 1) / len(self.dataset[item]['neg']))
ZeroDivisionError: division by zero
```
```
  0%|          | 0/1912300 [00:00<?, ?it/s]
bigdata-serve2:1105324:1105890 [0] NCCL INFO [Service thread] Connection closed by localRank 0
bigdata-serve2:1105324:1105324 [0] NCCL INFO comm 0xf4f50c0 rank 0 nranks 2 cudaDev 0 busId 4b000 - Abort COMPLETE
bigdata-serve2:1105325:1105889 [1] NCCL INFO [Service thread] Connection closed by localRank 1
bigdata-serve2:1105325:1105325 [1] NCCL INFO comm 0xf6efc40 rank 1 nranks 2 cudaDev 1 busId b1000 - Abort COMPLETE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1105324) of binary: /home/jncbs/anaconda3/envs/lht/bin/python
Traceback (most recent call last):
  File "/home/jncbs/anaconda3/envs/lht/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jncbs/anaconda3/envs/lht/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
FlagEmbedding.baai_general_embedding.finetune.run FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-04_19:16:55
  host      : bigdata-serve2
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1105325)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-04_19:16:55
  host      : bigdata-serve2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1105324)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
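The ZeroDivisionError comes from the negative-sampling step in `finetune/data.py`. A minimal sketch of that logic, reconstructed from the line shown in the traceback (not the exact library code), shows why an empty `neg` list crashes and why `--train_group_size 1` avoids it:

```python
import math
import random

def sample_negatives(negs, train_group_size):
    """Sketch of the negative-sampling step that fails in data.py.

    Draw train_group_size - 1 negatives; when the pool is too small,
    repeat it math.ceil(...) times to cover the shortfall. With
    negs == [], len(negs) is 0 and the division raises ZeroDivisionError."""
    needed = train_group_size - 1
    if needed <= 0:
        return []  # --train_group_size 1: no hard negatives are sampled
    if len(negs) < needed:
        # Oversample: repeat the pool enough times to cover `needed`.
        num = math.ceil(needed / len(negs))  # ZeroDivisionError when negs == []
        return random.sample(negs * num, needed)
    return random.sample(negs, needed)
```

So either keep `train_group_size` at 1 (the division is never reached) or make sure every record has a non-empty `neg` list before training.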