natolambert opened 6 months ago
Inference works when distributed, but I couldn't get the results gathering to work correctly with something like:

```python
state.wait_for_everyone()
# flatten results list of lists, if it is a list of lists
if state.is_main_process:
    logger.info("Gathering results")
    results = gather_object(results)  # gather() is for tensors
```
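One likely reason the snippet above misbehaves: `gather_object` is a collective operation, so every process must call it, not only the main one; only the *use* of the gathered list should be rank-specific. The flattening step itself is plain Python and can be sketched without a distributed runtime (the `gathered` value below is a hypothetical stand-in for `gather_object`'s output):

```python
# Hypothetical output of gather_object: one sub-list of results per rank.
gathered = [["r0_a", "r0_b"], ["r1_a"], ["r2_a", "r2_b"]]

# Flatten the list of lists only if it actually is one.
if gathered and isinstance(gathered[0], list):
    results = [item for rank_items in gathered for item in rank_items]
else:
    results = gathered

print(results)  # ['r0_a', 'r0_b', 'r1_a', 'r2_a', 'r2_b']
```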
I have a distributed inference script for my own use case here, which might help. The code assumes `ds` is a Hugging Face dataset with the following structure:

```
Dataset({
    features: ['chosen', 'rejected'],
    num_rows: 1000
})
```

where each of `chosen` and `rejected` is a list of messages in the standard chat format.
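For concreteness, here is one hypothetical row in that schema, with the "standard chat format" being a list of role/content message dicts (the question and answers are made up for illustration):

```python
# One hypothetical row of ds: both columns hold a conversation as a
# list of {"role": ..., "content": ...} message dicts.
sample = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "5"},
    ],
}
print(sample["chosen"][-1]["content"])  # 4
```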
The core part of the code is shown below:
```python
accelerator.wait_for_everyone()
with accelerator.split_between_processes(ds) as ds_shard:
    rewards = []
    for sample in tqdm(ds_shard):
        chosen_score, rejected_score = rm.get_score(sample["chosen"], sample["rejected"])
        rewards.append({"chosen_reward": chosen_score, "rejected_reward": rejected_score})
rewards_gathered = accelerate.utils.gather_object([{"rewards": rewards}])
if accelerator.is_main_process:
    all_rewards = [row for result in rewards_gathered for row in result["rewards"]]
```
The key steps are: 1) iterate through the `ds_shard` assigned to the current GPU, 2) gather the rewards outside the context manager (but on all processes), and 3) flatten all rewards only on the main process. The final `all_rewards` is a list of dicts like `{"chosen_reward": 18.35, "rejected_reward": -7.89}`, in the same order as the samples in `ds`.
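The three steps can be simulated in plain Python to see why the order is preserved. Here, list slicing and concatenation stand in for `split_between_processes` and `gather_object`; this is a sketch assuming even, contiguous sharding (the real `split_between_processes` may pad the last shard when the dataset doesn't divide evenly):

```python
# Toy dataset and a fake per-sample scorer standing in for rm.get_score.
ds = [{"chosen": f"c{i}", "rejected": f"r{i}"} for i in range(6)]
num_procs = 2

# 1) split_between_processes: a contiguous shard per rank.
shard_size = -(-len(ds) // num_procs)  # ceiling division
shards = [ds[r * shard_size:(r + 1) * shard_size] for r in range(num_procs)]

# Each rank scores only its own shard.
per_rank = []
for shard in shards:
    rewards = [
        {"chosen_reward": float(len(s["chosen"])), "rejected_reward": float(len(s["rejected"]))}
        for s in shard
    ]
    per_rank.append({"rewards": rewards})

# 2) gather_object: concatenates the per-rank payloads in rank order.
rewards_gathered = per_rank

# 3) Main process flattens; rank order + within-shard order == dataset order.
all_rewards = [row for result in rewards_gathered for row in result["rewards"]]
print(len(all_rewards))  # 6
```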
Currently `run_rm.py` only uses one RM, because multi-GPU inference is generally not well supported for RMs. The current implementation is a separate `run_rm_mpgu.py` script. We can delete it and improve the base script if more use cases emerge.

Closes #95