Closed LONNIESAN closed 1 year ago
Looks like you are running out of GPU memory. The easiest answer is to use larger GPUs.
Yes, I am using an A100 GPU. It has 80 GB of memory and is very expensive. Is there any other way to solve this problem, such as optimizing the code?
I found that the error occurs in the following code:
def batch_act(
    self, observations: List[Dict[Union[str, Module], Message]]
) -> List[Message]:
    """
    Full batch_act pipeline.

    :param observations:
        batchsize-length list of observations from self.observe

    :return reply:
        return batchsize-length list of final replies.
    """
    # First, determine whether we're searching or accessing memory
    all_memory: List[Dict[str, int]] = [o['raw']['memories'] for o in observations]
    try:
        memory_to_set = self.get_available_memories(observations)
        self.agents[Module.MEMORY_KNOWLEDGE].set_memory(memory_to_set)
        available_memory = self.agents[Module.MEMORY_KNOWLEDGE].get_memory()
    except AttributeError:
        # Gold Docs
        available_memory = [[]] * len(observations)
        pass
    batch_reply_sdm, search_indices = self.batch_act_decision(
        observations, Module.SEARCH_DECISION, self.agents[Module.SEARCH_DECISION]
    )
    batch_reply_mdm, memory_indices = self.batch_act_decision(
        observations, Module.MEMORY_DECISION, self.agents[Module.MEMORY_DECISION]
    )
    memory_indices = [i for i in memory_indices if available_memory[i]]
    if self.contextual_knowledge_decision is Decision.ALWAYS:
        contextual_indices = list(range(len(observations)))
    elif self.contextual_knowledge_decision is Decision.NEVER:
        contextual_indices = []
    else:
        assert self.contextual_knowledge_decision is Decision.COMPUTE
        contextual_indices = [
            i
            for i in list(range(len(observations)))
            if i not in memory_indices + search_indices
        ]
    # Second, generate search queries and new memories
    batch_reply_sgm = self.batch_act_sgm(
        observations, search_indices, self.agents[Module.SEARCH_QUERY]
    )
    batch_reply_mgm_partner = self.batch_act_mgm(
        observations=observations, agent=self.agents[Module.MEMORY_GENERATOR]
    )
    # Third, generate the knowledge sentences
    batch_reply_knowledge = self.batch_act_knowledge(
        observations,
        search_indices,
        memory_indices,
        contextual_indices,
        {m: self.agents[m] for m in Module if m.is_knowledge()},
    )
    # Fourth, generate the dialogue response!
    if self.knowledge_conditioning == 'combined':
        batch_reply_dialogue = self.batch_act_dialogue_combined(
            observations, batch_reply_knowledge
        )
    elif self.knowledge_conditioning == 'separate':
        batch_reply_dialogue = self.batch_act_dialogue_separate(
            observations,
            batch_reply_knowledge,
            search_indices,
            memory_indices,
            contextual_indices,
        )
    else:
        assert self.knowledge_conditioning == 'both'
        reply_combined = self.batch_act_dialogue_combined(
            observations, batch_reply_knowledge
        )
        self.reset(clones_only=True)
        reply_separate = self.batch_act_dialogue_separate(
            observations,
            batch_reply_knowledge,
            search_indices,
            memory_indices,
            contextual_indices,
        )
        batch_reply_dialogue = []
        for r_c, r_s in zip(reply_combined, reply_separate):
            reply = r_c
            reply_score = reply['beam_texts'][0][-1]
            max_seperate_score = r_s['max_score']
            if max_seperate_score > reply_score:
                reply.force_set('text', r_s['text'])
            batch_reply_dialogue.append(reply)
    # Fifth, generate new memories
    batch_reply_mgm_self = self.batch_act_mgm(
        self_messages=batch_reply_dialogue,
        agent=self.agents[Module.MEMORY_GENERATOR],
    )
    # Sixth, combine them all in the srm batch reply.
    final_batch_reply = self.collate_batch_acts(
        batch_reply_sdm,
        batch_reply_mdm,
        batch_reply_sgm,
        batch_reply_mgm_self,
        batch_reply_mgm_partner,
        batch_reply_knowledge,
        batch_reply_dialogue,
        all_memory,
    )
    return final_batch_reply
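The out-of-memory traceback in this thread passes through the "Third" step above (`batch_act_knowledge`, for `Module.MEMORY_KNOWLEDGE`) and ends in `F.softmax` inside the transformer encoder's attention, with a single 11.25 GiB allocation. A back-of-the-envelope sketch of how an attention-weight tensor can reach that size is below. The function name and the example values (90 encoded sequences, 32 heads, 1024 tokens, 4-byte floats) are assumptions chosen for illustration; the thread does not state the actual model configuration or how many memories were being encoded.

```python
def attention_weights_gib(n_sequences, n_heads, seq_len, bytes_per_element=4):
    """Size in GiB of one self-attention weight tensor.

    Assumes the weights have shape (n_sequences * n_heads, seq_len, seq_len),
    i.e. every encoded sequence gets a full seq_len x seq_len matrix per head.
    """
    n_elements = n_sequences * n_heads * seq_len * seq_len
    return n_elements * bytes_per_element / 2**30


# Hypothetical values, not taken from the thread.
full = attention_weights_gib(n_sequences=90, n_heads=32, seq_len=1024)

# Halving the sequence length cuts the tensor to a quarter (quadratic term).
half_length = attention_weights_gib(n_sequences=90, n_heads=32, seq_len=512)

# Encoding a third as many sequences cuts it to a third (linear term).
fewer_sequences = attention_weights_gib(n_sequences=30, n_heads=32, seq_len=1024)

print(full, half_length, fewer_sequences)
```

If something like this is what is happening, encoding fewer memories or documents at once, or truncating each one, shrinks the allocation without needing a larger GPU. Which ParlAI options control those limits is not covered here.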
@klshuster I get an error:
2023-08-30 21:19:36.374 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is
    world.parley()
  File "/root/ParlAI/parlai/tasks/interactive/worlds.py", line 89, in parley
    acts[1] = agents[1].act()
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1454, in act
    self.self_observe(response)
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1473, in self_observe
    self.self_observe_memory(self_message)
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1525, in self_observe_memory
    self.memories = self.memory_utils.update_memory_usage(
  File "/root/ParlAI/projects/bb3/agents/utils.py", line 468, in update_memory_usage
What does this mean?
Sorry, I am very confused here. Is this part of the first problem? Was that one solved? How did you use the agent to arrive at this error? What was the context?
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
How can I avoid this situation?
The error is shown below:
2023-07-28 23:14:34.331 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is Traceback (most recent call last):
  File "/root/anaconda3/bin/parlai", line 33, in <module>
    sys.exit(load_entry_point('parlai', 'console_scripts', 'parlai')())
  File "/root/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/root/ParlAI/parlai/core/script.py", line 325, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "/root/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt
    return script.run()
  File "/root/ParlAI/parlai/scripts/interactive.py", line 118, in run
    return interactive(self.opt)
  File "/root/ParlAI/parlai/scripts/interactive.py", line 93, in interactive
    world.parley()
  File "/root/ParlAI/parlai/tasks/interactive/worlds.py", line 89, in parley
    acts[1] = agents[1].act()
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1512, in act
    response = self.batch_act([self.observations])[0]
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1445, in batch_act
    batch_reply_knowledge = self.batch_act_knowledge(
  File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1003, in batch_act_knowledge
    batch_reply_mkm = batch_agents[Module.MEMORY_KNOWLEDGE].batch_act(mkm_obs)
  File "/root/ParlAI/parlai/core/torch_agent.py", line 2253, in batch_act
    output = self.eval_step(batch)
  File "/root/ParlAI/projects/seeker/agents/seeker.py", line 160, in eval_step
    output = TorchGeneratorAgent.eval_step(self, batch)
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 951, in eval_step
    beam_preds_scores, beams = self._generate(
  File "/root/ParlAI/parlai/agents/rag/rag.py", line 684, in _generate
    gen_outs = self._rag_generate(batch, beam_size, max_ts, prefix_tokens)
  File "/root/ParlAI/parlai/agents/rag/rag.py", line 727, in _rag_generate
    return self._generation_agent._generate(
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 1237, in _generate
    encoder_states = model.encoder(*self._encoder_input(batch))
  File "/root/ParlAI/projects/seeker/agents/seeker_modules.py", line 244, in encoder
    output = super().encoder(
  File "/root/ParlAI/parlai/agents/fid/fid.py", line 149, in encoder
    enc_out, mask, input_turns_cnt, top_docs, top_doc_scores = super().encoder(
  File "/root/ParlAI/parlai/agents/rag/modules.py", line 200, in encoder
    tensor, mask = self.seq2seq_encoder(
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 363, in forward
    tensor = self.forward_layers(tensor, mask)
  File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 300, in forward_layers
    tensor = self.layers[i](tensor, mask)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 171, in _checkpointed_forward
    return original_forward(module, *args, **kwargs)
  File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 89, in forward
    attended_tensor = self.attention(tensor, mask=mask)[0]
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/ParlAI/parlai/agents/transformer/modules/attention.py", line 251, in forward
    attn_weights = F.softmax(
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1845, in softmax
    ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.25 GiB (GPU 0; 79.21 GiB total capacity; 12.62 GiB already allocated; 8.26 GiB free; 13.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
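The numbers in the final error line can be checked with a little arithmetic. The sketch below parses the message quoted above and derives two figures: how much reserved memory is idle inside PyTorch's allocator, and how much of the GPU is in use outside this process's PyTorch allocator. The helper name `parse_oom` is hypothetical, and the regular expressions assume every figure is reported in GiB, as it is in this message.

```python
import re

# The error text exactly as quoted in the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 11.25 GiB (GPU 0; 79.21 GiB total "
    "capacity; 12.62 GiB already allocated; 8.26 GiB free; 13.15 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from a PyTorch CUDA OOM message (GiB units assumed)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(p, message).group(1)) for key, p in patterns.items()}


stats = parse_oom(OOM_MESSAGE)

# Reserved but unused memory inside PyTorch's caching allocator.
allocator_slack = round(stats["reserved"] - stats["allocated"], 2)

# Memory in use on the GPU that this process's PyTorch allocator does not hold.
used_elsewhere = round(stats["total"] - stats["free"] - stats["reserved"], 2)

print(f"allocator slack: {allocator_slack} GiB")
print(f"used outside this allocator: {used_elsewhere} GiB")
```

Two things stand out under these assumptions. Reserved memory (13.15 GiB) is barely above allocated memory (12.62 GiB), so the `max_split_size_mb` hint about fragmentation is unlikely to help here. And roughly 57.8 GiB of the card is occupied by something other than this process's PyTorch allocator, possibly other processes sharing the GPU, which `nvidia-smi` would show.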