facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License
10.49k stars 2.1k forks

allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF #5067

Closed LONNIESAN closed 1 year ago

LONNIESAN commented 1 year ago

How can I avoid this situation?

The error is shown below:

2023-07-28 23:14:34.331 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is Traceback (most recent call last): File "/root/anaconda3/bin/parlai", line 33, in sys.exit(load_entry_point('parlai', 'console_scripts', 'parlai')()) File "/root/ParlAI/parlai/main.py", line 14, in main superscript_main() File "/root/ParlAI/parlai/core/script.py", line 325, in superscript_main

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser) File "/root/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt return script.run() File "/root/ParlAI/parlai/scripts/interactive.py", line 118, in run

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return interactive(self.opt) File "/root/ParlAI/parlai/scripts/interactive.py", line 93, in interactive

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is world.parley() File "/root/ParlAI/parlai/tasks/interactive/worlds.py", line 89, in parley

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is acts[1] = agents[1].act() File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1512, in act

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is response = self.batch_act([self.observations])[0] File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1445, in batch_act

2023-07-28 23:14:34.332 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is batch_reply_knowledge = self.batch_act_knowledge( File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1003, in batch_act_knowledge

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is batch_reply_mkm = batch_agents[Module.MEMORY_KNOWLEDGE].batch_act(mkm_obs) File "/root/ParlAI/parlai/core/torch_agent.py", line 2253, in batch_act

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is output = self.eval_step(batch) File "/root/ParlAI/projects/seeker/agents/seeker.py", line 160, in eval_step

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is output = TorchGeneratorAgent.eval_step(self, batch) File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 951, in eval_step

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is beam_preds_scores, beams = self._generate( File "/root/ParlAI/parlai/agents/rag/rag.py", line 684, in _generate

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is gen_outs = self._rag_generate(batch, beam_size, max_ts, prefix_tokens) File "/root/ParlAI/parlai/agents/rag/rag.py", line 727, in _rag_generate

2023-07-28 23:14:34.333 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return self._generation_agent._generate( File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 1237, in _generate

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is encoder_states = model.encoder(*self._encoder_input(batch)) File "/root/ParlAI/projects/seeker/agents/seeker_modules.py", line 244, in encoder

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is output = super().encoder( File "/root/ParlAI/parlai/agents/fid/fid.py", line 149, in encoder

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is enc_out, mask, input_turns_cnt, top_docs, top_doc_scores = super().encoder( File "/root/ParlAI/parlai/agents/rag/modules.py", line 200, in encoder

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is tensor, mask = self.seq2seq_encoder( File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return forward_call(*args, **kwargs) File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 363, in forward

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is tensor = self.forward_layers(tensor, mask) File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 300, in forward_layers

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is tensor = self.layers[i](tensor, mask) File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return forward_call(*args, **kwargs) File "/root/anaconda3/lib/python3.8/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 171, in _checkpointed_forward

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return original_forward(module, *args, **kwargs) File "/root/ParlAI/parlai/agents/transformer/modules/encoder.py", line 89, in forward

2023-07-28 23:14:34.334 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is attended_tensor = self.attention(tensor, mask=mask)[0] File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

2023-07-28 23:14:34.335 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is return forward_call(*args, **kwargs) File "/root/ParlAI/parlai/agents/transformer/modules/attention.py", line 251, in forward

2023-07-28 23:14:34.335 INFO 3684554 --- [pool-1-thread-5] c.a.d.service.impl.VirtualServiceImpl : strRead is attn_weights = F.softmax( File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1845, in softmax ret = input.softmax(dim, dtype=dtype) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.25 GiB (GPU 0; 79.21 GiB total capacity; 12.62 GiB already allocated; 8.26 GiB free; 13.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

mojtaba-komeili commented 1 year ago

Looks like you are running out of GPU memory. The easiest answer is to use larger GPUs.
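
To make the diagnosis concrete, here is a minimal sketch that reads the figures out of the error message quoted above. `parse_oom` is a hypothetical helper written for this illustration, not part of ParlAI or PyTorch, and the interpretation in the comments is an assumption rather than a confirmed root cause.

```python
import re

# Final part of the error message quoted in this issue.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 11.25 GiB "
    "(GPU 0; 79.21 GiB total capacity; 12.62 GiB already allocated; "
    "8.26 GiB free; 13.15 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from a PyTorch CUDA OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# Memory held by PyTorch's caching allocator but not used by live tensors.
# Fragmentation is only a likely culprit when this is large.
cached_unused = stats["reserved"] - stats["allocated"]

# Memory that is neither free nor reserved by this process's allocator.
# Assumption: this is mostly other processes or CUDA contexts on the same GPU.
held_elsewhere = stats["total"] - stats["reserved"] - stats["free"]

print(f"requested:      {stats['requested']:.2f} GiB")
print(f"free on device: {stats['free']:.2f} GiB")
print(f"cached, unused: {cached_unused:.2f} GiB")
print(f"held elsewhere: {held_elsewhere:.2f} GiB")
```

With these figures, reserved minus allocated is about 0.53 GiB, so "reserved >> allocated" does not hold and `max_split_size_mb` seems unlikely to help. About 57.80 GiB of the card is neither free nor reserved by this process, which may mean other processes are sharing the GPU; `nvidia-smi` would confirm that.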

LONNIESAN commented 1 year ago

Yes, I am using an A100 GPU. It has 80 GB of memory and is very expensive. Is there any other way to solve this problem, such as optimizing the code?
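
One way to see which settings matter: the failing allocation in the traceback is the softmax over the attention weights, and that tensor grows with batch size, head count, and the product of query and key lengths. The sketch below estimates its size. The function name and the example dimensions are illustrative assumptions, not BB3's actual configuration, and 4 bytes per element assumes the softmax runs in float32.

```python
def attention_weights_gib(batch_size, n_heads, query_len, key_len, bytes_per_element=4):
    """Estimate the size in GiB of one attention-weight tensor.

    Assumes a tensor of shape (batch_size * n_heads, query_len, key_len),
    which is the usual layout for multi-head attention scores.
    """
    n_elements = batch_size * n_heads * query_len * key_len
    return n_elements * bytes_per_element / 2**30


# Illustrative values only (assumptions, not taken from the issue).
full = attention_weights_gib(batch_size=8, n_heads=32, query_len=1024, key_len=1024)
half = attention_weights_gib(batch_size=8, n_heads=32, query_len=512, key_len=512)

print(f"full-length input: {full:.2f} GiB")
print(f"half-length input: {half:.2f} GiB")
```

Halving the sequence length cuts this tensor to a quarter, so shorter inputs (or fewer retrieved documents, if each document adds to the effective batch) are worth trying before moving to a larger GPU.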

LONNIESAN commented 1 year ago

I found that the error is in the following code:

def batch_act(
    self, observations: List[Dict[Union[str, Module], Message]]
) -> List[Message]:
    """
    Full batch_act pipeline.

    :param observations:
        batchsize-length list of observations from self.observe

    :return reply:
        return batchsize-length list of final replies.
    """
    # First, determine whether we're searching or accessing memory
    all_memory: List[Dict[str, int]] = [o['raw']['memories'] for o in observations]
    try:
        memory_to_set = self.get_available_memories(observations)
        self.agents[Module.MEMORY_KNOWLEDGE].set_memory(memory_to_set)
        available_memory = self.agents[Module.MEMORY_KNOWLEDGE].get_memory()
    except AttributeError:
        # Gold Docs
        available_memory = [[]] * len(observations)
        pass
    batch_reply_sdm, search_indices = self.batch_act_decision(
        observations, Module.SEARCH_DECISION, self.agents[Module.SEARCH_DECISION]
    )
    batch_reply_mdm, memory_indices = self.batch_act_decision(
        observations, Module.MEMORY_DECISION, self.agents[Module.MEMORY_DECISION]
    )
    memory_indices = [i for i in memory_indices if available_memory[i]]
    if self.contextual_knowledge_decision is Decision.ALWAYS:
        contextual_indices = list(range(len(observations)))
    elif self.contextual_knowledge_decision is Decision.NEVER:
        contextual_indices = []
    else:
        assert self.contextual_knowledge_decision is Decision.COMPUTE
        contextual_indices = [
            i
            for i in list(range(len(observations)))
            if i not in memory_indices + search_indices
        ]

    # Second, generate search queries and new memories
    batch_reply_sgm = self.batch_act_sgm(
        observations, search_indices, self.agents[Module.SEARCH_QUERY]
    )
    batch_reply_mgm_partner = self.batch_act_mgm(
        observations=observations, agent=self.agents[Module.MEMORY_GENERATOR]
    )

    # Third, generate the knowledge sentences
    batch_reply_knowledge = self.batch_act_knowledge(
        observations,
        search_indices,
        memory_indices,
        contextual_indices,
        {m: self.agents[m] for m in Module if m.is_knowledge()},
    )

    # Fourth, generate the dialogue response!
    if self.knowledge_conditioning == 'combined':
        batch_reply_dialogue = self.batch_act_dialogue_combined(
            observations, batch_reply_knowledge
        )
    elif self.knowledge_conditioning == 'separate':
        batch_reply_dialogue = self.batch_act_dialogue_separate(
            observations,
            batch_reply_knowledge,
            search_indices,
            memory_indices,
            contextual_indices,
        )
    else:
        assert self.knowledge_conditioning == 'both'
        reply_combined = self.batch_act_dialogue_combined(
            observations, batch_reply_knowledge
        )
        self.reset(clones_only=True)
        reply_separate = self.batch_act_dialogue_separate(
            observations,
            batch_reply_knowledge,
            search_indices,
            memory_indices,
            contextual_indices,
        )
        batch_reply_dialogue = []
        for r_c, r_s in zip(reply_combined, reply_separate):
            reply = r_c
            reply_score = reply['beam_texts'][0][-1]
            max_seperate_score = r_s['max_score']
            if max_seperate_score > reply_score:
                reply.force_set('text', r_s['text'])
            batch_reply_dialogue.append(reply)

    # Fifth, generate new memories
    batch_reply_mgm_self = self.batch_act_mgm(
        self_messages=batch_reply_dialogue,
        agent=self.agents[Module.MEMORY_GENERATOR],
    )

    # Sixth, combine them all in the srm batch reply.
    final_batch_reply = self.collate_batch_acts(
        batch_reply_sdm,
        batch_reply_mdm,
        batch_reply_sgm,
        batch_reply_mgm_self,
        batch_reply_mgm_partner,
        batch_reply_knowledge,
        batch_reply_dialogue,
        all_memory,
    )

    return final_batch_reply


LONNIESAN commented 1 year ago

@klshuster I get an error:

2023-08-30 21:19:36.374 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is world.parley() File "/root/ParlAI/parlai/tasks/interactive/worlds.py", line 89, in parley

2023-08-30 21:19:36.374 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is acts[1] = agents[1].act() File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1454, in act

2023-08-30 21:19:36.374 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is self.self_observe(response) File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1473, in self_observe

2023-08-30 21:19:36.374 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is self.self_observe_memory(self_message) File "/root/ParlAI/projects/bb3/agents/r2c2_bb3_agent.py", line 1525, in self_observe_memory

2023-08-30 21:19:36.375 INFO 267534 --- [pool-1-thread-6] c.a.d.service.impl.VirtualServiceImpl : strRead is self.memories = self.memory_utils.update_memory_usage( File "/root/ParlAI/projects/bb3/agents/utils.py", line 468, in update_memory_usage

What does this mean?

mojtaba-komeili commented 1 year ago

Sorry, I am very confused here. Is this part of the first problem? Was that one solved? How did you use the agent to arrive at this error? What was the context?

github-actions[bot] commented 1 year ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.