gabrielhuang opened this issue 2 years ago
Hey, this is expected, I think. Keep in mind that we have three T5 models loaded into memory: policy, value, and reference policy. So far, we have used 4 GPUs to run the summarization tasks. If you have just one GPU, try reducing n_envs to a lower value and also reduce the PPO batch size. Otherwise, I would suggest running with more GPUs if possible.
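If it helps to see where the memory goes, here is a rough standalone diagnostic (not RL4LMs code, and assuming t5-base; substitute whatever model your config uses):

import torch
import transformers

def allocated_gib() -> float:
    # Memory currently allocated by tensors on the default CUDA device.
    return torch.cuda.memory_allocated() / 2**30

model_name = "t5-base"  # substitute the model from your config
models = {}
for role in ("policy", "value", "reference"):
    models[role] = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda()
    print(f"after loading the {role} model: {allocated_gib():.2f} GiB allocated")
# Activations from rollout generation and PPO updates come on top of this,
# and scale with n_envs, batch size, and sequence length.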
I see, thanks. I didn't expect there to be three full models. Are there any advantages over plugging three heads onto one language-model trunk?
Also, I'm just curious: does the n_envs parameter scale up the batch size? Why does it influence GPU memory?
Many thanks.
Yes, people usually seem to just use different heads.
@gabrielhuang Sure, shared layers would be parameter-efficient; I am not sure how much of a performance change that would bring.
Regarding n_envs: it controls the batch_size for generating rollouts. You can think of it as batched generation.
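To make that concrete, here is a conceptual sketch (not the actual RL4LMs rollout code) of how generation is batched over the environments, which is why activation memory grows with n_envs:

import torch
import transformers

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = transformers.AutoTokenizer.from_pretrained("t5-base")
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("t5-base").to(device)

n_envs = 4  # one prompt per "environment"; lowering this lowers peak memory
prompts = ["summarize: some long article text ..."] * n_envs
batch = tok(prompts, return_tensors="pt", padding=True).to(device)
rollouts = model.generate(**batch, max_new_tokens=64)  # batch dim == n_envs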
I'm trying to get google/flan-t5-xxl to run on a single A100 80GB GPU, with a seq2seq policy.
Is there already a way to set the precision to bfloat16? (I don't see one, but just to be sure.) If not, I'll write a policy for that.
I will also try to add model sharing between the policy and the value models, and to allow freezing parts of the model.
Enabling offloading of a model from GPU memory to CPU memory when it's not in use would likely be helpful too (rough sketch below).
@gabrielhuang, have you started doing work like this? (I'm also at Mila.)
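For the offloading idea, roughly what I have in mind is something like this hypothetical helper (on_gpu is my own name, not an RL4LMs API), keeping e.g. the reference model in CPU memory except while it is actually scoring rollouts:

import contextlib
import torch

@contextlib.contextmanager
def on_gpu(model, device="cuda"):
    # Move the model to the GPU only for the duration of the block,
    # then park it back in CPU memory to free VRAM for the other models.
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")
        torch.cuda.empty_cache()

# Hypothetical usage: the reference model lives on the CPU and is only
# brought onto the GPU when computing reference log-probs.
# with on_gpu(ref_model) as m:
#     ref_logits = m(**batch).logits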
@JulesGM We don't have support for precision settings yet. You can implement this in a new policy, and we could possibly merge it into the existing classes later (by configuring some args).
This is my current approach, indeed: just allowing the user to pass kwargs through to from_pretrained and Linear. Passing torch_dtype to from_pretrained and dtype to Linear works.
I suppose adding AMP mixed-precision autocasting and gradient scaling would be a good / important idea too, though (rough sketch after the code below).
import copy
import sys

import torch
import transformers

sys.path.append("/home/mila/g/gagnonju/RL4LMs")
import rl4lms.envs.text_generation.registry as rl4lms_registry
import rl4lms.envs.text_generation.policy.seq2seq_policy as rl4lms_seq2seq_policy
from rl4lms.envs.text_generation import hf_generation_utils


class PrecisionControlSeq2SeqLMActorCriticPolicy(rl4lms_seq2seq_policy.Seq2SeqLMActorCriticPolicy):
    def __init__(
        self,
        *args,
        from_pretrained_kwargs,
        head_kwargs,
        **kwargs,
    ):
        # Extra kwargs forwarded to from_pretrained (e.g. torch_dtype) and to
        # the value head's Linear layer (e.g. dtype).
        self._from_pretrained_kwargs = from_pretrained_kwargs
        self._head_kwargs = head_kwargs
        super().__init__(*args, **kwargs)

    def _build_model_heads(self, model_name: str):
        self._policy_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self._policy_model.__class__ = hf_generation_utils.override_generation_routines(
            type(self._policy_model)
        )

        self._value_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(
            model_name, **self._from_pretrained_kwargs
        )
        self._ref_model = copy.deepcopy(self._policy_model).eval()

        self._value_head = torch.nn.Linear(
            self._value_model.config.hidden_size, 1, bias=False, **self._head_kwargs,
        ).to(self.device)

        # apply model parallel
        if torch.cuda.is_available():
            if self._apply_model_parallel and self._policy_model.is_parallelizable:
                self._policy_model.parallelize()
                self._ref_model.parallelize()
                self._value_model.parallelize()
                self._value_head = self._value_head.to(self.device)
            else:  # else defaults to data parallel
                self._policy_model = torch.nn.DataParallel(self._policy_model.to(self.device))
                self._ref_model = torch.nn.DataParallel(self._ref_model.to(self.device))
                self._value_model = torch.nn.DataParallel(self._value_model.to(self.device))
                self._value_head = torch.nn.DataParallel(self._value_head.to(self.device))


rl4lms_registry.PolicyRegistry.add(
    "precision_control_seq2seq_lm_actor_critic",
    PrecisionControlSeq2SeqLMActorCriticPolicy,
)
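On the mixed-precision point above: a minimal sketch of what AMP autocast plus gradient scaling could look like around a generic update step (compute_loss is a placeholder, and this is not the actual RL4LMs / stable-baselines3 training loop):

import torch

scaler = torch.cuda.amp.GradScaler()

def update_step(model, optimizer, batch, compute_loss):
    # compute_loss is a placeholder for the PPO policy/value loss computation.
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # forward pass runs in mixed precision
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, skips the step on inf/nan
    scaler.update()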
It looks like Stable Baselines 3 doesn't support bfloat16, because of all the a_tensor_name.cpu().numpy() calls. Doing that with a bfloat16 tensor raises an exception, because torch tries to build a NumPy array with the bfloat16 dtype, which NumPy does not support.
In order for Stable Baselines 3 (and then RL4LMs) to support bfloat16, it would suffice to change a_tensor_name.cpu().numpy() to a_tensor_name.cpu().float().numpy().
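For reference, a tiny reproduction of the failure and the workaround in plain PyTorch, independent of Stable Baselines 3:

import torch

t = torch.zeros(2, dtype=torch.bfloat16)
try:
    t.cpu().numpy()  # NumPy has no bfloat16 dtype, so this raises a TypeError
except TypeError as err:
    print("fails:", err)

print(t.cpu().float().numpy())  # upcasting to float32 first works: [0. 0.]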
Hey, thanks for your solution!
I do have one question: is there a reason you didn't pass the from_pretrained_kwargs to the initialization of _policy_model?
self._policy_model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name)
@JulesGM
@JulesGM Hey, so I tried what you suggested, passing **{"torch_dtype": torch.float16} to from_pretrained and **{"dtype": torch.float16} to Linear, and I got the following error:
Traceback (most recent call last):
File "/home/nlp/sloboda1/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/nlp/sloboda1/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/nlp/sloboda1/.vscode-server/extensions/ms-python.python-2022.20.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/main.py", line 39, in
I even tried passing **{"torch_dtype": torch.float16} to the from_pretrained of self._policy_model and still got that error.
Is there anything else you converted to FP16, by any chance?
Yes, I made a bunch of other changes in the end.
May I ask if there is complete code with those changes that I could learn from?
Hi there, I'm hitting OOM errors when running the summarization example on an 80GB A100 (CUDA 11.8).
I'm also getting some TensorFlow/TensorRT warnings; I'm wondering if it's related to that.
OOM error:
Any clues as to what the issue is? 80GB seems like a lot for just a T5-base model.