Closed hanshuo-shuo closed 8 months ago
Hi @hanshuo-shuo, thanks for reporting this. Regarding this:
I also have a thing to discuss about this part and I have no idea why:
To run your code on GPU device, I first try
pip install -e .
But when I try to run the task, they keep telling me that
But when I change the
torch==2.0.*
into
torch==2.1.*
and reinstall your sheeprl, then the error won't happen. I'm quite curious because I was bothered by this problem for a day, and I wonder if this is a typo or a problem with my GPU device only?
I have to ask you to open another issue, so that we can check that separately from the one you've mentioned in this issue.
Regarding this instead:
I'm using dreamer-v3 and it works pretty well at first. But when I do the following thing to the env:
- Increase the reward scale
- Increase the sequence length of the agent
- Increase the difficulty of the env
This kind of error happens more frequently. I wonder if that is the nature of dreamer-v3 or a bug in PyTorch itself?
Could you give us more detail on the env, the sheeprl version you're using, and whether you're training with 16-bit precision? Thanks
@belerico Thanks for your quick reply.
For my env, I'm using a custom env built with gymnasium:
class Environment(Env):
    metadata = {"render_modes": ["human", "rgb_array"]}

    def __init__(self,
                 e: int = 3,
                 freq: int = 100,
                 has_predator = True,
                 real_time: bool = False,
                 prey_agent: Agent = None,
                 max_step: int = 300,
                 predator_speed: float = 0.5,
                 env_type: str = "train",
                 env_random: bool = False,
                 penalty: int = -1,
                 reward: int = 1,
                 render_mode = None,
                 action_noise: bool = False):
        if env_type == "train":
            world_name = "%02i_%02i" % (random.randint(0, 10), e)
        elif env_type == "test":
            world_name = "%02i_%02i" % (random.randint(11, 19), e)
        self.freq = freq
        self.penalty = penalty
        self.reward = reward
        self.real_time = real_time
        self.prey_agent = prey_agent
        self.env_type = env_type
        self.env_random = env_random
        self.action_noise = action_noise
        self.e = e
        self.world = World.get_from_parameters_names("hexagonal", "canonical", world_name)
        self.model = Model(pworld=self.world, freq=self.freq, real_time=self.real_time)
        self.goal_location = Location(1, .5)
        self.start_location = Location(0, .5)
        self.observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.action_space = spaces.Discrete(100)
        self.has_predator = has_predator
        self.max_step = max_step
        self.current_step = 0
        self.episode_reward_history = []
        self.current_episode_reward = 0
        self.predator = None
        self.predator_speed = predator_speed
        self.goal_threshold = self.world.implementation.cell_transformation.size
        self.capture_threshold = self.world.implementation.cell_transformation.size
        self.goal_area = self.model.display.circle(location=self.goal_location,
                                                   color="g",
                                                   alpha=.5,
                                                   radius=self.goal_threshold)
I'm using sheeprl = 0.4.4. I did notice you updated dreamer-v3 afterward, but my previous checkpoints won't work because there is no bias part. And the dreamer-v3 performs well, so I keep using the older version.
And I'm training with:
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Thanks a lot, it's a long message~
It is strange: we have trained a lot of agents on different envs with different reward scales and floating point precision (16 and 32).
Can you please run the script with detect_anomaly from the autograd package, as detailed here?
Can you also enable error_if_nonfinite in every call of the fabric.clip_gradients method?
@belerico I'm sorry, I have quite limited experience doing this kind of detection. Could you say a bit more about how I can run the script with detect_anomaly and also error_if_nonfinite for the fabric.clip_gradients? Do I need to modify your original code, or is there an easier way to implement those?
I also noticed you changed the dreamer-v3 code during the past few weeks, I wonder if that will help solve my issue. I will try the newest version tomorrow.
Hi @hanshuo-shuo, sorry for the late response! So:
Could you say a bit more about how I can run the script with detect_anomaly
This is what you can try to do:
from torch import autograd

# Copy the rest of the file here, but modify the run() method like the following
@hydra.main(version_base="1.3", config_path="configs", config_name="config")
def run(cfg: DictConfig):
    """SheepRL zero-code command line utility."""
    print_config(cfg)
    cfg = dotdict(OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True))
    if cfg.checkpoint.resume_from:
        cfg = resume_from_checkpoint(cfg)
    check_configs(cfg)
    with autograd.detect_anomaly():
        run_algorithm(cfg)
The detect_anomaly context will run the forward pass with detection enabled, allowing the backward pass to print the traceback of the forward operation that created the failing backward function, so you can check if there's anything else that generates the nan in your training.
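To illustrate what detect_anomaly reports, here is a toy sketch that is independent of sheeprl. The function name and the sqrt-times-zero expression are made up for the example, chosen only because the forward value is finite while the backward pass computes 0/0.

```python
import torch
from torch import autograd


def backward_error_under_anomaly_detection():
    """Return the error message raised when a backward function yields nan."""
    x = torch.tensor([0.0], requires_grad=True)
    with autograd.detect_anomaly():
        # Forward value is finite (0 * 0), but the sqrt gradient at 0 is 0 / 0.
        y = torch.sqrt(x) * 0.0
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return None


message = backward_error_under_anomaly_detection()
print(message)
```

Without the context manager the same backward call finishes silently and leaves a nan in x.grad, which is the situation the suggestion above is meant to expose.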
error_if_nonfinite for the fabric.clip_gradients?
You have to look in the dreamer_v3.py file for the calls to the fabric.clip_gradients method and change the keyword argument error_if_nonfinite from False to True.
I suggest running with detect_anomaly first, without changing error_if_nonfinite, and checking whether something goes wrong; then check for something wrong in the gradients.
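As a rough illustration of what the error_if_nonfinite flag changes, the sketch below uses torch.nn.utils.clip_grad_norm_ directly rather than fabric.clip_gradients; that substitution, the tiny Linear layer, and the hand-written nan gradients are all assumptions made to keep the example self-contained.

```python
import torch
from torch import nn


def clip_with_nan_gradients():
    """Clip gradients that contain nan, with and without error_if_nonfinite."""
    layer = nn.Linear(2, 1)
    # Simulate a broken backward pass by writing nan into every gradient.
    for param in layer.parameters():
        param.grad = torch.full_like(param, float("nan"))

    # With the flag off, clipping silently returns a nan total norm.
    silent_norm = torch.nn.utils.clip_grad_norm_(
        layer.parameters(), max_norm=1.0, error_if_nonfinite=False
    )

    # With the flag on, the same call raises instead of continuing.
    try:
        torch.nn.utils.clip_grad_norm_(
            layer.parameters(), max_norm=1.0, error_if_nonfinite=True
        )
        raised = False
    except RuntimeError:
        raised = True
    return silent_norm, raised


silent_norm, raised = clip_with_nan_gradients()
print(silent_norm, raised)
```

Turning the flag on makes training stop at the first step whose gradients are non-finite, instead of letting the nan reach the optimizer.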
It would also be helpful if you could share some plots for the gradients and the losses that you can get from TensorBoard by running tensorboard --logdir logs/runs/dreamer_v3
I also noticed you changed the dreamer-v3 code during the past few weeks, I wonder if that will help solve my issue. I will try the newest version tomorrow.
There was a change that disables the bias of all linear layers followed by a LayerNorm, but we have also trained a lot of models without that change and gotten the model to converge, at least on Crafter, Atari and DMC. Have you tried with that change?
Hi @belerico, no worries, I appreciate this and you can reply to me whenever it is convenient for you. Here are some plots from my training: (Now, when the error happens, I just resume it from the checkpoint)
I also checked these before and thought they looked normal, so it is weird to me. I will do the detect_anomaly as soon as possible.
Hi @hanshuo-shuo, sorry for my late response! Those plots seem normal to me.
Are you training in 32 or 16 bit precision? You can check this from the fabric.precision inside the fabric config you're using:
# Content of sheeprl/configs/fabric/default.yaml
_target_: lightning.fabric.Fabric
devices: 1
num_nodes: 1
strategy: "auto"
accelerator: "cpu"
precision: "32-true" # <-- what are you using here?
callbacks:
- _target_: sheeprl.utils.callback.CheckpointCallback
I'm wondering if it is a problem of the GRU model exploding/vanishing gradients... :thinking:
EDIT: What sequence length and batch size are you using?
@belerico @hanshuo-shuo I know this issue is a little stale but I don't see any fix for this yet so I will put my findings here since I have been encountering what I think is the same issue.
For me, the nans pop up because the actor network parameters are nan. I believe the root issue is the continues are nan in the train function -> policy gradient being nan -> actor parameters being nan -> the error seen in the first post. The issue starts here with the following code:
continues = Independent(
    Bernoulli(logits=world_model.continue_model(imagined_trajectories), validate_args=validate_args),
    1,
    validate_args=validate_args,
).mode
When the logits from the continue model are equal to 0, the probabilities in the torch Bernoulli distribution are of course 0.5. This causes nans when the mode property of the Bernoulli distribution is accessed, since it specifically sets any mode values to nan where probs are 0.5, see here:
@property
def mode(self):
    mode = (self.probs >= 0.5).to(self.probs)
    mode[self.probs == 0.5] = nan
    return mode
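The behaviour described above can be reproduced in isolation with a small sketch; the logit values are arbitrary and stand in for the output of the continue model.

```python
import torch
from torch.distributions import Bernoulli, Independent

# One positive, one exactly-zero, and one negative logit.
logits = torch.tensor([[2.0], [0.0], [-2.0]])

# Same wrapping as in the snippet quoted above, minus the world model.
mode = Independent(Bernoulli(logits=logits), 1).mode
print(mode)
```

Only the entry whose logit is exactly 0 (probability exactly 0.5) comes back as nan; the other two entries are 1 and 0.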
The error would occur on approximately half of my runs. I fixed it by checking if the logits of the Bernoulli distributions were between -eps and eps and then changing them to be just above eps, but you could also check for nans in the continues and replace them there as well, I believe. Or subclass the Bernoulli distribution and override the mode function so that it does not set any values to nan.
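A minimal sketch of the first two workarounds described above, under assumptions: the helper name nudge_logits_away_from_zero, the eps value of 1e-4, and the choice to replace nan with 1.0 are all hypothetical and not taken from the actual fix.

```python
import torch
from torch.distributions import Bernoulli


def nudge_logits_away_from_zero(logits, eps=1e-4):
    """Move logits inside [-eps, eps] to 2 * eps so probs are never exactly 0.5."""
    near_zero = logits.abs() <= eps
    return torch.where(near_zero, torch.full_like(logits, 2 * eps), logits)


logits = torch.tensor([3.0, 0.0, -3.0])

# Workaround 1: adjust the logits before building the distribution.
safe_mode = Bernoulli(logits=nudge_logits_away_from_zero(logits)).mode

# Workaround 2: build the distribution as-is and patch the nan afterwards.
patched_mode = torch.nan_to_num(Bernoulli(logits=logits).mode, nan=1.0)

print(safe_mode, patched_mode)
```

If eps is chosen too small, the sigmoid of the nudged logit can still round to exactly 0.5 in float32, so the value should be checked against the precision in use.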
Hi @saurinej, thank you very much for reporting this! How does the training go after your fix? A quick fix, as you suggested, could be to add a small eps where the logits are 0.
What I'm asking is: why is this happening in the first place?
It seems that TensorFlow computes the mode in a different way:
def _mode(self):
    """Returns `1` if `prob > 0.5` and `0` otherwise."""
    return tf.cast(self._probs_parameter_no_checks() > 0.5, self.dtype)
So, as suggested by @saurinej, I would prefer this one. What do you think @michele-milesi @DavideTr8?
Hi there.
Yeah, I prefer the TF implementation.
@belerico What do you have in mind? Create a custom Bernoulli distribution that inherits from the torch.distributions.Bernoulli class and overrides the mode property?
Yeah, exactly!
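One way the idea could look is sketched below. The class name BernoulliSafeMode is hypothetical, and this is not necessarily what the branch implements; it only mirrors the TensorFlow rule quoted above (1 if prob > 0.5, otherwise 0).

```python
import torch
from torch.distributions import Bernoulli, Independent


class BernoulliSafeMode(Bernoulli):
    """Bernoulli whose mode never returns nan (hypothetical name)."""

    @property
    def mode(self):
        # Strict comparison: a probability of exactly 0.5 maps to 0, not nan.
        return (self.probs > 0.5).to(self.probs)


logits = torch.tensor([[2.0], [0.0], [-2.0]])
continues = Independent(BernoulliSafeMode(logits=logits), 1).mode
print(continues)
```

Because Independent forwards mode to its base distribution, the subclass can be dropped into the existing Independent(...) wrapper without other changes.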
Could you pls try out this branch?