Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

Dreamer-v3's probs_2d sample issue and torch version? #147

Closed hanshuo-shuo closed 8 months ago

hanshuo-shuo commented 11 months ago

I'm using dreamer-v3 and it works pretty well at first. But when I do any of the following to the env:

  • Increase the reward scale
  • Increase the sequence length of the agent
  • Increase the difficulty of the env

this kind of error happens more frequently:

[screenshot of the error traceback]

I wonder if that is in the nature of dreamer-v3 or a bug in PyTorch itself?

belerico commented 11 months ago

Hi @hanshuo-shuo, thanks for reporting this. Regarding this:

I also have something to discuss about this part, and I have no idea why:

[screenshot]

To run your code on a GPU device, I first tried `pip install -e .`, but when I try to run the task I keep getting this error: [screenshot of the error]. When I change torch==2.0.* into torch==2.1.* and reinstall your sheeprl, the error no longer happens.

I'm quite curious, because this problem bothered me for a day, and I wonder whether this is a typo or a problem with my GPU device only?

I have to ask you to open another issue, so that we can track that separately from the one you've mentioned in this issue.

Regarding this instead:

I'm using dreamer-v3 and it works pretty well at first. But when I do any of the following to the env:

  • Increase the reward scale
  • Increase the sequence length of the agent
  • Increase the difficulty of the env

This kind of error happens more frequently. I wonder if that is in the nature of dreamer-v3 or a bug in PyTorch itself?

Could you give us more details on the env, the sheeprl version you're using, and whether you're training with 16-bit precision? Thanks

hanshuo-shuo commented 11 months ago

@belerico Thanks for your quick reply.

For my env, I'm using a custom env built with gymnasium:

import random

import numpy as np
from gymnasium import Env, spaces

# Note: World, Model, Location and Agent come from my own simulation package
# (their imports are omitted here)


class Environment(Env):
    metadata = {"render_modes": ["human", "rgb_array"]}
    def __init__(self,
                 e: int = 3,
                 freq: int = 100,
                 has_predator = True,
                 real_time: bool = False,
                 prey_agent: Agent = None,
                 max_step: int = 300,
                 predator_speed: float = 0.5,
                 env_type: str = "train",
                 env_random: bool = False,
                 penalty: int = -1,
                 reward: int = 1,
                 render_mode = None,
                 action_noise: bool = False):
        if env_type == "train":
            world_name = "%02i_%02i" % (random.randint(0, 10), e)
        elif env_type == "test":
            world_name = "%02i_%02i" % (random.randint(11, 19), e)
        self.freq = freq
        self.penalty = penalty
        self.reward = reward
        self.real_time = real_time
        self.prey_agent = prey_agent
        self.env_type = env_type
        self.env_random = env_random
        self.action_noise = action_noise
        self.e = e
        self.world = World.get_from_parameters_names("hexagonal", "canonical", world_name)
        self.model = Model(pworld=self.world, freq=self.freq, real_time=self.real_time)
        self.goal_location = Location(1, .5)
        self.start_location = Location(0, .5)
        self.observation_space = spaces.Box(-np.inf, np.inf, (14,), dtype=np.float32)
        self.action_space = spaces.Discrete(100)
        self.has_predator = has_predator
        self.max_step = max_step
        self.current_step = 0
        self.episode_reward_history = []
        self.current_episode_reward = 0
        self.predator = None
        self.predator_speed = predator_speed
        self.goal_threshold = self.world.implementation.cell_transformation.size
        self.capture_threshold = self.world.implementation.cell_transformation.size
        self.goal_area = self.model.display.circle(location=self.goal_location,
                                                   color="g",
                                                   alpha=.5,
                                                   radius=self.goal_threshold)

I'm using sheeprl 0.4.4. I did notice you updated dreamer-v3 afterwards, but my previous checkpoints won't load with the newer code because the bias parameters are missing, and dreamer-v3 performs well for me, so I've kept using the older version.

And during training I get this warning:

You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

Thanks a lot, it's a long message!

belerico commented 11 months ago

It is strange: we have trained a lot of agents on different envs with different reward scales and floating-point precisions (16 and 32). Can you please run the script with detect_anomaly from the autograd package, as detailed here? Can you also enable error_if_nonfinite in every call of the fabric.clip_gradients method?

hanshuo-shuo commented 11 months ago

@belerico I'm sorry, I have quite limited experience with this kind of debugging. Could you say a bit more about how I can run the script with detect_anomaly and enable error_if_nonfinite for fabric.clip_gradients? Do I need to modify your original code, or is there an easier way to do it?

I also noticed you changed the dreamer-v3 code over the past few weeks; I wonder if that will help solve my issue. I will try the newest version tomorrow.

belerico commented 10 months ago

Hi @hanshuo-shuo, sorry for the late response! So:

Could you say a bit more about how I can run the script with detect_anomaly

This is what you can try to do:

from torch import autograd

# Copy the rest of the file here, but modify the run() method like the following

@hydra.main(version_base="1.3", config_path="configs", config_name="config")
def run(cfg: DictConfig):
    """SheepRL zero-code command line utility."""
    print_config(cfg)
    cfg = dotdict(OmegaConf.to_container(cfg, resolve=True, throw_on_missing=True))
    if cfg.checkpoint.resume_from:
        cfg = resume_from_checkpoint(cfg)
    check_configs(cfg)
    with autograd.detect_anomaly():
        run_algorithm(cfg)

detect_anomaly runs the forward pass with detection enabled, so that when the backward pass fails it prints the traceback of the forward operation that created the failing backward function; this way you can check what exactly generates the NaNs in your training.
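Just to give you an idea of what it reports, here is a tiny standalone example (unrelated to sheeprl) that triggers it:

import torch
from torch import autograd

x = torch.zeros(1, requires_grad=True)
with autograd.detect_anomaly():
    y = x / x      # 0 / 0 produces a NaN in the forward pass
    y.backward()   # the anomaly check raises a RuntimeError whose traceback
                   # points back to the division above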

error_if_nonfinite for the fabric.clip_gradients?

You have to look in the dreamer_v3.py file for the fabric.clip_gradients calls and change the keyword argument error_if_nonfinite from False to True.
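The calls look roughly like this (a sketch, not the verbatim sheeprl code; the config key below is assumed and may differ in your version):

# one of the clipping calls in dreamer_v3.py; the only change is
# error_if_nonfinite=True, so that non-finite gradient norms raise immediately
# instead of being silently clipped
world_model_grads = fabric.clip_gradients(
    module=world_model,
    optimizer=world_optimizer,
    max_norm=cfg.algo.world_model.clip_gradients,  # assumed config key
    error_if_nonfinite=True,                       # was False
)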

I suggest running detect_anomaly first, without changing error_if_nonfinite, and checking whether something goes wrong there; then check whether something is wrong in the gradients.

It would also be helpful if you could share some plots of the gradients and the losses, which you can get from TensorBoard by running `tensorboard --logdir logs/runs/dreamer_v3`.

I also noticed you changed the dreamer-v3 code during the past few weeks, I wonder if that will help solve my issue. I will try the newest version tomorrow.

There was a change that disables the bias of all linear layers followed by a LayerNorm, but we have also trained a lot of models without that change and got them to converge, at least on Crafter, Atari and DMC. Have you tried with that change?

hanshuo-shuo commented 10 months ago

Hi @belerico, no worries, I appreciate this; you can reply whenever it is convenient for you. Here are some plots from my training (for now, when the error happens, I just resume from the checkpoint):

[screenshots: training plots]

I had also checked these before and they looked normal to me, which is why this is so weird. I will run detect_anomaly as soon as possible.

belerico commented 10 months ago

Hi @hanshuo-shuo, sorry for my late response! Those plots seem normal to me. Are you training in 32-bit or 16-bit precision? You can check this from fabric.precision inside the fabric config you're using:

# Content of sheeprl/configs/fabric/default.yaml

_target_: lightning.fabric.Fabric
devices: 1
num_nodes: 1
strategy: "auto"
accelerator: "cpu"
precision: "32-true" # <-- what are you using here?
callbacks:
  - _target_: sheeprl.utils.callback.CheckpointCallback

I'm wondering whether it is a problem of exploding/vanishing gradients in the GRU... :thinking:

EDIT: What sequence length and batch size are you using?

saurinej commented 9 months ago

@belerico @hanshuo-shuo I know this issue is a little stale, but I don't see a fix for it yet, so I will put my findings here, since I have been encountering what I think is the same issue.

For me, the NaNs pop up because the actor network parameters become NaN. I believe the root issue is that the continues are NaN in the train function -> the policy gradient becomes NaN -> the actor parameters become NaN -> the error seen in the first post. The issue starts with the following code:

    continues = Independent(
        Bernoulli(logits=world_model.continue_model(imagined_trajectories), validate_args=validate_args),
        1,
        validate_args=validate_args,
    ).mode

When the logits from the continue model are exactly 0, the probabilities of the torch Bernoulli distribution are of course 0.5. This causes NaNs when the mode of the Bernoulli distribution is taken, since the mode property explicitly sets the mode to NaN wherever probs == 0.5, see here:

    @property
    def mode(self):
        mode = (self.probs >= 0.5).to(self.probs)
        mode[self.probs == 0.5] = nan
        return mode

The error occurred on approximately half of my runs. I fixed it by checking whether the logits of the Bernoulli distribution were between -eps and eps and, if so, nudging them to just above eps; a sketch of this workaround is below. You could also check for NaNs in the continues and replace them there, I believe, or subclass the Bernoulli distribution and override the mode property so that it never sets any values to NaN.
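A minimal self-contained sketch of the clamping workaround (the eps value and the function name are just illustrative):

import torch
from torch.distributions import Bernoulli, Independent


def safe_continue_mode(logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Nudge logits that are (almost) exactly zero away from zero, so that
    # probs == 0.5 (and hence a NaN mode) can never occur.
    logits = torch.where(logits.abs() < eps, torch.full_like(logits, eps), logits)
    return Independent(Bernoulli(logits=logits), 1).mode


logits = torch.tensor([[0.0, 2.3, -1.7]])             # first logit is exactly zero
print(Independent(Bernoulli(logits=logits), 1).mode)  # tensor([[nan, 1., 0.]])
print(safe_continue_mode(logits))                     # tensor([[1., 1., 0.]])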

belerico commented 9 months ago

Hi @saurinej, thank you very much for reporting this! How does the training go after your fix? A quick fix, as you suggested, could be to add a small eps where the logits are 0. What I'm asking myself is why this is happening in the first place.

belerico commented 9 months ago

It seems that TensorFlow computes the mode in a different way:

  def _mode(self):
    """Returns `1` if `prob > 0.5` and `0` otherwise."""
    return tf.cast(self._probs_parameter_no_checks() > 0.5, self.dtype)

So, as suggested by @saurinej, I would prefer this one. What do you think, @michele-milesi @DavideTr8?

michele-milesi commented 9 months ago

Hi there. Yeah, I prefer the TF implementation. @belerico What do you have in mind? Create a custom Bernoulli distribution that inherits from the torch.distributions.Bernoulli class and overrides the mode property?

belerico commented 9 months ago

Yeah, exactly!
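For reference, a minimal sketch of what that could look like (the class name is just illustrative):

import torch
from torch.distributions import Bernoulli


class BernoulliSafeMode(Bernoulli):
    # Mode following the TensorFlow convention: 1 where p > 0.5, 0 otherwise,
    # so that probs == 0.5 yields 0 instead of NaN.
    @property
    def mode(self) -> torch.Tensor:
        return (self.probs > 0.5).to(self.probs)


print(Bernoulli(logits=torch.zeros(3)).mode)          # tensor([nan, nan, nan])
print(BernoulliSafeMode(logits=torch.zeros(3)).mode)  # tensor([0., 0., 0.])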

belerico commented 9 months ago

Could you please try out this branch?