Visualization part of the tutorial 'Using a pre-trained model' has error

Hustwireless commented 2 years ago

Problem

When running the following command from the tutorial:

PYTHONPATH=. python allenact/main.py \
    running_inference_tutorial \
    -o pretrained_model_ckpts/robothor-pointnav-rgb-resnet/ \
    -b projects/tutorials \
    -c pretrained_model_ckpts/robothor-pointnav-rgb-resnet/checkpoints/PointNavRobothorRGBPPO/2020-08-31_12-13-30/exp_PointNavRobothorRGBPPO__stage_00__steps_000039031200.pt \
    --eval

This error occurs:

[01/04 05:49:16 ERROR:] Encountered Exception. Terminating test worker 0 [engine.py: 1818]
[01/04 05:49:16 ERROR:] Traceback (most recent call last):
  File "/Documents/github/allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1793, in process_checkpoints
    update_secs=20 if self.mode == TEST_MODE_STR else 5 * 60,
  File "allenact/allenact/algorithms/onpolicy_sync/engine.py", line 1634, in run_eval
    rollouts, visualizer=visualizer, dist_wrapper_class=dist_wrapper_class
  File "allenact/allenact/algorithms/onpolicy_sync/engine.py", line 596, in collect_rollout_step
    actor_critic=actor_critic_output,
  File "allenact/allenact/utils/viz_utils.py", line 986, in collect
    self._collect_rollout(rollout, alive)
  File "allenact/allenact/utils/viz_utils.py", line 883, in _collect_rollout
    tuple(path)
KeyError: ('rnn',) [engine.py: 1821]

Desktop

OS: [Ubuntu 20.04]
AllenAct Version: [current HEAD of master]

Lucaweihs commented 2 years ago

Hi @Hustwireless,

Thanks for catching this! We've recently updated a few model architectures, which made the "rnn" tag out of date for this model. I've just pushed a commit that should fix this error; let me know if it works for you.

Hustwireless commented 2 years ago

Hi @Lucaweihs,

Thanks for the fix! It works now!

Since I've just started using this framework, many things are not quite clear to me yet. I'm curious why adding this rollout_source makes it work. Could you briefly explain what the tags in rollout_source do?

Much appreciated!

Lucaweihs commented 2 years ago

Hi @Hustwireless,

Sure thing! For reference, here's the relevant piece of code from the experiment configuration file:

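# Assumed imports (module paths as in AllenAct around the time of this issue):
from allenact.utils.viz_utils import (
    VizSuite,
    TrajectoryViz,
    AgentViewViz,
    ActorViz,
    TensorViz1D,
    TensorViz2D,
)
from allenact_plugins.robothor_plugin.robothor_viz import ThorViz

# Inside a method of the experiment configuration class:
self.viz = VizSuite(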
    episode_ids=self.viz_ep_ids,
    mode=mode,
    # Basic 2D trajectory visualizer (task output source):
    base_trajectory=TrajectoryViz(
        path_to_target_location=("task_info", "target",),
    ),
    # Egocentric view visualizer (vector task source):
    egeocentric=AgentViewViz(
        max_video_length=100, episode_ids=self.viz_video_ids
    ),
    # Default action probability visualizer (actor critic output source):
    action_probs=ActorViz(figsize=(3.25, 10), fontsize=18),
    # Default taken action logprob visualizer (rollout storage source):
    taken_action_logprobs=TensorViz1D(),
    # Same episode mask visualizer (rollout storage source):
    episode_mask=TensorViz1D(rollout_source=("masks",)),
    # Default recurrent memory visualizer (rollout storage source).
    # "single_belief" is the memory key of the current architecture (it used to be "rnn"):
    rnn_memory=TensorViz2D(rollout_source=("memory", "single_belief")),
    # Specialized 2D trajectory visualizer (task output source):
    thor_trajectory=ThorViz(
        figsize=(16, 8),
        viz_rows_cols=(448, 448),
        scenes=("FloorPlan_Train{}_{}", 1, 1, 1, 1),
    ),
)

This code instantiates a class that handles visualizing various metrics during training (in particular, saving these visualizations to a TensorBoard log). For instance, thor_trajectory=ThorViz(...) generates top-down visualizations of the agent's trajectory (see the visualizations with "trajectory" in their label at the bottom of the tutorial).
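As a rough mental model only, here is a hypothetical stand-in (MiniVizSuite and MiniViz are made-up names, not AllenAct's VizSuite API): a suite is a set of named visualizers, each told where in a per-step record to read its data, and every step is fed to all of them.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]  # hypothetical per-step record (masks, memory, ...)


@dataclass
class MiniViz:
    """Hypothetical visualizer: `source` says where in each step record to read its data."""
    source: Callable[[Step], Any]
    history: List[Any] = field(default_factory=list)

    def collect(self, step: Step) -> None:
        self.history.append(self.source(step))


class MiniVizSuite:
    """Hypothetical suite: holds named visualizers and feeds every step to each of them."""

    def __init__(self, **visualizers: MiniViz) -> None:
        self.visualizers = visualizers

    def collect(self, step: Step) -> None:
        for viz in self.visualizers.values():
            viz.collect(step)

    def summary(self) -> Dict[str, int]:
        # A real suite would render figures and write them to TensorBoard;
        # this sketch only reports how many steps each visualizer collected.
        return {name: len(viz.history) for name, viz in self.visualizers.items()}


suite = MiniVizSuite(
    episode_mask=MiniViz(source=lambda s: s["masks"]),
    rnn_memory=MiniViz(source=lambda s: s["memory"]["single_belief"]),
)
for t in range(3):
    suite.collect({"masks": 1.0, "memory": {"single_belief": [0.1 * t] * 4}})
print(suite.summary())  # {'episode_mask': 3, 'rnn_memory': 3}
```

Making the data location a parameter is what lets one visualizer class be reused for masks, memory, and so on, at the cost that a stale location only fails once data is actually collected.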

The piece of code that was causing the problem was the original rnn_memory=TensorViz2D() line, which is meant to (1) take the agent's hidden belief state (i.e., its representation of the environment, in this case the 512-dimensional output of the agent's GRU) at every step of an episode, (2) concatenate all of these hidden states into a T x 512 matrix (where T is the number of steps the agent took in the episode), and then (3) create a heatmap from this matrix. This gives you a sense of how the agent's hidden state changes during training (e.g., see the four heatmaps at the bottom of the tutorial labeled test/memory/rnn_group0).
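Steps (1) and (2), plus the min-max scaling a heatmap colormap typically applies, can be sketched with NumPy. This is an illustration under assumptions, not the TensorViz2D implementation: random vectors stand in for real GRU outputs, and the episode length of 37 is hypothetical. Drawing the result would be a single call such as matplotlib's imshow.

```python
import numpy as np

HIDDEN_SIZE = 512  # size of the GRU output described above


def stack_beliefs(per_step_beliefs):
    """Stack T per-step belief vectors, each of shape (HIDDEN_SIZE,), into a T x HIDDEN_SIZE matrix."""
    matrix = np.stack(per_step_beliefs, axis=0)
    if matrix.shape[1] != HIDDEN_SIZE:
        raise ValueError(f"expected {HIDDEN_SIZE} columns, got {matrix.shape[1]}")
    return matrix


def to_heatmap_values(matrix):
    """Min-max scale the matrix into [0, 1] so every cell maps to a heatmap color."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros_like(matrix, dtype=float)
    return (matrix - lo) / (hi - lo)


rng = np.random.default_rng(0)
T = 37  # hypothetical number of steps in the episode
beliefs = [rng.standard_normal(HIDDEN_SIZE).astype(np.float32) for _ in range(T)]
heat = to_heatmap_values(stack_beliefs(beliefs))
print(heat.shape)  # (37, 512): one row per step, one column per hidden unit
```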

To get the belief state from the agent to the visualizer, we need to tell the visualizer where to look for it. Adding rollout_source=("memory", "single_belief") tells the visualizer to look into the agent's rollout (i.e., just the history of its states/actions) and pick out the "single_belief" entry from the agent's "memory". This code broke because the architecture we use for this task (see here) has changed: "single_belief" used to be called "rnn".
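To illustrate why a renamed key breaks the lookup, here is a hypothetical helper (lookup_path is made up for this sketch, not an AllenAct function) that follows a tuple of keys such as ("memory", "single_belief") into a nested dictionary standing in for the rollout.

```python
from typing import Any, Mapping, Sequence


def lookup_path(rollout: Mapping[str, Any], path: Sequence[str]) -> Any:
    """Follow a tuple of keys, e.g. ("memory", "single_belief"), into a nested mapping."""
    node: Any = rollout
    for key in path:
        node = node[key]  # raises KeyError if the architecture renamed this entry
    return node


# Hypothetical rollout after the architecture change: the memory entry is "single_belief".
rollout = {
    "masks": [1, 1, 0],
    "memory": {"single_belief": "hidden-state tensor"},
}

print(lookup_path(rollout, ("memory", "single_belief")))  # hidden-state tensor

try:
    lookup_path(rollout, ("memory", "rnn"))  # the old name no longer exists
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'rnn'
```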

This type of visualization code is definitely an "advanced" topic in AllenAct; even I generally just use the default TensorBoard graphs that are generated without specifying any custom visualizers.

Let me know if that helps or if you have any other questions.

Hustwireless commented 2 years ago

Hi @Lucaweihs, thanks for this detailed walkthrough, it's super clear and helpful!

allenai / allenact

Visualization part of the tutorial 'Using a pre-trained model' has error #325