JuliaPOMDP / POMCPOW.jl

Online solver based on Monte Carlo tree search for POMDPs with continuous state, action, and observation spaces.

Need help in defining and solving a POMDP: Varying Mountainhike #43

Open sai-prasanna opened 1 week ago

sai-prasanna commented 1 week ago

I am working on a POMDP called VaryingMountainHike (which I have redubbed as MordorHike :wink: ). It was introduced here. It is basically a 2D environment with mountains defined by 3 Gaussians. The agent has to get from the bottom left to the top right. The rewards are always negative, and they are least negative on top of the mountains.

I am totally new to Julia and POMDP planners, so bear with me if there are bugs in the implementation.

import POMDPs
import POMDPModels
import Distributions

using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64

    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}

    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> ([pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])]))
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end
POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end

function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        directions = Dict(
            1 => [0.0, 1.0],  # North
            2 => [0.0, -1.0], # South
            3 => [1.0, 0.0],  # East
            4 => [-1.0, 0.0]  # West
        )

        # Apply movement
        direction = directions[a]
        next_pos = s[1:2] + direction * pomdp.translate_step

        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)

        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)

        # Keep the same theta
        return [next_pos[1], next_pos[2], s[3]]
    end)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end

function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Convert pos to a 2-element vector to ensure correct dimensionality
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end

pomdp = MordorHikePOMDP()

# Use POMCPOW solver (as in your first code block)
using POMCPOW
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in hist
    @show s, a, r, sp
    println(isterminal(pomdp, s))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)

I need help figuring out reasonable hyperparameters for the solver on this problem. With the default number of tree queries, the rollout terminated before max_steps without reaching the terminal state.
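For reference, this is roughly the kind of configuration I have been guessing at; the tree_queries and max_depth values here are arbitrary picks on my part, not tuned:

# Sketch of solver settings with explicit tree_queries and max_depth
# (values are guesses, not tuned)
solver = POMCPOWSolver(criterion=MaxUCB(20.0),
                       tree_queries=10_000,
                       max_depth=50)
planner = solve(solver, pomdp)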

himanshugupta1009 commented 12 hours ago

Hey Sai,

I can help you with this. Let's start with a few questions I have.

1) Based on my understanding of your implementation, it doesn't seem like the state variable theta is important, as it remains unchanged throughout the simulation. Is that intentional?

2) Most tree solvers require a good rollout policy to guide the search toward high-value actions. Did you experiment with this? If not, I can help you implement one. I think a reasonable rollout policy would be to move directly to the goal location in a straight line (assuming full observability).

Let me know whenever you get a chance!

Edit: Also, I did some debugging, and it seems like the POMCPOW tree does reach the terminal state during planning. Moreover, the simulation stops before executing max_steps because the agent reaches its goal, which is a terminal state in your implementation.

sai-prasanna commented 11 hours ago
  1. ~The state variable theta is important: although it does not change within an episode, it is picked randomly from four choices at the start. One could obtain a similar sub-sequence of observations for different start angles unless one takes information-seeking actions that shorten that sub-sequence.~ There was a bug: I intended to make the directions depend on theta! Thanks for catching that. Here is the fixed code:
    
    # import Pkg
    # Pkg.add("POMDPs")
    # Pkg.add("POMCPOW")
    # Pkg.add("POMDPModels")
    # Pkg.add("POMDPTools")
    # Pkg.add("Distributions")
    import POMDPs
    import POMDPModels
    import Distributions

using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64

    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}

    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> [pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])])
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end

POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end

function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        theta = s[3]

        # Create movement vectors based on theta
        forward_vector = [cos(theta), sin(theta)]
        lateral_vector = [-sin(theta), cos(theta)]

        # Map actions to movement
        movement = if a == 1      # North
            forward_vector
        elseif a == 2             # South
            -forward_vector
        elseif a == 3             # East
            lateral_vector
        else                      # West
            -lateral_vector
        end

        next_pos = s[1:2] + movement * pomdp.translate_step

        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)

        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)

        # Keep the same theta
        return [next_pos[1], next_pos[2], theta]
    end)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end

function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Take only the x, y coordinates in case pos is longer
    pos_2d = pos[1:2]
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end

pomdp = MordorHikePOMDP()

# Use POMCPOW solver (as in your first code block)
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in hist
    @show s, a, r, sp
    println(norm(s[1:2] - pomdp.goal_position))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)


2. Hmm, I can implement the cross-entropy method (a black-box optimization method) as a rollout policy, or maybe even a greedy one that goes directly to the goal. Is there an example of how to specify this in the API?

When I ran the above code and checked the trajectory (hist), the final state did not satisfy the isterminal condition of being near [0.8, 0.8]! Do you get a different result?

If I try `POMCPOWSolver(criterion=MaxUCB(20.0), tree_queries=10000, max_depth=10)`, it runs for all 200 steps without solving the task, but it performs better than random.
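For what it's worth, this is roughly how I checked the final state (just a sketch reusing the loop pattern from above):

# Sketch: pull the last post-transition state out of the recorded
# history and check the terminal condition on it
final_sp = nothing
for (s, b, a, r, sp, o) in hist
    global final_sp = sp
end
@show isterminal(pomdp, final_sp)
@show norm(final_sp[1:2] - pomdp.goal_position)
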
sai-prasanna commented 10 hours ago

Ok, sorry for the many updates. I reran the solver on the updated code. It sometimes solves the task and sometimes it does not. I guess a non-random rollout policy would help here? Can you help me define one? I don't understand the API that well.

himanshugupta1009 commented 6 hours ago

Yeah, hopefully a better rollout policy will increase the number of times it solves the task.

This would be the general layout for the rollout policy code. `estimate_value` is the solver parameter you use to specify the rollout policy/function.

struct GoToGoalPolicy{P} <: Policy
    pomdp::P
end

function POMDPs.action(p::GoToGoalPolicy, s)
  # ROLLOUT CODE TO RETURN AN ACTION FOR THE INPUT STATE s
end

# FORollout is defined in BasicPOMCP (add `using BasicPOMCP` if it is not already in scope)
rollout_policy = FORollout(GoToGoalPolicy(pomdp))

solver = POMCPOWSolver(
                    estimate_value=rollout_policy,
                    criterion=MaxUCB(20.0),
                    tree_queries=10_000, 
                    max_depth=10
                    )
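
To make it concrete, the body of that action function could be something like the greedy version below. This is just an untested sketch on my end; it assumes the [x, y, θ] state layout and the theta-relative action semantics from your fixed transition function.

# Untested sketch: choose the action whose movement direction points
# most toward the goal, mirroring the directions in the transition
# function (1 = forward, 2 = backward, 3/4 = lateral)
function POMDPs.action(p::GoToGoalPolicy, s)
    theta = s[3]
    to_goal = p.pomdp.goal_position - s[1:2]
    dirs = [[cos(theta), sin(theta)],      # action 1
            [-cos(theta), -sin(theta)],    # action 2
            [-sin(theta), cos(theta)],     # action 3
            [sin(theta), -cos(theta)]]     # action 4
    return argmax(a -> dot(dirs[a], to_goal), 1:4)
end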

Let me know how well it performs after using a better rollout.