sai-prasanna opened this issue 1 week ago
Hey Sai,
I can help you with this. Let's start with a few questions I have.
1) Based on my understanding of your implementation, it doesn't seem like the state variable theta is important, as it remains unchanged throughout the simulation. Is that intentional?
2) Most tree solvers require a good rollout policy to guide the search toward high-value actions. Did you experiment with this? If not, I can help you implement one. I think a reasonable rollout policy would be to move directly to the goal location in a straight line (assuming full observability).
Let me know whenever you get a chance!
Edit: Also, I did some debugging, and it seems like the POMCPOW tree is reaching the terminal state during planning. Moreover, the simulation stops before executing max_steps because the agent reaches its goal, which is a terminal state in your implementation.
# import Pkg
# Pkg.add("POMDPs")
# Pkg.add("POMCPOW")
# Pkg.add("POMDPModels")
# Pkg.add("POMDPTools")
# Pkg.add("Distributions")
using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random
mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64
    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}
    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal
    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,           # translate_step
            0.05,          # translate_std
            0.1,           # obs_std
            0.99,          # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),  # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),  # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])    # mvn_3
        )
    end
end
POMDPs.actions(pomdp::MordorHikePOMDP) = [1, 2, 3, 4]

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    return ImplicitDistribution(rng -> [pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])])
end
function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end

POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end
function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function (rng)
        theta = s[3]
        # Create movement vector based on theta
        forward_vector = [cos(theta), sin(theta)]
        lateral_vector = [-sin(theta), cos(theta)]
        # Map actions to movement
        movement = if a == 1      # North
            forward_vector
        elseif a == 2             # South
            -forward_vector
        elseif a == 3             # East
            lateral_vector
        else                      # West
            -lateral_vector
        end
        next_pos = s[1:2] + movement * pomdp.translate_step
        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)
        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)
        # Keep the same theta
        return [next_pos[1], next_pos[2], theta]
    end)
end
function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end

function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end
pomdp = MordorHikePOMDP()
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in eachstep(hist, "(s, b, a, r, sp, o)")
    @show s, a, r, sp
    println(norm(s[1:2] - pomdp.goal_position))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)
2. Hmm, I can implement the cross-entropy method (a black-box optimization method) as a rollout policy, or maybe even a greedy one that goes directly to the goal. Is there an example of how to specify this in the API?
When I ran the above code and checked the trajectory (hist), the final state did not satisfy the isterminal condition of being near [0.8, 0.8]! Do you get a different result?
If I try `POMCPOWSolver(criterion=MaxUCB(20.0), tree_queries=10000, max_depth=10)`, it runs for the full 200 steps without solving the task, but it still performs better than random.
Ok, sorry for the many updates. I reran the solver on the updated code. It sometimes solves the task and sometimes does not. I guess a non-random rollout policy would help here? Can you help me define one? I don't understand the API that well.
Yeah, hopefully a better rollout policy will increase the number of times it solves the task.
This is the general layout for the rollout policy code. `estimate_value` is the parameter you use to specify the rollout policy/function.
struct GoToGoalPolicy{P} <: Policy
    pomdp::P
end

function POMDPs.action(p::GoToGoalPolicy, s)
    # ROLLOUT CODE TO RETURN AN ACTION FOR THE INPUT STATE s
end

rollout_policy = FORollout(GoToGoalPolicy(pomdp))
solver = POMCPOWSolver(
    estimate_value=rollout_policy,
    criterion=MaxUCB(20.0),
    tree_queries=10_000,
    max_depth=10
)
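If it helps, here is a rough sketch of what the `action` body could look like, assuming the same action/movement convention as your transition function (1/2 = ±forward, 3/4 = ±lateral): it simply picks the action whose movement vector points most directly at the goal.
# Hypothetical greedy rollout body; assumes the action/movement mapping
# used in the transition function above (1/2 = ±forward, 3/4 = ±lateral).
function POMDPs.action(p::GoToGoalPolicy, s)
    theta = s[3]
    forward_vector = [cos(theta), sin(theta)]
    lateral_vector = [-sin(theta), cos(theta)]
    movements = [forward_vector, -forward_vector, lateral_vector, -lateral_vector]
    to_goal = p.pomdp.goal_position - s[1:2]
    # Score each action by how well its movement aligns with the goal direction
    scores = [dot(m, to_goal) for m in movements]
    return argmax(scores)  # index of the best-aligned action (actions are 1:4)
end
Since `FORollout` performs a fully observable rollout (the policy sees the state rather than the belief), reading `s[1:2]` and `s[3]` directly here should be fine.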
Let me know how well it performs after using a better rollout.
I am working on a POMDP called VaryingMountainHike (which I have renamed MordorHike :wink:). It was introduced here. It is basically a 2D environment with mountains defined by three Gaussians. The agent has to get from the bottom left to the top right. The rewards are always negative, and the reward is least negative on top of the mountains.
I am totally new to Julia and POMDP planners, so bear with me if I have made some bugs in the implementation.
I need help figuring out reasonable hyperparameters for the solver for this problem. When I used the default number of tree queries, the rollout terminated before max_steps without reaching the terminal state.