MoveToGoal1Object
[SerializeField] private float timePenalty = -0.01f;
Within OnActionReceived:
SetReward(timePenalty * (1 - Mathf.Abs(rotateAmount) - Mathf.Abs(moveAmount)));
OnTriggerEnter
if (other.CompareTag("Goal"))
{
SetReward(+1.0f);
targets.Remove(other.transform);
Destroy(other.gameObject);
// End episode if all targets are collected
if (targets.Count == 0)
{
float timeRemaining = (MaxStep - StepCount) / (float)MaxStep;
SetReward(1f + timeRemaining);
EndEpisode();
}
}
if (other.CompareTag("Wall"))
{
SetReward(-10f);
EndEpisode();
}
}
Training YML file
behaviors:
  MovetoGoal:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
moveToGoalNewConfig1Goal
behaviors:
  MovetoGoal:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 1.0e-4  # Adjusted learning rate
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: true  # Use normalized inputs
      hidden_units: 256  # Increased hidden units
      num_layers: 3  # Increased number of layers
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 1000000  # Increased max steps
    time_horizon: 128  # Increased time horizon
    summary_freq: 5000  # Increased summary frequency
moveToGoalNewConfig1GoalNew
behaviors:
  MovetoGoal:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3.0e-4
    max_steps: 1000000
    time_horizon: 128
    summary_freq: 10000
    threaded: true
moveToGoalNewConfig1Goal2
behaviors:
  MovetoGoal:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 1.0e-4  # Adjusted learning rate
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: true  # Use normalized inputs
      hidden_units: 256  # Increased hidden units
      num_layers: 3  # Increased number of layers
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3.0e-4
    max_steps: 1000000  # Increased max steps
    time_horizon: 128  # Increased time horizon
    summary_freq: 5000  # Increased summary frequency
    threaded: true
Modified the reward to SetReward(+0.1f); this way the AI will be more willing to collect the coins and try to catch them all.
if (other.CompareTag("Goal"))
{
SetReward(+0.1f);
targets.Remove(other.transform);
Destroy(other.gameObject);
// End episode if all targets are collected
if (targets.Count == 0)
{
float timeRemaining = (MaxStep - StepCount) / (float)MaxStep;
SetReward(1f + timeRemaining);
EndEpisode();
}
}
moveToGoalNewConfig1GoalGroupOf2 inherits from the moveToGoalNewConfig1Goal2 data and uses the same training config file, with two objects to collect.
This line of the code wasn't calculating the penalty correctly: it could still produce a positive number even though it should be a negative value.
SetReward(timePenalty * (1 - Mathf.Abs(rotateAmount) - Mathf.Abs(moveAmount)));
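A minimal sketch of one possible fix, assuming rotateAmount and moveAmount are continuous actions in the [-1, 1] range (the clamp is my addition, not the project's actual code):

// Clamp the activity term to [0, 1] so multiplying by the negative timePenalty
// can never flip the sign and accidentally reward the agent.
float activity = Mathf.Clamp01(1f - Mathf.Abs(rotateAmount) - Mathf.Abs(moveAmount));
SetReward(timePenalty * activity); // timePenalty = -0.01f, so the result is always <= 0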
Corrected the problem by adding a better reward based on the time remaining when all of the targets are collected.
// If the agent collides with a target, remove the target from the list and add a reward
if (other.CompareTag("Goal"))
{
    SetReward(+0.1f);
    targets.Remove(other.transform);
    Destroy(other.gameObject);
    // End episode if all targets have been collected
    if (targets.Count == 0)
    {
        print("Targets collected");
        float timeRemaining = (MaxStep - StepCount) / (float)MaxStep;
        SetReward(1f + timeRemaining);
        EndEpisode();
    }
}
CorrectMovetoGoal1Object\MovetoGoal
As the AI still has problems collecting targets near the wall, and training specifically near the wall didn't help much, the number of raycasts will be increased to 50. This way, the AI will have a better understanding of what is going on around it.
Overall, 50 raycasts will be kept, as they perform better than other solutions such as camera pixels; a camera at a resolution of 32 x 32, for example, would pass 1024 inputs to the AI instead of only 50.
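The ray sensor is normally configured on a RayPerceptionSensorComponent3D in the Inspector; the sketch below only illustrates roughly the same setup in code, and is not the project's actual script. The tag list, ray length, and attachment point are assumptions, and ML-Agents builds 2 x RaysPerDirection + 1 rays, so 25 per direction gives a total in the region of 50:

using System.Collections.Generic;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RaySensorSetup : MonoBehaviour
{
    private void Awake()
    {
        // Assumes a RayPerceptionSensorComponent3D is attached to the agent object.
        var sensor = GetComponent<RayPerceptionSensorComponent3D>();
        sensor.RaysPerDirection = 25;   // 2 * 25 + 1 rays in total, roughly the 50 mentioned above
        sensor.MaxRayDegrees = 180f;    // spread the rays behind the agent as well as in front
        sensor.RayLength = 20f;         // assumed range, tune to the size of the arena
        sensor.DetectableTags = new List<string> { "Goal", "Wall" }; // tags the rays can report
    }
}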
The mouse AI is very similar to the cat AI, but its goal is to run away, which means the script needs to be adjusted for the mouse AI.
Within OnEpisodeBegin, so the timer used for EndEpisode restarts each episode:
timer = 0f;
Within the OnActionReceived function:
// Add reward for time survived
timer += Time.deltaTime;
AddReward(timeSurvivedReward * timer);
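Put together, the mouse's survival reward could look roughly like the sketch below; timeSurvivedReward and timer are the fields implied by the snippets above, while the class layout and the example value are assumptions of mine rather than the project's actual script:

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class MouseAgent : Agent
{
    [SerializeField] private float timeSurvivedReward = 0.001f; // assumed value, tuned during training

    private float timer;

    public override void OnEpisodeBegin()
    {
        timer = 0f; // restart the survival timer for the new episode
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Movement handling omitted; only the survival reward is sketched here.
        timer += Time.deltaTime;
        AddReward(timeSurvivedReward * timer); // reward grows the longer the mouse survives
    }
}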
FleefromAgent\FleefromAgent
Now the cat AI agent will be trained against the mouse AI, so it will learn new methods for capturing the mouse.
Now the mouse AI is trained against the cat AI, so the mouse AI will learn how not to get captured.
The cat AI then starts full training against all 20 mice, so it will improve at capturing all of them.
Overall, the cat AI improved in the final scenario to capture all 20 mice, and the mice were running away from the cat AI. This shows that the ML agents are working correctly for this scenario and that the AIs were successfully created for this type of scenario.
Different attempts at training.
I tried to give the ML agent its own location and the locations of the targets as inputs, but I didn't know how to make it work. The ML agent needs consistent inputs, but when targets were collected they were deleted, and the ML agent was still left with stale data for them. This made the ML agent go to random places with a very low success rate. So I chose 3D raycasts as the sensor for the ML agent; this way, the inputs are consistent and kept up to date.
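For reference, one way the position-based observations could have been kept at a fixed size is to always write one slot per possible target and pad the slots of targets that have already been collected. This is only a hedged sketch of that idea; the class name, field names, padding value, and presence flag are mine, not the project's:

using System.Collections.Generic;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class FixedSizeObservationAgent : Agent
{
    [SerializeField] private int maxTargets = 2;                    // assumed upper bound on targets in the scene
    private readonly List<Transform> targets = new List<Transform>();

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(transform.localPosition);             // the agent's own position

        // Always emit exactly maxTargets entries so the observation size never changes.
        for (int i = 0; i < maxTargets; i++)
        {
            if (i < targets.Count && targets[i] != null)
            {
                sensor.AddObservation(targets[i].localPosition);    // live target position
                sensor.AddObservation(1f);                          // "target still present" flag
            }
            else
            {
                sensor.AddObservation(Vector3.zero);                // padding for a collected target
                sensor.AddObservation(0f);                          // "target gone" flag
            }
        }
    }
}

In the project, the 3D ray sensor was used instead, since it keeps the observation size constant without any of this bookkeeping.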