StatsRecorder custom statistic always 0

poehlerflorian commented 4 years ago

Describe the bug Adding a custom statistic with Academy.Instance.StatsRecorder.Add(...) produces 0 in some cases even if the supplied value is not 0.

I am creating a sir model to simulate infectious diseases like the corona virus for an university project. I'd like to know how often the citizens bumped into each other and what healthstatus (susceptible, infected, recovered) they had at the time. The Healthstatus is an enum and on each collision I increase a value in a dictionary. At the end of each episode I simply loop through the enum and add the values in the dictionary to the StatsRecorder. This works fine for infected and recovered, but not for susceptible for some reason even though it does have a similar value as infected and recovered when I am debugging. I also have another metric I'd like to add to the tensorboard that always produces 0 even though it does have a value greater than 0.

To Reproduce I have not been able to reproduce this in on of the example environments yet, but I will try to do so and add instructions later. I just wanted to first create this bug report, hoping that maybe others also had this problem.

Console logs / stack traces

Screenshots Tensorboard

If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Unity Version: 2019.3.15f1
OS + version: Windows 10
ML-Agents version: com.unity.ml-agents_1.1.0 (7edf429d09224006c27715bef1ea6011fb7b1594)
TensorFlow version: 2.2.0
Environment: --

chriselion commented 4 years ago

I tried and was unable to reproduce this. My code was extremely simple:

        var statsRecorder = Academy.Instance.StatsRecorder;
        statsRecorder.Add("Susceptible", 2.0f + Random.value);
        statsRecorder.Add("Infected", 3.0f + Random.value);
        statsRecorder.Add("Recovered", 4.0f + Random.value);

and on the 3DBall scene, it produced this:

At this point, I suspect the error is in your code, but hard to say without seeing more of what you're doing.

poehlerflorian commented 4 years ago

That was my first thought, too, but I just do not see where the error could be since the relevant code for that is quite simple. Outputting the value to the console before calling StatsRecorder.Add() produces the correct result while the tensorboard still shows 0.

foreach (HealthStatus healthStatus in Enum.GetValues(typeof(HealthStatus))) {
    Debug.Log($"{healthStatus}: {_collisionCount[healthStatus]}");
    Academy.Instance.StatsRecorder.Add($"Collisions/{healthStatus}", _collisionCount[healthStatus]);
}

Console Log This is just a small part of the log, but the rest is similar with just a few 0s.

Tensorboard Statistics

chriselion commented 4 years ago

Hmm, still not sure what's going on. I might need you to debug a bit into the python code.

Can you try clone the repo (you probably want the release_3 tag) and run the local development steps here? https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md#advanced-local-installation-for-development-2

I think the relevant parts of the code to see where the information is going bad are:

Where python receives the data from C#: https://github.com/Unity-Technologies/ml-agents/blob/release_3/ml-agents-envs/mlagents_envs/side_channel/stats_side_channel.py#L39
Where we aggregate the values for the StatsReporter: https://github.com/Unity-Technologies/ml-agents/blob/release_3/ml-agents/mlagents/trainers/agent_processor.py#L321
Where we actually write to Tensorboard: https://github.com/Unity-Technologies/ml-agents/blob/release_3/ml-agents/mlagents/trainers/stats.py#L177

Can you add some print statements around there and see if the values are what you'd expect? The last one is probably the most important; if the values look right there, then there's something wrong with Tensorboard displaying them (and thus, beyond my capacity to help).

Let me know how that goes, and based on what you find, I'll figure out how to debug further (but this should narrow the scope of the problem a lot).

poehlerflorian commented 4 years ago

Alright, as usual, the error is my own fault, and could have been prevented by simply reading the documentation more carefully. Great.. But let me explain.

Since we are creating a SIR model, we of course have more than one agent. Our goal was to track the average number of collisions of all agents in each simulation. Luckily, ml agents provides us with a method that takes the value of each agent and sends the average to the tensorboard, right? Well, kind of. If I had read the summary of StatAggregationMethod.Average, I would have noticed that it clearly stated that values from the same C# environment in the same step may replace each other. This means that we now know, that all of our recent tensorboard statistics are wrong because only the last value supplied is actually being used. Whoops.

But why was it always 0? Well, the statistics are being added in the Kill() method which is called at the end of each episode for each agent in the same order as during the instantiation. But the last agent created is always already infected from the start, which is why he could not collide with other agents while being in the susceptible state. Therefore, the tensorboard always showed 0.

I would not have been able to discover this, and still trust incorrect statistics without your help. Your instructions were really easy to follow and led me precisely to my mistake and this conclusion. Thank you for that!

On a side note though: Are there any plans to allow adding multiple values per environment and step in the future?

chriselion commented 4 years ago

Hi, I had another look at the code and I think it would be pretty straightforward to prevent the overwriting. We'd need to change the handling in StatsSideChannel to store a list values for each key, instead of a single value, and make sure this gets propagated to the StatsReporter.

If you'd like to try to make this change, we can definitely merge it. Otherwise, I'll try to get to it soon, but no promises on when.

Alternatively, you can maintain a list of values in C#, and average them and send them to the StatsRecorder in a FixedUpdate step.

poehlerflorian commented 4 years ago

Ok, that sounds good. If this issue is still open in a few weeks, I will try to change this myself, but at the moment I don't have enough time to do that.

For now, we will handle this in C#, like you said. Should work just fine.

chriselion commented 4 years ago

The internal tracking number for this is MLA-1121

chriselion commented 4 years ago

PR to fix this: https://github.com/Unity-Technologies/ml-agents/pull/4236

chriselion commented 4 years ago

That's fixed in master, and will be in the next release (sometime in August).

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Unity-Technologies / ml-agents

StatsRecorder custom statistic always 0 #4150