fix sustainability metric

The old sustainability metric was computing something like the fraction of agent-timesteps with nonzero reward. What we want instead is S: the average timestep at which an agent receives a nonzero reward.

Intuitively, if agents are gathering apples later and later into the episode, the environment is becoming more sustainable.
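
For reference, here is a minimal sketch of S as described above. It is not the code in this PR. The function name `sustainability`, the dict-of-lists input, 0-based timesteps, skipping agents that never receive a reward, and averaging per agent before averaging across agents are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def sustainability(rewards: Dict[str, List[float]]) -> Optional[float]:
    """Average timestep at which agents receive a nonzero reward.

    `rewards` maps a (hypothetical) agent id to its per-timestep rewards,
    where the list index is the timestep, counted from 0 (assumption).
    """
    per_agent_means = []
    for agent_rewards in rewards.values():
        # Timesteps at which this agent received a nonzero reward.
        steps = [t for t, r in enumerate(agent_rewards) if r != 0]
        # Assumption: agents that never receive a reward are skipped
        # rather than counted as 0.
        if steps:
            per_agent_means.append(sum(steps) / len(steps))
    if not per_agent_means:
        return None  # no agent received any reward, so S is undefined
    return sum(per_agent_means) / len(per_agent_means)


# Both episodes have 3 nonzero rewards out of 8 agent-timesteps,
# so a fraction-based metric cannot tell them apart.
early = {"a": [1, 1, 0, 0], "b": [1, 0, 0, 0]}
late = {"a": [0, 0, 1, 1], "b": [0, 0, 0, 1]}
print(sustainability(early))  # 0.25
print(sustainability(late))   # 2.75
```

The two toy episodes collect the same number of apples, but S is higher when the apples are gathered later, which is the behavior the intuition above asks for.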

RedTachyon / cpr_reputation

fix sustainability metric #47