The code in pytorch/vision/train.py and pytorch/vision/evaluate.py shows how to calculate metrics over batches of data.
In train.py, the metrics are calculated only once in a while rather than on every batch, so the reported values do not represent the whole dataset. In evaluate.py, the size of the dataset may not be divisible by the batch size, so the last batch can be smaller than the others and a plain average of the per-batch means is not exact either. A better approach is to weight each batch's mean by the number of examples in that batch, sum the weighted values, and divide by the size of the whole dataset.
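
To illustrate the suggestion, here is a minimal sketch of the weighted average in plain Python. The function name `weighted_mean_metric` and the example numbers are hypothetical and are not taken from train.py or evaluate.py; the sketch only assumes that each batch yields a mean metric value and that its size is known.

```python
def weighted_mean_metric(batch_means, batch_sizes):
    """Combine per-batch mean metrics into a dataset-level mean.

    Each batch mean is weighted by the number of examples in that batch,
    so a smaller final batch does not count as much as a full one.
    (Hypothetical helper, not part of the repository.)
    """
    if len(batch_means) != len(batch_sizes):
        raise ValueError("batch_means and batch_sizes must have the same length")
    total_examples = sum(batch_sizes)
    if total_examples == 0:
        raise ValueError("no examples to average over")
    weighted_sum = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return weighted_sum / total_examples


# Made-up example: 10 examples with batch size 4 give batches of 4, 4 and 2.
batch_accuracies = [0.75, 0.5, 1.0]
batch_sizes = [4, 4, 2]

# Plain average of batch means: treats the batch of 2 like a batch of 4.
naive = sum(batch_accuracies) / len(batch_accuracies)

# Weighted average: equals the accuracy over all 10 examples.
exact = weighted_mean_metric(batch_accuracies, batch_sizes)

print(f"naive={naive:.3f} exact={exact:.3f}")
```

In this made-up case the plain average overstates the accuracy because the small last batch happens to score well; the two values agree only when all batches have the same size.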