Open JiaqiLiu opened 2 years ago
https://github.com/Gurobi/modeling-examples/blob/1abb8700611e45bb34a760eebe2f6dcd1ff85875/linear_regression/l0_regression.html#L13183
RSS vs MSE
This paragraph says that it is not advisable to use RSS as the performance metric, and recommends MSE estimated via cross-validation instead.
I think the emphasis on MSE over RSS is misleading. Note that, given an estimate $\hat\beta$,
$$ \mathrm{RSS} = (y-X\hat\beta)^T(y-X\hat\beta) = \sum (y_i - \hat{y}_i)^2 = n \cdot \mathrm{MSE} $$
So, because they differ only by the constant factor $n$, the training MSE and the RSS decrease monotonically together as more features are considered, not only the RSS.
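For concreteness, here is a minimal numeric check of that identity (my own sketch, not taken from the notebook; variable names are illustrative):

```python
# Minimal check that training RSS equals n * training MSE for any fitted beta_hat.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.1, size=n)

# Ordinary least-squares fit (any other estimate of beta would work the same way)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

rss = residuals @ residuals        # (y - X beta)^T (y - X beta)
mse = np.mean(residuals ** 2)      # training MSE on the same data

print(rss, n * mse)                # identical up to floating-point error
assert np.isclose(rss, n * mse)
```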
Cross-validation
The cross-validation part should be correct, though. That is, we use a grid search to find the best $s$.
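To make the grid-search idea concrete, here is a minimal sketch of choosing $s$ by cross-validated MSE. The call `fit_best_subset(X, y, s)` is a hypothetical placeholder for the notebook's MIQP-based best-subset solver, which I assume returns the fitted coefficient vector:

```python
# Sketch: pick the sparsity budget s by K-fold cross-validated MSE.
# fit_best_subset(X, y, s) is a hypothetical stand-in for the notebook's solver.
import numpy as np
from sklearn.model_selection import KFold

def choose_s_by_cv(X, y, s_grid, fit_best_subset, n_splits=5, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_mse = []
    for s in s_grid:
        fold_mse = []
        for train_idx, val_idx in kf.split(X):
            beta = fit_best_subset(X[train_idx], y[train_idx], s)
            resid = y[val_idx] - X[val_idx] @ beta
            fold_mse.append(np.mean(resid ** 2))  # held-out MSE, not training RSS
        cv_mse.append(np.mean(fold_mse))
    best_s = s_grid[int(np.argmin(cv_mse))]
    return best_s, cv_mse
```

One would then refit on the full training set with the selected $s$. The key point is that the metric is evaluated on held-out folds, which is what breaks the monotone decrease you get with either RSS or MSE on the training data.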