Simplified & sped up evaluation script by moving rouge score computation to separate script

Simplification:

The evaluate summaries script now always runs on both extrinsic datasets (data_subset is no longer a parameter).
The ROUGE score script (compute_rouge_scores) always computes ROUGE scores for the full set.
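
For reviewers, here is a minimal sketch of what a standalone ROUGE scoring step can look like once it is decoupled from evaluation. It is not the code in this PR: the function names `rouge1_f1` and `compute_scores` are hypothetical, and it uses a simplified ROUGE-1 F1 with whitespace tokenization (no stemming) rather than a full ROUGE implementation.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def compute_scores(references: dict, summaries: dict) -> dict:
    """Score every summary that has a reference (hypothetical helper).

    There is deliberately no subset parameter: the full set is always scored.
    """
    return {
        summary_id: rouge1_f1(references[summary_id], summary)
        for summary_id, summary in summaries.items()
        if summary_id in references
    }
```

Keeping the scoring in its own step means the evaluation script does not have to recompute the scores on every run.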

I haven't run compute_rouge_scores on the full set yet, since I don't have enough bandwidth to download the full logs from Paperspace. I will do so once I'm back home. I did verify that the ROUGE score script works by testing it on a smaller subset of the data.

danton-nlp/GEF #120