Problem in evaluation

I'm confused about how to evaluate paragraph-level output. Should I treat the whole generated paragraph (multiple sentences) as one long sentence, treat the ground-truth paragraph the same way, and then feed both into BLEU, CIDEr (and so on)? Or should I change the code in bleu.py and cider.py so that paragraphs are evaluated sentence by sentence, matching each generated sentence against one ground-truth sentence? I hope you can help me with this. Thank you!
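
To make the two options concrete, here is a rough sketch of what I mean. It does not use the repo's bleu.py or cider.py. `ngram_precision` is a toy clipped n-gram precision that I wrote as a stand-in for a real metric, and the function names and example sentences are made up for illustration. I am also assuming that the i-th generated sentence is paired with the i-th ground-truth sentence, which may not be the intended alignment.

```python
from collections import Counter


def ngram_precision(candidate, references, n=1):
    """Toy clipped n-gram precision (a stand-in, NOT real BLEU or CIDEr)."""
    cand_tokens = candidate.lower().split()
    cand_ngrams = Counter(
        tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
    )
    if not cand_ngrams:
        return 0.0
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        ref_tokens = ref.lower().split()
        ref_ngrams = Counter(
            tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
        )
        for gram, count in ref_ngrams.items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())


def score_as_one_sentence(gen_sentences, ref_sentences, n=1):
    """Option 1: join each paragraph into one long 'sentence' and score once."""
    return ngram_precision(" ".join(gen_sentences), [" ".join(ref_sentences)], n)


def score_sentence_by_sentence(gen_sentences, ref_sentences, n=1):
    """Option 2: score the i-th generated sentence against the i-th
    ground-truth sentence and average (assumed alignment; zip drops extras)."""
    pairs = list(zip(gen_sentences, ref_sentences))
    if not pairs:
        return 0.0
    return sum(ngram_precision(g, [r], n) for g, r in pairs) / len(pairs)


# Hypothetical paragraphs: same sentences, different order.
generated = ["a man rides a horse", "the sky is blue"]
reference = ["the sky is blue", "a man rides a horse"]

print(score_as_one_sentence(generated, reference))       # 1.0
print(score_sentence_by_sentence(generated, reference))  # 0.0
```

With this toy metric the two options can disagree completely: when the generated paragraph contains the right sentences in a different order, the whole-paragraph score is perfect while the sentence-by-sentence score is zero. That difference is why I am asking which protocol is the intended one.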