reproduce trlx example on wandb

🐛 Describe the bug

Hello everyone, I am attempting to replicate the trlx example found on Wandb at this link: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2

I have a specific question about the evaluation performance of the step 1 SFT model. The example uses the GPT-J 6B model and reports the following evaluation performance:

rouge1: 0.33495259362557406 rouge2: 0.12516756897761228 rougeL: 0.2614311397592001 rougeLsum: 0.2613398039055508

However, when I tried using the Pythia-1B model, the evaluation performance was:

rouge1: 0.6174393246459715 rouge2: 0.2168605526286073 rougeL: 0.42690106548725937 rougeLsum: 0.5442461303596511

and for Pythia-70M, the evaluation performance was:

rouge1: 0.5579889287651427 rouge2: 0.16015618115528707 rougeL: 0.3522189070990923 rougeLsum: 0.48252334790510854

Can someone please explain why the ROUGE scores for the SFT-tuned GPT-J 6B model are so low compared to those of the Pythia models?

Which trlX version are you using?

I followed the installation instructions, so it should be the latest version.

Additional system and package information

Python 3.9, p3dn.24xlarge EC2 instance

I also checked the outputs of my trained Pythia models by hand, and they seem to be doing OK. Here is one example.

prompt:

SUBREDDIT: r/books
TITLE: I'm going there: I cannot bring myself to finish LOTR.
POST: **Notice: I mean no disrespect to fans so please don't take this the wrong way.**

...but I'm more than open to having my mind changed if someone can explain to me what I'm missing.

My job has an hour-long commute, so I joined the library and have been going through podcasts and audiobooks like crazy. 

While sifting through their audiobook collection I saw they had all of the LOTR books, which I've never read -- I wanted to as a teenager but never got around to it. I never watched the movies because I wanted to read the books first because I'm a nerd like that. I knew absolutely nothing about the series other than the brief introduction I had to it while playing the Interplay LOTR adventure game on PC in the mid-90's for half an hour once.

So for the first few discs I found it a little monotonous. Lots of awkward singing by the narrator, lots of painfully long descriptions of the Shire and lots of genealogy for minor characters, which I found odd. But I assumed it'd pick up since I had 14 discs left to go.

I'm currently on disc 9 (right as they're getting to Rivendell) but I absolutely cannot get interested in it, though not for lack of trying. Each commute it gets more difficult for me to keep listening rather than just throw on music or the news. Every time I get to a new disc I feel like I just finished several hours of homework and I have to bargain with myself to start the next one. 

It's not that I don't like it. I like Tolkien's style, the characters are ridiculously well-developed and I can appreciate how groundbreaking it was in the 1950's...I just can't figure out what's so interesting and exciting about it to so many people, and I certainly can't imagine spending 50+ more hours finishing out the entire trilogy. 

Can someone change my mind before I bail and take it back to the library on Tuesday? 

If it matters in your analysis, my normal taste is Philip K. Dick, Kurt Vonnegut, David Sedaris and non-fiction about science, politics and religion. My favorite book is Good Omens by Gaiman/Pratchett.
TL;DR:

label

Can't get into Lord of the Rings but I'm open to giving it another shot if someone can tell me why I should.

pythia 1b output

TL;DR: ~~I want to know why you think LOTR is worth reading when most of us who haven't seen them will probably be disappointed at some point.<|endoftext|>

pythia 70m output

TL;DR: ive lost interest in reading sci-fi novels, now I am stuck at home alone and need some advice from others who know.<|endoftext|>

gptj-6b output (from https://huggingface.co/CarperAI/openai_summarize_tldr_sft)

TL;DR: ~~Why do you think everyone loves LotR when it feels like such drudgery to me despite being able to enjoy most things else?<|endoftext|>

I feel that what GPT-J 6B generated is at least better than the output of my fine-tuned Pythia-70M model, if not better than that of the Pythia-1B model. Why are its ROUGE scores so much lower?
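To sanity-check the comparison, the three outputs above can be scored against the same label with one function and identical preprocessing. The snippet below is a minimal sketch of a ROUGE-1 F1 score (unigram overlap); it is not the metric implementation used by the trlX example. The helper names `tokenize` and `rouge1_f1` are hypothetical, and the tokenizer is a simplifying assumption (lowercasing, alphanumeric tokens only, no stemming), so the numbers will not match the reported ones exactly.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep alphanumeric runs only; no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(prediction, reference):
    # Simplified ROUGE-1: F1 over clipped unigram counts.
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


label = (
    "Can't get into Lord of the Rings but I'm open to giving it "
    "another shot if someone can tell me why I should."
)

# Model outputs copied from this issue, with the "TL;DR:" prefix and
# the end-of-text token removed so that only the summary is scored.
outputs = {
    "pythia-1b": (
        "I want to know why you think LOTR is worth reading when most of us "
        "who haven't seen them will probably be disappointed at some point."
    ),
    "pythia-70m": (
        "ive lost interest in reading sci-fi novels, now I am stuck at home "
        "alone and need some advice from others who know."
    ),
    "gptj-6b": (
        "Why do you think everyone loves LotR when it feels like such drudgery "
        "to me despite being able to enjoy most things else?"
    ),
}

for name, summary in outputs.items():
    print(f"{name}: rouge1_f1 = {rouge1_f1(summary, label):.3f}")
```

If the models rank differently here than in the full evaluation, that may point to a difference in how predictions or references were prepared in each run rather than in summary quality, though a single example is not conclusive.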
