ChenxinAn-fdu / CoLo

[COLING'22] Code for our paper: "COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization"

Some concerns about COLO #1

Open lxx909546478 opened 2 years ago

lxx909546478 commented 2 years ago

Great work! There are a few points I would like to ask about after reading the paper.

Looking forward to your reply.

ChenxinAn-fdu commented 2 years ago

Good questions.

  1. I think the main difference between BRIO and the abstractive version of COLO lies in the loss function. The loss function in BRIO is a reinforcement-learning loss that is used to learn the order of candidates, while COLO follows a contrastive learning paradigm. Also, BRIO may not be suitable for extractive methods.
  2. COLO already uses the predicted score (modeled with a BCE loss) to clip the document down to its K highest-scoring sentences. Because most summaries in CNN/DM have 2–3 sentences, we then obtain the candidate set size as C(K, 2) + C(K, 3).
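The candidate construction described above (clip to the top-K sentences by predicted score, then enumerate every 2- and 3-sentence combination) can be sketched as follows. This is an illustrative sketch, not the repo's actual code; the function name and example scores are invented.

```python
from itertools import combinations

def build_candidates(sentence_scores, k=4):
    """Clip a document to its top-k sentences by predicted score,
    then enumerate every 2- and 3-sentence combination as a
    candidate summary, giving C(k, 2) + C(k, 3) candidates."""
    # Indices of the k highest-scoring sentences, kept in document order.
    top_k = sorted(sorted(range(len(sentence_scores)),
                          key=lambda i: sentence_scores[i],
                          reverse=True)[:k])
    candidates = []
    for size in (2, 3):
        candidates.extend(combinations(top_k, size))
    return candidates

# 6 sentences clipped to k=4 gives C(4, 2) + C(4, 3) = 6 + 4 = 10 candidates.
scores = [0.9, 0.1, 0.7, 0.4, 0.8, 0.3]
print(len(build_candidates(scores, k=4)))  # 10
```

Without the BCE-score clipping step, the enumeration would run over all n sentences instead of just K, which is why the clipping matters for long documents.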
lxx909546478 commented 2 years ago

Thanks for your reply.

  1. I agree that COLO is well designed for the extractive task. But BRIO also adopts a contrastive learning paradigm (using a margin ranking loss) rather than reinforcement learning to fit the ROUGE score. I think the abstractive version of COLO and BRIO do almost the same thing in different ways: COLO uses an InfoNCE-style loss (maybe?), while BRIO uses a margin ranking loss.
  2. Perhaps my incomplete wording caused a misunderstanding. What I really mean is: have you tried giving each sentence a score and selecting the 2–3 sentences with the highest scores? That would have a complexity of O(n) rather than O(n^2). If it works, it would suggest that the concatenation of the highest-scoring sentences already achieves a high ROUGE score.
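To make the contrast in point 1 concrete, here is a toy sketch of the two loss shapes being compared: an InfoNCE-style loss over candidate scores (one positive against the rest) and a BRIO-style pairwise margin ranking loss over candidates sorted by descending ROUGE. This operates on plain floats for illustration and is not the actual COLO or BRIO implementation.

```python
import math

def info_nce_loss(scores, pos_index=0, temperature=1.0):
    """InfoNCE over candidate scores: treat one candidate (e.g. the
    gold summary) as the positive and all others as negatives."""
    logits = [s / temperature for s in scores]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(logits[pos_index] - log_denom)

def margin_ranking_loss(scores, margin=0.01):
    """BRIO-style ranking loss: `scores` is ordered by descending
    ROUGE, and each higher-ranked candidate should beat each
    lower-ranked one by a margin growing with the rank gap."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss

print(margin_ranking_loss([0.8, 0.5, 0.3]))  # 0.0: ordering already satisfied
```

The key difference visible here: InfoNCE pushes one positive above all negatives jointly, while the margin ranking loss only enforces the pairwise order of the candidate list, without singling out one positive.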
ChenxinAn-fdu commented 2 years ago
  1. BRIO contrasts the likelihoods of sequences while COLO contrasts cosine similarity scores; in both cases the loss function pushes the sequence with the higher ROUGE score toward a larger score. The loss function of BRIO is actually equivalent to r_1 * p(y_1) + r_2 * p(y_2) + ... + r_n * p(y_n), where r_i is a reward and p(y_i) is the likelihood of a sequence. However, BRIO does not directly optimize the ROUGE score but instead optimizes the order, which may be easier to estimate, and that may explain its success.
  2. I have not tried this in COLO. I think you can try it with our open-source code~
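The O(n) alternative discussed in point 2 (score each sentence once and take the top 2–3 directly, instead of scoring every 2- and 3-sentence combination) could be sketched like this; the function name and example scores are hypothetical.

```python
def greedy_extract(sentence_scores, n_select=3):
    """Select the n_select highest-scoring sentences directly (the
    cheap alternative to scoring every 2- and 3-sentence combination),
    returning their indices in original document order."""
    ranked = sorted(range(len(sentence_scores)),
                    key=lambda i: sentence_scores[i],
                    reverse=True)
    return sorted(ranked[:n_select])

print(greedy_extract([0.9, 0.1, 0.7, 0.4, 0.8, 0.3]))  # [0, 2, 4]
```

The trade-off is that this scores sentences independently, so it cannot account for redundancy or complementarity between the selected sentences, which is exactly what scoring whole candidate combinations captures.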
lxx909546478 commented 2 years ago

Thanks a lot! COLO is inspiring work. I will try this in other experimental settings.