Given the special reward setting in this work, a better policy will select the sentences in each bag that have a higher log P(r|x_i). The best possible outcome is to pick the sentence with the maximum value, which means finding the single highest-scoring sentence in each bag and feeding it to the classifier for training. Is that correct?
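To make the question concrete, here is a minimal sketch of the selection I have in mind. Everything in it is an assumption for illustration: the function name `select_max_per_bag`, the toy bags, and the log-probability values are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

# Each bag maps to a list of (sentence, log P(r | x_i)) pairs.
# The scores are made-up numbers, only meant to illustrate the selection.
Bag = List[Tuple[str, float]]


def select_max_per_bag(bags: Dict[str, Bag]) -> Dict[str, str]:
    """Keep, for every non-empty bag, the one sentence with the highest score."""
    selected = {}
    for bag_id, scored_sentences in bags.items():
        if not scored_sentences:
            continue  # an empty bag contributes no training sentence
        best_sentence, _ = max(scored_sentences, key=lambda pair: pair[1])
        selected[bag_id] = best_sentence
    return selected


toy_bags = {
    "bag_1": [("s1", -2.3), ("s2", -0.4), ("s3", -1.1)],
    "bag_2": [("s4", -0.9), ("s5", -3.0)],
}
chosen = select_max_per_bag(toy_bags)
```

If my reading is right, the optimal policy would reduce to this argmax, and only the sentences in `chosen` would be used to train the classifier.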