arXiv / html_feedback

Supports a student project developing a UI for feedback on arXiv articles rendered as html.
MIT License

translation #2284

Open Jackeylove1103 opened 1 month ago

Jackeylove1103 commented 1 month ago

Description

No translation here.

(Optional:) Please add any files, screenshots, or other information here.

No response

(Required) What is this issue most closely related to? Select one.

Choose One

Internal issue ID

04f74a44-fb07-45ea-8470-f2204e1c9d03

Paper URL

https://arxiv.org/html/2309.17179?_immersive_translate_auto_translate=1

Browser

Chrome/129.0.0.0

Device Type

Desktop

html-feedback-bot[bot] commented 1 month ago

Location in document: S3.SS1.p2.6

Selected HTML:

For a given natural language task, we can define a reward function $R(y_t \mid \mathbf{x}_{0:L-1}, \mathbf{y}_{0:t-1})$ as the task performance feedback for intermediate generation $y_t$ at timestep $t$. Due to the lack of large-scale and high-quality intermediate reward labels for general tasks, it is usually a sparse reward setting where any intermediate reward from the first $T-1$ timestep is zero except the last $T$-th step. A typical case can be RLHF alignment task, where LLM can receive the reward signal after it completes the full generation. Following the same logic, $\mathbf{y}$ can also be viewed as a sequence of sentences.

Given the problem formulation above, we successfully transfer the problem of better generation to optimization for higher cumulative reward. In this paper, we focus on how we can optimize it with tree-search algorithms. A specific natural language task typically predefines the state space (with language) and reward function (with task objective/metrics). What remains is the definition of action space, or in the context of tree-search algorithm, the action node.
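The sparse-reward setting in the selected passage can be sketched roughly as follows. This is only an illustration under stated assumptions: `sparse_reward`, `cumulative_reward`, and the `task_score` callable are hypothetical names that do not come from the paper or this repository, and steps are counted 1..T here. Every step before the last receives zero reward, so the cumulative reward reduces to the score of the full generation.

```python
from typing import Callable, Sequence

# Hypothetical type: maps (prompt tokens, generated tokens) to a task score.
ScoreFn = Callable[[Sequence[str], Sequence[str]], float]


def sparse_reward(
    prompt: Sequence[str],
    generated: Sequence[str],
    t: int,
    total_steps: int,
    task_score: ScoreFn,
) -> float:
    """Reward for step t (assumed 1-indexed, 1..total_steps).

    Zero for the first total_steps - 1 steps; the task score of the
    generation so far at the last step.
    """
    if not 1 <= t <= total_steps:
        raise ValueError("t must be in 1..total_steps")
    if t < total_steps:
        return 0.0
    return task_score(prompt, generated)


def cumulative_reward(
    prompt: Sequence[str],
    generated: Sequence[str],
    task_score: ScoreFn,
) -> float:
    """Sum of per-step rewards; with sparse rewards only the last term is nonzero."""
    total_steps = len(generated)
    return sum(
        sparse_reward(prompt, generated[:t], t, total_steps, task_score)
        for t in range(1, total_steps + 1)
    )
```

With this shape, a search procedure that maximizes cumulative reward is effectively maximizing the final task score, which is why the passage treats the definition of the action space as the remaining design choice.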

github-actions[bot] commented 1 month ago

Hello @Jackeylove1103, thanks for the issue report! We are reviewing your report and will address it as soon as possible.

dginev commented 1 month ago

@Jackeylove1103 could you find out why your translation tool fails here and tell us the technical reason?

As it stands, it is not clear whether arXiv contributed to that effect directly.