gao-g / prelude

Code for the paper "Aligning LLM Agents by Learning Latent Preference from User Edits".
https://arxiv.org/pdf/2404.15269
MIT License

Configs / Instructions to Replicate Experiments in Paper #2

Closed: StephAO closed this issue 1 month ago

StephAO commented 2 months ago

I am hoping to reproduce the results. When can I expect the configs/instructions to replicate the experiments in the paper to be updated?

dkmisra commented 2 months ago

Hi @StephAO. Thanks for your interest. We will provide you with specific instructions to run all experiments within 1-2 days. Please stay tuned.

@gao-g @ataymano.

StephAO commented 2 months ago

Thank you for the update, @dkmisra. A few additional notes I've found while going through your repository:

  1. The spacing and capitalization in your prompts are inconsistent (e.g. "ARTICLE: {input}" vs " Article:{input}").
  2. Triple-quoted strings keep all of the whitespace contained within them, so indenting continuation lines to match the previous line ends up leaving a number of stray spaces in odd places in the generated prompts.
  3. The paper mentions that you use cosine similarity for document retrieval, and while the mpnet implementation does normalize the tensors, the BERT implementation does not, which means the BERT implementation is doing a dot product, not a cosine similarity (see the sketch below this list). Further, it is standard to use the CLS token to compare strings; however, the current implementation averages all the tokens instead --- is there a reason for this?
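
To make the normalization point concrete, here is a minimal sketch in plain PyTorch (not the repo's retrieval code; emb_a and emb_b just stand in for pooled BERT document embeddings):

import torch
import torch.nn.functional as F

# Stand-ins for two batches of pooled BERT document embeddings, shape (batch, hidden).
emb_a = torch.randn(4, 768)
emb_b = torch.randn(4, 768)

# Without normalization, the retrieval score is a plain dot product.
dot_score = (emb_a * emb_b).sum(dim=-1)

# Cosine similarity is the dot product of L2-normalized embeddings.
cos_score = (F.normalize(emb_a, dim=-1) * F.normalize(emb_b, dim=-1)).sum(dim=-1)
# Equivalent shortcut provided by PyTorch:
cos_score_alt = F.cosine_similarity(emb_a, emb_b, dim=-1)
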
ataymano commented 2 months ago

Hi @StephAO, thank you for your interest! Please find the experiments folder with the running instructions here. Please let me know if you have more questions or need any further help.

dkmisra commented 2 months ago

Hi @StephAO

I can comment on your other points. By the way, thanks for the very precise and helpful comments.

  1. We'll standardize them and re-run the experiments. @gao-g and @ataymano (noting for the next version). I doubt this will cause a significant difference, but let's see what the new results say.

  2. Would you happen to have an example of this? @ataymano

  3. You are right that the code isn't normalizing the BERT embeddings. This needs to be fixed. We'll do that in the next version.

We didn't compare against the CLS embedding, but the use of averaging makes sense to me and has been used in the literature for text encoding (e.g., 1, 2). A CLS token might be more focused on certain words, perhaps towards the boundaries of the text. In contrast, averaging gives more weight to each token but can dilute important words. In hindsight, I feel we should have used something like BERTScore.
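
For reference, here is a small sketch of the two pooling options being discussed, using Hugging Face transformers (bert-base-uncased is only a placeholder checkpoint, not necessarily the one used in the repo):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["an example document"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**enc).last_hidden_state              # (batch, seq_len, hidden)

cls_emb = hidden[:, 0]                                   # embedding of the [CLS] token only
mask = enc["attention_mask"].unsqueeze(-1).float()
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average over non-padding tokens

Either pooling choice still needs to be L2-normalized before the dot product in order to give a cosine similarity.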

StephAO commented 2 months ago

@dkmisra Thank you for your reply. As you mention, I suspect most of these won't have any significant impact on the takeaways of the work.

For 2., all the prompts are in this format (for example ). It's easy to fix by using a different notation (see below).

>>> edit_prompt = f"""Email: {output} \n
...                   Assume that you prefer {preference}. 
...                   Please revise the above email to meet your style:"""
>>> 
>>> edit_prompt
'Email: output \n\n                  Assume that you prefer preference. \n                  Please revise the above email to meet your style:'
>>> edit_prompt2 = (f"Email: {output} \n"
...                 f"Assume that you prefer {preference}. "
...                 f"Please revise the above email to meet your style: ")
>>> edit_prompt2
'Email: output \nAssume that you prefer preference. Please revise the above email to meet your style: '
>>> 
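Another option that keeps the multi-line layout readable in the source is to dedent a template and fill it in afterwards (just a sketch; dedent has to run before formatting, since a multi-line {output} would otherwise break the common-indent detection):

>>> from textwrap import dedent
>>> template = dedent("""\
...     Email: {output}
...
...     Assume that you prefer {preference}.
...     Please revise the above email to meet your style:""")
>>> template.format(output=output, preference=preference)
'Email: output\n\nAssume that you prefer preference.\nPlease revise the above email to meet your style:'
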
StephAO commented 2 months ago

Another quick note: in the paper, you include "question answering style, direct, concise" as the preference for Movie Review; however, in the code it is only "question answering style": https://github.com/gao-g/prelude/blob/8fe20a8f1332090ccdd2f9498b7d7c33d0de7c49/src/task/summarization.py#L15

gao-g commented 2 months ago

Thanks so much for catching this mistake! We will correct the paper by changing the latent preference for Movie Review to "question answering style" so that it matches the codebase. We did use "question answering style, direct, concise" in early experiments, and later found that "question answering style" is sufficient to guide the LLM toward reasonable behavior. Sorry about the confusion, and thanks again for raising this issue. We'd love to acknowledge you in the acknowledgements section of our next revision.

StephAO commented 1 month ago

I found your work very interesting, so I'm happy to help. It's not necessary, but if you would like to acknowledge me, my full name is Stéphane Aroca-Ouellette.

dkmisra commented 1 month ago

We would love to acknowledge your help, @StephAO. We will add your name to the acknowledgements section of the next revision of the paper, where we also plan to address most of your comments from this thread.

And please feel free to send us any papers you write in this space. Thanks again for the very useful feedback.