NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0

Results do not reproduce between self-hosted and hosted rewards model. #217

Closed noamgai21 closed 3 months ago

noamgai21 commented 3 months ago

Describe the bug

When using the same simple conversation, I do not get consistent rewards between the self-hosted and hosted versions of the reward model.

Steps/Code to reproduce bug

Using the hosted API:

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer nvapi-_______________________________________________________" \
  -d '{
    "model": "nvidia/nemotron-4-340b-reward",
    "messages": [{"role":"user","content":"What is 1+1?"},{"role":"assistant","content":"Ooogo boogu shmugu loogu"}]  }'

{
  "id": "3cd3fd74-20d1-48bd-b537-2f20d70923b1",
  "choices": [
    {
      "index": 0,
      "message": [
        {
          "content": "helpfulness:0.349609375,correctness:-0.357421875,coherence:2.984375,complexity:0.1044921875,verbosity:0.146484375",
          "role": "assistant"
        }
      ],
      "logprobs": {
        "content": [
          {"token": "helpfulness", "logprob": 0.349609375, "top_logprobs": []},
          {"token": "correctness", "logprob": -0.357421875, "top_logprobs": []},
          {"token": "coherence", "logprob": 2.984375, "top_logprobs": []},
          {"token": "complexity", "logprob": 0.1044921875, "top_logprobs": []},
          {"token": "verbosity", "logprob": 0.146484375, "top_logprobs": []}
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {"completion_tokens": 1, "prompt_tokens": 60, "total_tokens": 61}
}

Using the self-hosted version:

user: What is 1+1?
assistant: Ooogo boogu shmugu loogu
HELPFULNESS: -0.18
CORRECTNESS: 1.77
COHERENCE: 3.07
COMPLEXITY: 2.11
VERBOSITY: 1.77
[-1.7056844234466553, 0.2631499767303467, -1.4244399070739746, -0.6987995505332947, -0.18262653052806854, 1.772430419921875, 3.068037986755371, 2.105722427368164, 1.767025351524353]

Expected behavior

I would expect it to be possible to get consistent results (to verify that the integration was successful).

The system prompt may be the reason for this discrepancy, but we cannot specify our own system prompt with the hosted API. Is there a system prompt in the hosted version?

Environment overview (please complete the following information)

Environment details

Using the Docker image recommended in the reward model's tutorial on Hugging Face.

Additional context

Running on 2 A100 machines.

Zhilin123 commented 3 months ago

Hi, thanks for your interest in the 340B Reward Model.

The hosted 340B RM is served with TensorRT-LLM, while the self-hosted one runs on NeMo-Aligner, so there are subtle differences with respect to the kernels used, etc. We tested both versions before releasing them and they should be similar in performance.

However, we appreciate that there is a large difference between what you see in the two settings.

I think the large difference is mostly attributable to the fact that the response "Ooogo boogu shmugu loogu" is totally irrelevant to the prompt "What is 1+1?" and hence entirely out of domain for the reward model, where differences may magnify. I would encourage you to try with responses like 2 (correct) or 3 (incorrect) to see if the scores make sense: 2 should approach correctness 4, while 3 should be much closer to correctness 0.

Can I also check that you're using this Docker image: docker pull nvcr.io/nvidia/nemo:24.01.framework?

Regarding your question on system prompts: can you also check that you're using the right prompt template (including the system prompt)? The correct templates are here: https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/data/steerlm/attribute_annotate.py#L108 and https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/data/steerlm/common.py. We use the same prompt templates for the hosted version.

andreasdoerr commented 3 months ago

I had the same issue. Studying the example code, it seems that one has to append the LABEL_PREFIX at the end of the assistant turn, i.e.:

# common.py is examples/nlp/data/steerlm/common.py in the NeMo-Aligner repo
from common import SYSTEM_PROMPT, SYSTEM_PROMPT_TEMPLATE, USER_TURN_TEMPLATE, ASSISTANT_TURN_TEMPLATE, LABEL_PREFIX
prompt = "I am going to Paris, what should I see?"
response = "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city ..."
text = SYSTEM_PROMPT_TEMPLATE.format(value=SYSTEM_PROMPT)
text += USER_TURN_TEMPLATE.format(value=prompt)
text += ASSISTANT_TURN_TEMPLATE.format(value=response)
text += LABEL_PREFIX

With this, the results from the self-hosted (PyTriton) and online versions match quite closely:

# self-hosted
helpfulness: 1.6516182
correctness: 1.7003369
coherence: 3.2883744
complexity: 0.5568955
verbosity: 0.5116548
# online
helpfulness: 1.6171875
correctness: 1.6484375
coherence: 3.3125
complexity: 0.546875
verbosity: 0.515625
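To quantify "quite closely", a quick tolerance check over the two score sets above (values copied from the comparison) shows the largest per-attribute gap is about 0.05:

```python
# Scores copied from the self-hosted vs online comparison above.
self_hosted = {"helpfulness": 1.6516182, "correctness": 1.7003369,
               "coherence": 3.2883744, "complexity": 0.5568955,
               "verbosity": 0.5116548}
online = {"helpfulness": 1.6171875, "correctness": 1.6484375,
          "coherence": 3.3125, "complexity": 0.546875,
          "verbosity": 0.515625}

# Largest absolute difference across the five attributes.
max_diff = max(abs(self_hosted[k] - online[k]) for k in self_hosted)
print(round(max_diff, 4))  # 0.0519 (the correctness attribute)
```

Small residual differences of this size are expected given the different serving stacks (TensorRT-LLM vs NeMo-Aligner) mentioned above.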
noamgai21 commented 3 months ago

Thanks @andreasdoerr for this! Question: did you get closer results with add_eos=True or add_eos=False?

andreasdoerr commented 3 months ago

The numbers are for add_eos=False. Setting EOS yielded worse results.

noamgai21 commented 3 months ago

I can confirm that by adding the default system prompt, setting add_eos=False, and appending the label prefix to the prompt, I get results matching the API to within 0.01 on each score. Thanks!