cognitivecomputations / OpenChatML

Wolfram Ravenwolf's comments on OpenChatML #4

WolframRavenwolf opened this issue 4 months ago

WolframRavenwolf commented 4 months ago

Hi Eric,

excellent idea! I've always been in favor of a standardized, future-proof prompt format that is both simple and unambiguous. Up until now, I was recommending ChatML (for lack of a better template), but now that there's OpenChatML, I hope we can improve and standardize this. I have read the specification and here are my comments:

Tokens

Message Structure

Also consider an output format specifier (e.g. JSON, YAML) that clearly tells the model which format the response should use.
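
For illustration, a user message with a hypothetical format qualifier could look like this (the "format" qualifier is just an example, reusing the <|start_of|>… / <|end_of|>… tags suggested below; it's not part of the current spec):

<|im_start|>user <|start_of|>format json<|end_of|>format List the three primary colors.<|im_end|>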

Thought Structure

Good idea! Also consider emotions and actions. Those are another layer that could deserve its own tags (or a general tag with a specific qualifier), since the AI should be able to output emotional states and real or simulated actions in a structure independent of the actual message. For example, TTS could use the emotion as a generation parameter for how to speak, without the emotional state being said out loud. Actions, which are otherwise often asterisk-delimited, would likewise be more useful and less ambiguous in a clearly defined format (and for video generation, VR, or robot control, they might later also be cleanly separated from the written/spoken text). That's why I suggest <|start_of|>… / <|end_of|>….

Fill-in-the-Middle Tasks

We could get rid of <|fim_middle|>: insertion always happens after the prefix and before the suffix, so the insertion position is unambiguous without an extra tag.

Multi-File Sequences

Introducing optional filenames and clear <|file_start|> / <|file_end|> markers (instead of just separators) could streamline the handling of multiple file inputs, ensuring clear demarcation of text and file content within the same prompt. That way, text that is not part of any file can appear before or after the files. The end tag could sit on the same line as the file's last line (if the file has no trailing linebreak) or on its own line after a newline (if it does), so even trailing whitespace is reflected unambiguously in the prompt.

Examples

Here's how we'd write the original examples with my suggested changes:

Example conversation:

<|im_start|>user Hello there, AI.<|im_end|>
<|im_start|>char Hi. Nice to meet you.<|im_end|>
<|EOS|>

Example conversation with speaker name:

<|im_start|>user Wolfram: Hello there, AI.<|im_end|>
<|im_start|>char Amy: Hi Wolfram. Nice to meet you.<|im_end|>
<|EOS|>

Example fill-in-the-middle task:

<|fim_prefix|>The capital of France is <|fim_suffix|>, which is known for its famous Eiffel Tower.
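
The model's completion (here "Paris") is then inserted directly between prefix and suffix, yielding the full text without any <|fim_middle|> marker:

The capital of France is Paris, which is known for its famous Eiffel Tower.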

Example with thought block:

<|im_start|>user What is 17 * 34?<|im_end|>
<|im_start|>char <|start|>thought To multiply 17 by 34, we can break it down:
17 * 34 = 17 * (30 + 4)
        = (17 * 30) + (17 * 4)
        = 510 + 68
        = 578<|end|>thought
17 * 34 = 578.<|im_end|>
<|EOS|>

Example multi-file sequence:

<|file_start|>1st file.txt
This is the content from the first file.
<|file_end|>
<|file_start|>2nd file.txt
This is the content from the second file.
And this is more content from the second file.
<|file_end|>
<|file_start|>3rd file.txt
Finally, this is the content from the third file.
<|file_end|>
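
Note that each file here ends with a trailing linebreak, so <|file_end|> appears on its own line; a file without a trailing linebreak would instead end as "…content of the last line.<|file_end|>".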

New: multi-character chat example:

<|im_start|>system <|start_of|>action Amy appears and greets Wolfram.<|end_of|>action<|im_end|>
<|im_start|>user Wolfram: Hello there, AI.<|im_end|>
<|im_start|>char Amy: <|start_of|>feeling happy<|end_of|>feeling Hi Wolfram. Nice to meet you.<|im_end|>
<|im_start|>user Wolfram: And who's that?<|im_end|>
<|im_start|>char Amy: That's my sister, Ivy.<|im_end|>
<|im_start|>char Ivy: <|start_of|>feeling curious<|end_of|>feeling Hi Wolfram. How are you?<|im_end|>
<|im_start|>user Wolfram: Oh, there's two of you?<|im_end|>
<|im_start|>char Ivy: Yes, of course, why not?<|im_end|>
<|im_start|>char Amy: Yeah, multi-user chats are fun! <|start_of|>action laughs<|end_of|>action<|im_end|>
<|EOS|>

Command R prompt template

Finally, as a big fan of the Prompting Command R document and its very useful prompt additions (e.g. the Safety Preamble and Style Guide), I'd very much welcome such features in any model using the OpenChatML format.
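
For illustration, a hypothetical OpenChatML system message carrying such sections might look like this (the section headings are borrowed from the Command R prompt document and are not part of any spec):

<|im_start|>system
## Safety Preamble
Refuse requests that could cause harm; otherwise answer as helpfully as possible.
## Style Guide
Answer in full sentences, using proper grammar and spelling.
<|im_end|>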

electricazimuth commented 4 months ago

Love the extensibility of the <|start|>thing and <|end|>thing markup. Couldn't it be applied to all the tags, though, and simplify everything? E.g.:

<|start|>message <|start|>system <|start|>action Amy appears and greets Wolfram.<|end|>action<|end|>message   
<|start|>file 1st file.txt <|end|>file
WolframRavenwolf commented 4 months ago

Using <|start|> and <|end|> with qualifier strings would cost us more tokens, though, as the combination couldn't tokenize as a single special token. Especially with tags used as frequently as these, we'd waste a whole bunch of tokens over the context.

It's also a drawback of my own proposal regarding thoughts, and it might actually be better to have more special tokens (one for thoughts, one for actions, one for emotions?) instead of a super-flexible one that is only used in a few combinations but costs a lot of tokens over the whole context. I don't yet know myself which would be better, so I'm just making suggestions and encouraging you to think about it.
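
To make the cost concrete, here is a minimal sketch, assuming the Hugging Face transformers library and GPT-2's tokenizer as a stand-in (the tag names are illustrative, not part of any spec), comparing a generic tag plus qualifier string against one dedicated special token:

```python
# Rough comparison of token counts; tag names are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer works as a stand-in

# Register both variants as special tokens so each <|...|> tag maps to one ID.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|start|>", "<|thought_start|>"]}
)

generic = tokenizer.encode("<|start|>thought")     # 1 special token + qualifier text token(s)
dedicated = tokenizer.encode("<|thought_start|>")  # exactly 1 special token

print(len(generic), len(dedicated))  # the generic variant is always longer
```

The generic variant pays the extra qualifier token(s) on every single use, so the overhead scales with how often the tags appear in the context.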

WolframRavenwolf commented 4 months ago

Changed my mind on the generic <|start|> and <|end|> tags:

I thought having a universal tag to start and end special messages like thoughts, emotions, and actions would make sense. However, after seeing the number of special tokens Meta reserved in the Llama 3 Instruct tokenizer (around 250 <|reserved_special_token_…|> slots), I now think it would be better to just have a bunch of dedicated special tokens instead of universal ones plus regular strings. That would save in-context tokens, prevent the qualifier string from influencing output in unintended ways, and be more readable.

pugzly commented 4 months ago

Yes, superficially it might seem like a good idea to have "universal" tokens, but reusing special tokens for different modes and actions would likely confuse the model, reducing its ability to follow instructions and degrading its performance, especially post-quantization and with large context.