cognitivecomputations / OpenChatML

A few thoughts/questions: #3

Open josephrocca opened 6 months ago

josephrocca commented 6 months ago

Please excuse any ignorance here - I don't have a lot of experience in the lower levels / finer details of tokenizers and fine-tuning. Also, I don't expect a reply to each of these points, to be clear - just dropping some thoughts for you to skim in case they're helpful.

electricazimuth commented 6 months ago

Thinking about this in terms of a user interface, and trying to reduce cognitive load for the non-technical users who are typing these in, I think the full "word-y" versions that have _end and _start in the markup tags are probably best.

I do prefer the underscore version (start_of_thought) over the compressed version (startofthought). As for using colons or backslashes in an XML style, I think it's really easy for a tired-eyed or distracted user to mix up <|thought:|> with <|:thought|>. Just looking at that now and searching for where the ":" should go gives me an annoying amount of cognitive load!

In the hope of instant understandability, I'd suggest changing the "im" prefix to just "message", e.g. im_start => message_start.

I love the idea of having a rule to semantically annotate supplied data using something like "file_separator". Currently I have to trial different "user land" solutions (new lines with asterisks, equals signs, etc.), a lot of which fail. It would be a huge benefit for my workflow to have at least something in the spec to aim for. That said, "file_separator" could be something more generic like "data_separator" or "info_snippet"; in my case I wouldn't class the stuff I'm supplying as files.

SamuelTallet commented 6 months ago

im should stand for "input message".

Source: https://community.openai.com/t/what-do-the-im-start-and-im-end-tokens-mean/145727/2

electricazimuth commented 6 months ago

This is the point: the terms should be semantic, and no one should need to search to find out what they mean. Using plain and direct terms helps everyone use it. In this case they'll all get tokenised, so length isn't much of an issue, and there's no reason to use a shortened, technical or unobvious term.

josephrocca commented 6 months ago

Agreed.

FWIW, I'd definitely change _start and _end to something which makes them distinct from the name of the tag - e.g. something like <|thought:start|> <|thought:end|> keeps syntactic clarity and avoids tired-eyes mistakes that you mentioned earlier. Choosing to mix start/end semantics with tag name syntax seems nice for the simplicity, but is something that I think could come back to bite if this needs to be extended - e.g. you can end up with stuff like <|foo_start_start|> or <|start_foo_start|>, and potential separators that have start/end (or synonyms) in their name could be confusing. If :end/:start isn't used, then separator tags should probably always end in _separator - which isn't too bad, but it could just be <|foo|> (i.e. no :end/:start implies self-closing tag).

This may all seem pedantic, but I see no downside in being explicit here, and indeterminate upside. This field is young enough that it may turn out we need some weird stuff - e.g. tags with properties, at which point you'd of course not want to use underscores to delimit the properties too. I wouldn't over-complicate a spec for some completely unforeseen factors that might require it, of course (can always just shrug and write a new spec), but I don't see this as a complication - rather, I see _start, _end as the complication (i.e. overloading of _).
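
To make the overloading concrete, here is a minimal Python sketch. It is not part of the OpenChatML spec: the tag grammar, both function names, and the "range_end" separator are hypothetical, chosen only to illustrate the point. It parses the same kind of tag under a colon convention and under an underscore convention.

```python
import re

# Hypothetical grammar (an assumption, not the spec):
#   <|name|>        self-closing tag / separator
#   <|name:start|>  opening tag
#   <|name:end|>    closing tag
TAG_RE = re.compile(r"<\|([a-z_]+)(?::(start|end))?\|>")


def parse_colon_tag(tag):
    """Return (name, role) for the colon style; role is read from syntax."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid tag: {tag!r}")
    name, role = match.groups()
    return name, role or "self_closing"


def parse_underscore_tag(tag):
    """Return (name, role) for the underscore style; role is guessed from the name."""
    inner = tag[2:-2]  # strip the "<|" and "|>" delimiters
    for role in ("start", "end"):
        suffix = "_" + role
        if inner.endswith(suffix):
            return inner[: -len(suffix)], role
    return inner, "self_closing"


# A separator whose own name happens to end in "end" (hypothetical):
print(parse_colon_tag("<|range_end|>"))        # ('range_end', 'self_closing')
print(parse_underscore_tag("<|range_end|>"))   # ('range', 'end')
print(parse_colon_tag("<|range_end:start|>"))  # ('range_end', 'start')
```

With the colon convention the role comes from the syntax, so a name containing "start" or "end" stays unambiguous. With the underscore convention the parser has to guess from the name, and a separator called range_end is misread as the closing tag of "range".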

WolframRavenwolf commented 6 months ago

Great arguments here from everyone involved! You've said everything that's on my mind at the moment. So I have nothing to add except a few thumbs up and encouragement for everyone else who comes here to read all the comments at length!