hi authors, thanks for the great work!
I have a question regarding the tokenization process implemented in the repo. It appears that the before_ids, target_ids, after_ids, and optim_str_ids are tokenized separately. However, when reintegrating optim_str back into the original messages and performing tokenization again, the token IDs for the optim_str segment may differ from those generated when optim_str is tokenized independently, without preceding context.
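Here is a minimal sketch of the boundary effect I mean; the prompt strings and the GPT-2 tokenizer are just placeholders for illustration, not the repo's actual setup:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

before_str = "Write a story about"
optim_str = " x x x x"
after_str = " the ocean."

# Pieces tokenized independently and then concatenated,
# which is how the repo appears to build the input IDs
before_ids = tokenizer(before_str, add_special_tokens=False).input_ids
optim_ids = tokenizer(optim_str, add_special_tokens=False).input_ids
after_ids = tokenizer(after_str, add_special_tokens=False).input_ids
concat_ids = before_ids + optim_ids + after_ids

# The same text tokenized as one contiguous string
full_ids = tokenizer(before_str + optim_str + after_str,
                     add_special_tokens=False).input_ids

# These may disagree (depending on the tokenizer and the strings),
# because BPE merges can cross the segment boundaries
print(concat_ids == full_ids)
print(concat_ids)
print(full_ids)
```

So the model may end up being optimized against token IDs that would never actually be produced when the full prompt containing optim_str is tokenized end to end.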
Will this be addressed in a later version?
Thanks!