tsw123678 opened 3 weeks ago
Are there any differences in the _make_masks function across different LLM models? Don't they all compute loss only for the response part? What causes the variations among them?
Different models use different tokenizers (and different prompt/chat templates), so the same text is split into different tokens and the prompt/response boundary lands at different positions. The loss is still computed only on the response part, but where that response starts in the token sequence, and therefore where the label mask begins, varies from model to model.
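A minimal sketch of the idea (not the repository's actual `_make_masks` implementation; the helper name `make_labels` and the model names are just examples): prompt tokens are masked with `-100` so the loss only covers the response, and the masked length depends on how each tokenizer splits the prompt.

```python
import torch
from transformers import AutoTokenizer

def make_labels(prompt: str, response: str, tokenizer):
    """Build input_ids and labels where prompt tokens are masked with -100."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    # Loss is ignored (-100) on the prompt, computed only on the response.
    labels = [-100] * len(prompt_ids) + response_ids
    return torch.tensor(input_ids), torch.tensor(labels)

# Different tokenizers produce different prompt lengths, so the point where
# masking stops differs per model even for identical text.
prompt, response = "User: Hello\nAssistant: ", "Hi there!"
for name in ("meta-llama/Llama-2-7b-hf", "Qwen/Qwen2-7B"):  # example models
    tok = AutoTokenizer.from_pretrained(name)
    ids, labels = make_labels(prompt, response, tok)
    print(name, "masked prompt tokens:", int((labels == -100).sum()))
```

So the variation across models is not in the intent (response-only loss) but in how many tokens get masked and where the response span begins.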