Unfortunately, our current method of getting the final-layer LM predictions relies on the `return_offsets_mapping` flag on the tokenizer, which isn't supported for "non-fast" tokenizers (those implemented in Python rather than Rust). Both the LLaMA tokenizer and the UnifiedQA v2 tokenizer are currently implemented in Python.
This PR changes how we concatenate questions and answers: concatenation now happens after tokenization rather than before, which removes the dependency on offset mappings and lets us support slow tokenizers.
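A rough sketch of the idea (using a toy stand-in tokenizer, not the PR's actual code): if the question and answer are tokenized separately and their token ids concatenated afterwards, the index where the answer begins is known directly from the length of the question's ids, so no offset mapping is needed. Note that real tokenizers may insert special tokens (e.g. BOS), which the actual implementation has to account for.

```python
def concat_after_tokenization(tokenize, question, answer):
    """Return (input_ids, answer_start) without needing offset mappings.

    `tokenize` is any callable mapping a string to a list of token ids,
    including a slow (pure-Python) tokenizer.
    """
    q_ids = tokenize(question)   # token ids for the question alone
    a_ids = tokenize(answer)     # token ids for the answer alone
    input_ids = q_ids + a_ids    # concatenation happens post-tokenization
    answer_start = len(q_ids)    # answer tokens begin right after the question
    return input_ids, answer_start


# Toy whitespace "tokenizer" standing in for a slow tokenizer:
_vocab = {}
def toy_tokenize(text):
    return [_vocab.setdefault(tok, len(_vocab)) for tok in text.split()]


ids, start = concat_after_tokenization(toy_tokenize, "what is 2 + 2 ?", "4")
# start == 6: the answer's tokens are ids[6:], with no offset bookkeeping.
```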