polplop closed this issue 5 months ago
Investigating a solution.
Related: https://github.com/huggingface/transformers/issues/31030
Problem:
It appears the tokenizer represents token 198 differently between tokenizer.vocabulary() and tokenizer.decode():
>>> tokenizer.decode([198])
['\n']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 198][0][0]
'Ċ'
This isn't the case for other tokens
>>> tokenizer.decode([10])
['+']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 10][0][0]
'+'
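This is expected behavior for byte-level BPE tokenizers like Llama 3's: the vocabulary stores every byte as a printable placeholder character (GPT-2's byte-to-unicode trick), so the newline byte appears as 'Ċ' and a leading space as 'Ġ', while printable ASCII bytes like '+' map to themselves. A minimal sketch of that mapping (an independent reimplementation for illustration, not outlines or transformers code):

```python
def bytes_to_unicode():
    # GPT-2-style byte-to-printable-unicode table used by byte-level BPE
    # tokenizers: every byte gets a printable character, so vocabulary
    # entries never contain raw control characters.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are remapped into the U+0100+ range.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()
print(table[ord("\n")])  # 'Ċ'
print(table[ord(" ")])   # 'Ġ'
print(table[ord("+")])   # '+'  (printable bytes map to themselves)
```

So the vocabulary string 'Ċ' and the decoded string '\n' are two views of the same byte, not a corruption of the vocabulary.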
from transformers import AutoTokenizer

from outlines.models.transformers import TransformerTokenizer

tokenizer = TransformerTokenizer(
    AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-8B-Instruct-abliterated-v3")
)
bad_tokens = []
for vocab_token_str, token_id in tokenizer.vocabulary.items():
    decoded_token_str = tokenizer.decode([token_id])[0]
    if decoded_token_str != vocab_token_str:
        bad_tokens.append((decoded_token_str, vocab_token_str))

if bad_tokens:
    bad_tok_output = '\n'.join(map(repr, bad_tokens))
    raise Exception(f"Found {len(bad_tokens)} bad tokens: {bad_tok_output}")
Found these inconsistent tokens:
E Exception: Found 78029 bad tokens: (' ROOM', 'ĠROOM')
E (' 않는', 'ĠìķĬëĬĶ')
E (' Overse', 'ĠOverse')
E (' slov', 'Ġslov')
E ('�', 'æ¦')
E (' Infragistics', 'ĠInfragistics')
E ('�', 'çĻ')
E (' DIFF', 'ĠDIFF')
E (' 武', 'ĠæѦ')
E (' eighth', 'Ġeighth')
...
I'm looking into whether we should be constructing a "true vocabulary" by decoding each token.
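A sketch of that idea, keying the vocabulary by what decode() actually produces rather than by the raw byte-level strings. FakeTokenizer and its two entries are purely illustrative stand-ins for outlines' TransformerTokenizer, not the real API:

```python
class FakeTokenizer:
    # Stand-in for a byte-level tokenizer: vocabulary stores placeholder
    # strings, decode() returns the real text.
    vocabulary = {"Ċ": 198, "+": 10}
    _decoded = {198: "\n", 10: "+"}

    def decode(self, ids):
        return [self._decoded[i] for i in ids]

def true_vocabulary(tok):
    # Build a "true vocabulary": decoded string -> token id.
    return {tok.decode([tid])[0]: tid for tid in tok.vocabulary.values()}

tok = FakeTokenizer()
print(true_vocabulary(tok))  # {'\n': 198, '+': 10}
```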
It appears we already have a method to normalize:
class TransformerTokenizer(Tokenizer):
    ...

    def convert_token_to_string(self, token: str) -> str:
        from transformers.file_utils import SPIECE_UNDERLINE

        string = self.tokenizer.convert_tokens_to_string([token])
Investigating why this normalization failed to prevent a literal \n during generation.
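The normalization that convert_tokens_to_string performs for byte-level tokenizers can be sketched by inverting the byte table: map each placeholder character back to its byte, then decode as UTF-8. This is an independent reimplementation for illustration (it assumes the inverted bytes form valid UTF-8), not the transformers code path itself:

```python
def byte_decoder():
    # Inverse of the GPT-2 byte-to-unicode table: placeholder char -> byte.
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = list(bs)
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}

DECODER = byte_decoder()

def normalize(vocab_token: str) -> str:
    # Recover the decoded string from a raw vocabulary entry.
    return bytes(DECODER[ch] for ch in vocab_token).decode("utf-8")

print(repr(normalize("ĠROOM")))  # ' ROOM'
print(repr(normalize("Ċ")))      # '\n'
```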
Describe the issue as clearly as possible:
I'm currently attempting to summarize an article and classify its relevance. This worked fine on outlines 0.0.36, but upgrading to outlines 0.0.43 produces a validation error that did not occur before.
I have tried:
The model seems unable to generate valid JSON; an "Invalid control character at" error occurs during pydantic validation.
Notes: Ubuntu 22.04 (kernel #20~22.04.1-Ubuntu) on an AWS instance with an A10G GPU, CUDA 12.1, llama_cpp_python==0.2.77, outlines==0.0.43
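The "Invalid control character at" message comes from Python's strict JSON parser: a raw, unescaped control character such as a literal newline inside a JSON string is rejected, which is exactly what happens when the model emits a real '\n' inside a string value instead of the escaped form. A minimal reproduction:

```python
import json

# A raw newline inside a JSON string value is rejected by the strict parser.
raw = '{"summary": "line one\nline two"}'
try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print(err)  # an "Invalid control character at" message

# The escaped form parses fine:
assert json.loads('{"summary": "line one\\nline two"}') == {"summary": "line one\nline two"}
```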
Context for the issue:
I would like to improve the performance of my summarization and classification pipeline with the newer Llama 3 GGUF models. The current pipeline on the older outlines 0.0.36 also has some number-formatting issues.
No other issue has reported problems with Llama 3 GGUFs, but every finetune I have tried shows the same behavior. Either I'm doing something wrong or there is a significant Llama 3 GGUF issue that deserves a discussion. Thank you!