daulet / tokenizers

Go bindings for HuggingFace Tokenizer
MIT License

Updated tokenizers for support with Llama models #10

Closed: sam-ulrich1 closed this 10 months ago

sam-ulrich1 commented 11 months ago

Updated tokenizers to 0.14.1 for Llama support.

Could you take a look at this function? The double pattern match looks ugly, and I'm not sure whether there is a better way.

#[no_mangle]
pub extern "C" fn from_bytes_with_truncation(bytes: *const u8, len: u32, max_len: usize, dir: u8) -> *mut Tokenizer {
    let bytes_slice = unsafe { std::slice::from_raw_parts(bytes, len as usize) };

    match Tokenizer::from_bytes(bytes_slice) {
        Ok(mut tokenizer) => {
            match tokenizer.with_truncation(Some(tokenizers::tokenizer::TruncationParams{
                max_length: max_len,
                direction: match dir {
                    0 => tokenizers::tokenizer::TruncationDirection::Left,
                    1 => tokenizers::tokenizer::TruncationDirection::Right,
                    _ => panic!("invalid truncation direction"),
                },
                ..Default::default()
            })) {
                Ok(_) => Box::into_raw(Box::new(tokenizer)),
                Err(err) => {
                    println!("failed to apply truncation to tokenizer: {}", err);
                    std::ptr::null_mut()
                }
            }
        },
        Err(err) => {
            println!("failed to create tokenizer: {}", err);
            std::ptr::null_mut()
        }
    }
}
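One possible way to flatten the double match, sketched under assumptions: move the fallible steps into a helper that returns `Result` and chain them with `?`, so the exported function needs only a single `match`. The `Tokenizer`, `TruncationParams`, `TruncationDirection`, and `Error` types below are minimal stand-ins for the real `tokenizers` crate types so that the snippet compiles on its own, and `build` is a hypothetical helper name. The sketch also makes three behavioral changes: it returns an error for an unknown direction instead of panicking, rejects a null input pointer, and logs to stderr.

```rust
use std::fmt;

// Stand-ins for the `tokenizers` crate types (assumption: only the parts
// used below are modeled; the real types carry far more state).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TruncationDirection {
    Left,
    Right,
}

#[derive(Debug)]
pub struct TruncationParams {
    pub max_length: usize,
    pub direction: TruncationDirection,
}

#[derive(Debug)]
pub struct Tokenizer {
    pub truncation: Option<TruncationParams>,
}

#[derive(Debug)]
pub struct Error(String);

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Tokenizer {
    // Stand-in parser: any non-empty input counts as a valid config.
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, Error> {
        if bytes.is_empty() {
            return Err(Error("empty input".to_string()));
        }
        Ok(Tokenizer { truncation: None })
    }

    // Mirrors the fallible shape matched on in the original snippet.
    pub fn with_truncation(
        &mut self,
        params: Option<TruncationParams>,
    ) -> Result<&mut Self, Error> {
        self.truncation = params;
        Ok(self)
    }
}

// Hypothetical helper: `?` replaces both nested matches, and an invalid
// direction becomes an error instead of a panic.
fn build(bytes: &[u8], max_len: usize, dir: u8) -> Result<Tokenizer, Error> {
    let direction = match dir {
        0 => TruncationDirection::Left,
        1 => TruncationDirection::Right,
        other => return Err(Error(format!("invalid truncation direction: {}", other))),
    };
    let mut tokenizer = Tokenizer::from_bytes(bytes)?;
    tokenizer.with_truncation(Some(TruncationParams {
        max_length: max_len,
        direction,
    }))?;
    Ok(tokenizer)
}

// The export attribute from the original is left out so the sketch builds
// standalone; the signature is otherwise unchanged.
pub extern "C" fn from_bytes_with_truncation(
    bytes: *const u8,
    len: u32,
    max_len: usize,
    dir: u8,
) -> *mut Tokenizer {
    if bytes.is_null() {
        return std::ptr::null_mut();
    }
    let bytes_slice = unsafe { std::slice::from_raw_parts(bytes, len as usize) };
    match build(bytes_slice, max_len, dir) {
        Ok(tokenizer) => Box::into_raw(Box::new(tokenizer)),
        Err(err) => {
            eprintln!("failed to create tokenizer: {}", err);
            std::ptr::null_mut()
        }
    }
}
```

In the actual bindings, the stand-in types would be swapped for the crate's own, and `build` would return whatever error type `from_bytes` and `with_truncation` produce. The caller on the Go side still sees a null pointer on any failure, as before.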

daulet commented 10 months ago

I've updated the Rust tokenizers version in #12. Please take a look and see whether that covers this, @sam-ulrich1.

sam-ulrich1 commented 10 months ago

If it uses the latest transformers it should be good!

daulet commented 10 months ago

> If it uses the latest transformers it should be good!

Try loading this config for llama2: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/tokenizer.json