andysalerno / guider


not an issue - fyi - https://github.com/fullstackwebdev/localLLM_guidance #1

Closed johndpope closed 7 months ago

johndpope commented 11 months ago

I found your repo while digging through GitHub: https://github.com/microsoft/guidance/issues/328

I'll have a play with this repo soon, but I thought I'd share the repo above (localLLM with guidance) as a great showcase.

It seems to need some of the code you're working on.

UPDATE

I started playing with your code and I like the streaming code. I really want to plug a Llama 2 model into guidance.

I attempted to use this, but no joy; it's looking for safetensors.

I also have TheBloke_Llama-2-13B-chat-GGML (116gb) in ./models/, but that doesn't work either.
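
For reference, that GGML checkpoint is a single llama.cpp binary rather than sharded .safetensors/.bin weights, so the GPTQ/safetensors-based executors can't read it; it wants a llama.cpp-backed loader (like the llamacpp executor further down). A rough sketch with llama-cpp-python, where the exact filename inside the GGML folder is a placeholder:

# Rough sketch, not this repo's code: load a GGML file directly with llama-cpp-python.
# Older llama-cpp-python releases read GGML; newer ones expect GGUF.
from llama_cpp import Llama  # pip install llama-cpp-python

# placeholder filename inside the downloaded GGML folder
llm = Llama(model_path="./models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q4_0.bin")

out = llm("Q: Name one planet. A:", max_tokens=32)
print(out["choices"][0]["text"])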


def run(server_class=HTTPServer, handler_class=MyHandler):
    global MODEL_EXECUTOR

    # if len(sys.argv) < 3:
    #     raise Exception(
    #         "Expected to be invoked with two arguments: model_name and executor"
    #     )

    # hardcoded for now; originally read from the command line
    model_name = "remyxai/ffmperative-7b"  # sys.argv[1]

    MODEL_EXECUTOR = "autogptq"  # sys.argv[2]

    setup_models(model_name)

    server_address = ("0.0.0.0", 8000)
    httpd = server_class(server_address, handler_class)
    print("Starting httpd...\n")
    httpd.serve_forever()

if __name__ == "__main__":
    run()
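
Side note, not from the repo: the commented-out check above suggests model_name and MODEL_EXECUTOR were meant to come from the command line, so a small fallback like this (keeping the hardcoded values as defaults) would avoid editing the file to switch models:

import sys

def parse_cli():
    # fall back to the hardcoded values when no arguments are given
    model_name = sys.argv[1] if len(sys.argv) > 1 else "remyxai/ffmperative-7b"
    executor = sys.argv[2] if len(sys.argv) > 2 else "autogptq"
    return model_name, executor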

Looking at the directory structure, I wonder: can I use pytorch_model-00002-of-00002.bin instead of .safetensors?

-rw-rw-r-- 1 oem oem  630 Aug  9 09:36 config.json
-rw-rw-r-- 1 oem oem  197 Aug  9 09:36 generation_config.json
-rw-rw-r-- 1 oem oem 1.5K Aug  9 09:36 .gitattributes
lrwxrwxrwx 1 oem oem  149 Aug  9 09:36 pytorch_model-00001-of-00002.bin -> ../../../../../home/oem/.cache/huggingface/hub/models--remyxai--ffmperative-7b/blobs/9510ce1eadd1b50ab810c81de3db04c0ea8e9367cb96f836e400b19b83157588
lrwxrwxrwx 1 oem oem  149 Aug  9 09:36 pytorch_model-00002-of-00002.bin -> ../../../../../home/oem/.cache/huggingface/hub/models--remyxai--ffmperative-7b/blobs/530ee273fc7eb08af0d1a4a64ccf467984fbc73e28c6623645ba9c9989bd4d4f
-rw-rw-r-- 1 oem oem  24K Aug  9 09:36 pytorch_model.bin.index.json
-rw-rw-r-- 1 oem oem 4.6K Aug  9 09:36 README.md
-rw-rw-r-- 1 oem oem  414 Aug  9 09:36 special_tokens_map.json
-rw-rw-r-- 1 oem oem  749 Aug  9 09:36 tokenizer_config.json
-rw-rw-r-- 1 oem oem 489K Aug  9 09:36 tokenizer.model
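
For what it's worth (not something I've run against this exact model): a common way to get .safetensors out of sharded pytorch_model-*.bin files is to reload the checkpoint with transformers and re-save it with safe serialization, roughly:

# Rough sketch: re-serialize the .bin shards as .safetensors.
# Needs enough memory to hold the full weights; the output directory name is arbitrary.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("remyxai/ffmperative-7b", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("remyxai/ffmperative-7b")

model.save_pretrained("./models/ffmperative-7b-st", safe_serialization=True)
tokenizer.save_pretrained("./models/ffmperative-7b-st")

Note that the exllama path additionally expects GPTQ-quantized weights, so a plain fp16 .safetensors conversion alone wouldn't be enough for it.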

UPDATE: I hacked some code to force guidance to load a specific model, but it blows up. I guess I need a wrapper around this.

def setup_models(model_name: str):
    """
    model_name: a Huggingface path like TheBloke/tulu-13B-GPTQ
    """

    # A slight improvement in memory usage by using xformers attention:
    if "--xformers" in sys.argv:
        hijack_llama_attention_xformers()

    model = None

    print(f"Loading model with executor: {MODEL_EXECUTOR}")

    if MODEL_EXECUTOR == "autogptq":
        from llama_autogptq import LLaMAAutoGPTQ

        model = LLaMAAutoGPTQ(model_name)
    elif MODEL_EXECUTOR == "gptq":
        from llama_gptq import LLaMAGPTQ

        model = LLaMAGPTQ(model_name)
    elif MODEL_EXECUTOR == "exllama":
        from llama_exllama import ExLLaMA

        model = ExLLaMA(model_name)
    elif MODEL_EXECUTOR == "ctransformers":
        from llama_ctransformers import LLaMATransformer

        model = LLaMATransformer(model_name)
    elif MODEL_EXECUTOR == "transformers":
        from llama_transformer import LLaMATransformer

        model = LLaMATransformer(model_name)
    elif MODEL_EXECUTOR == "llamacpp":
        from llama_cpp_hf import LlamacppHF

        model = LlamacppHF(model_name)

    global memory
    memory = Memory(embedding_model)
    print("Memory initialized.")

    guidance.llms.Transformers.cache.clear()
    # hack: force guidance to load this specific model by name,
    # bypassing the executor wrapper built above
    # guidance.llm = model
    guidance.llm = guidance.llms.transformers.LLaMA("remyxai/ffmperative-7b", device_map="auto")
    print(f"Token healing enabled: {guidance.llm.token_healing}")

Initializing ExLlamaGPTQ with model remyxai/ffmperative-7b
Fetching 10 files: 100%|██████████████████████████████| 10/10 [00:00<00:00, 34.08it/s]
Loading config from remyxai/ffmperative-7b/config.json
Traceback (most recent call last):
  File "/media/2TB/guider/guider_server.py", line 271, in <module>
    run()
  File "/media/2TB/guider/guider_server.py", line 262, in run
    setup_models(model_name)
  File "/media/2TB/guider/guider_server.py", line 58, in setup_models
    model = ExLLaMA(model_name)
  File "/home/oem/miniconda3/envs/torch2/lib/python3.10/site-packages/guidance/llms/_transformers.py", line 32, in __init__
    self.model_obj, self.tokenizer = self._model_and_tokenizer(model, tokenizer, **kwargs)
  File "/media/2TB/guider/llama_exllama.py", line 27, in _model_and_tokenizer
    (model, tokenizer) = _load(model_dir)
  File "/media/2TB/guider/llama_exllama.py", line 57, in _load
    exllama_hf = ExllamaHF.from_pretrained(model_dir)
  File "/media/2TB/guider/exllama_hf.py", line 112, in from_pretrained
    return ExllamaHF(config)
  File "/media/2TB/guider/exllama_hf.py", line 21, in __init__
    self.ex_model = ExLlama(self.ex_config)
  File "/media/2TB/guider/exllama/model.py", line 732, in __init__
    with safe_open(self.config.model_path, framework = "pt", device = "cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
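
My read of that error (hedged): safe_open treats the first few bytes of the file as the safetensors header length, so when the exllama path is pointed at a directory that only has pytorch_model-*.bin shards it ends up parsing garbage and raises HeaderTooLarge. Judging by the traceback, the executor wrappers already go through guidance's _transformers.py, so a pre-check plus the commented-out guidance.llm = model wiring might look like this sketch:

# Sketch only: guard the exllama path and reuse the executor wrapper when one was built.
from pathlib import Path

def has_safetensors(model_dir: str) -> bool:
    # ExLlama opens *.safetensors via safe_open, so .bin-only checkpoints will fail here
    return any(Path(model_dir).glob("*.safetensors"))

# inside setup_models(), after the executor wrapper `model` is constructed:
# if model is not None:
#     guidance.llm = model  # the wrappers already behave as guidance LLM objects
# else:
#     guidance.llm = guidance.llms.transformers.LLaMA(model_name, device_map="auto")
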
fullstackwebdev commented 7 months ago

Sorry, I have 20k notifications in my GitHub inbox and didn't see your message.

I have since forked the old version of guidance and updated the OpenAI API code

https://github.com/fullstackwebdev/handlebars-guidance

Unfortunately, in the new guidance I can't figure out how to do streaming. It's been a week and it's a holiday week, so I'll give them more time.

fullstackwebdev commented 7 months ago

Oh whoops, the updated guidance code is in the 'master' branch of my repo.

johndpope commented 7 months ago

They fixed the code upstream: https://github.com/guidance-ai/guidance/issues/328