abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[query] Inputting JSON to llama.cpp and getting output into a txt file? #1192

Open timtensor opened 7 months ago

timtensor commented 7 months ago

Hi, I have a couple of JSON files that I want to feed to an LLM using llama-cpp. I am wondering if there is a way to do that. I am planning to use Mistral-7B-Instruct as a start.

The JSON file looks as follows:

{
    "title": "Some title ",
    "description": "Some description",
    "reference": "web_url",
    "date": "xyz",
    "popularity": 0,
    "comments": [
        {
            "number": 1,
            "content": "Some Text1"
        },
        {
            "number": 2,
            "content": "Some Text2"
        }
    ]
}

I want to use llama.cpp to pass a prompt like:

./main -m 'mistral model' -p "Describe briefly the problem {description}, summarize the comments in 10 bullet points {comments}"

And write the LLM output into a text file with a format like:

  1. title
  2. description
  3. summary

Is it possible to do this via llama.cpp directly, or is it only possible through the Python bindings? I thought it would be much faster with llama.cpp without the Python bindings.

Has anyone tried something similar?

mathewpan2 commented 7 months ago

It'll be quite a lot easier to do what you want with the Python bindings. I recommend giving the docs a quick read so you know how to run inference from Python code. Below is a quick example of doing inference with the bindings.

>>> from llama_cpp import Llama
>>> llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
>>> output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
>>> print(output)
{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Basically I'd just make a Python script that parses the JSON files and then feeds them to the Llama class for inference. You can then take the output, reformat it however you want, and write it to a txt file. If you're not too familiar with programming, ChatGPT could probably write this for you.
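
For instance, a minimal sketch of that approach could look like the following (the file names, prompt wording, and output layout are placeholder assumptions, not something from the original question):

import json
from llama_cpp import Llama

llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", n_ctx=2048)

# Parse the JSON file and pull out the fields used in the prompt
with open("issue.json") as f:
    data = json.load(f)

comments = "\n".join(c["content"] for c in data["comments"])
prompt = (
    f"Describe briefly the problem: {data['description']}\n"
    f"Summarize the following comments in 10 bullet points:\n{comments}\n"
    "Summary:"
)

# Run the completion and keep only the generated text
output = llm(prompt, max_tokens=512)
summary = output["choices"][0]["text"]

# Reformat the result and write it to a text file
with open("summary.txt", "w") as f:
    f.write(f"1. {data['title']}\n2. {data['description']}\n3. {summary}\n")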

timtensor commented 7 months ago

Hi @mathewpan2, I actually have a single JSON now with different tags. I can extract information from the JSON and, as you said, feed it to the LLM for inference. However, I am not getting the output correctly. For example, I extract all the comments, and the description with test_description = data[0]['description']. I input this to the LLM as follows. LLM setup:


from llama_cpp import Llama
# Note: LlamaCppLLM comes from a wrapper library whose import is not shown in the original snippet

enable_gpu = True  # assumption: flag controlling whether layers are offloaded to the GPU

_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads all layers when GPU is enabled
    n_ctx=2048,
)
_llm.verbose = False
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=None)

and ask for a summary as follows:
_llm(f'Summarize the following {test_description}')

However, I keep getting arbitrary answers. I am not sure what the right way is, to be honest.
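
For reference, Mistral-7B-Instruct expects the [INST] ... [/INST] prompt wrapping (the same format used in the LlamaIndex snippet further down), so a bare f-string prompt can produce fairly arbitrary completions. Below is a rough sketch of two ways to call the same _llm instance, as an illustration rather than a confirmed fix:

# Option 1: wrap the request in Mistral's instruction tags manually
prompt = f"<s>[INST] Summarize the following text:\n{test_description} [/INST]"
output = _llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])

# Option 2: use the chat completion API so llama-cpp-python applies a chat template
output = _llm.create_chat_completion(
    messages=[{"role": "user", "content": f"Summarize the following text:\n{test_description}"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])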

mathewpan2 commented 7 months ago

Can you post your entire code?

timtensor commented 7 months ago

Hi, yes, I actually switched to LlamaIndex, to be honest. But it still cannot parse the JSON string properly. Not sure if this helps:

from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM  
from llama_index.core import download_loader
from llama_index.core import ServiceContext
from llama_index.core import VectorStoreIndex
from llama_index.core import SummaryIndex
from llama_index.core.response.notebook_utils import display_response
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.node_parser import JSONNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path
import locale
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.embeddings import resolve_embed_model
locale.getpreferredencoding = lambda: "UTF-8"
# Load your JSON data
json_docs = FlatReader().load_data(Path("/content/data/smalldata.json"))
# Initialize the JSONNodeParser
parser = JSONNodeParser()
# Parse the loaded data
nodes = parser.get_nodes_from_documents(json_docs)
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.2, "top_k": 5, "top_p": 0.95},
    device_map="auto",
)

embed_model = resolve_embed_model("local:thenlper/gte-small")

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=200, chunk_overlap=70),
        embed_model,
    ]
)

# Assuming you have already parsed your documents into nodes
nodes = parser.get_nodes_from_documents(json_docs)

# run the pipeline on the parsed nodes
processed_nodes = pipeline.run(nodes=nodes)

# service_context was never defined in the original snippet; one assumption is to
# build it from the local LLM and embedding model so the indexes use them
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

# Index setup (built from the processed nodes; `documents` was undefined originally)
vector_index = VectorStoreIndex(nodes=processed_nodes, service_context=service_context)
# Summary setup
summary_index = SummaryIndex(nodes=processed_nodes, service_context=service_context)

import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#Query Engine
query_engine = vector_index.as_query_engine(response_mode="compact")
response = query_engine.query("How many titles are there?")
display_response(response)