KeyLLM keyword extraction issue

ksachdeva11 commented 11 months ago

KeyLLM seems to be extracting keywords which are not even present in the document used. I am following the steps mentioned in this article - https://towardsdatascience.com/introducing-keyllm-keyword-extraction-with-llms-39924b504813

I am using Mistral 7B model.

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)

from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

documents = [
"As discussed above, for the training set, finer-grained instances in the training set are generally better than coarser-grained ones. This preference does not apply to classification time, i.e. the use of the classifier in the field. We should go ahead and predict the sentiment of whatever text we are given, be it a sentence or a chapter.",
"I received my package!",
"You clearly want to know what is being complained about and what is being liked."
]

keywords = kw_model.extract_keywords(documents); keywords

Output -

[['discussed',
  'above',
  'finer-grained',
  'instances',
  'training',
  'set',
  'better',
  'coarser-grained',
  'preference',
  'applies',
  'classification',
  'time',
  'field',
  'predict',
  'sentiment',
  'text',
  'sentence',
  'chapter.'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'mail',
  'courier',
  'product',
  'order',
  'online',
  'store'],
 ['complained',
  'liked',
  'want',
  'know',
  'clear',
  'understand',
  'specific',
  'detail',
  'issue',
  'problem',
  'feedback',
  'opinion',
  'satisfaction',
  'enjoyment',
  'appreciation',
  'preference',
  'dislike',
  'dissatisfaction',
  'negative',
  'positive',
  'favorable',
  'unf']]

It seems to be extracting similar words even though they are not present in the original document. Seems like model specific issue?

MaartenGr commented 11 months ago

Thank you for sharing this! The LLM indeed plays a role in extracting the type of keywords, whether they are present or not in the original document. However, the main culprit here is the prompt in itself. By tweaking the prompt you can ask the LLM to only extract keywords that are literally found in the text and not to come up with different ones.

I would advise looking at the documentation here which illustrates this with an example.

ksachdeva11 commented 11 months ago

Got it.. thank you for your quick response!

Bolive84 commented 11 months ago

Hi Maarten,

For some reason, when using check_vocab to get words that appear in the documents, and the exact same code as in the documentation I receive different results, here is what I get:

[[], [], ['Meta released', "LLaMA's model"]]

Is there anything that can explain that result?

MaartenGr commented 11 months ago

@Bolive84 Could you share your full code? Without it, it is difficult to say what exactly is happening here.

Bolive84 commented 11 months ago

Hi @MaartenGr, thanks for your reply, the code I use is the one that is provided on the tutorial (just masking my API key for security reasons):

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = xxxx

prompt = """
I have the following document:
[DOCUMENT]

Based on the information above, extract the keywords that best describe the topic of the text.
Make sure to only extract keywords that appear in the text.
Use the following format separated by commas:
<keywords>
"""
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, check_vocab=True)
keywords

MaartenGr commented 10 months ago

@Bolive84 It might just be that OpenAI tends not to extract the exact keywords that appear in the text. Could you try with and without check_vocab=True to see the difference between output?

MaartenGr / KeyBERT

KeyLLM keyword extraction issue #183