Open ksachdeva11 opened 11 months ago
Thank you for sharing this! The LLM indeed plays a role in which keywords are extracted and whether or not they are present in the original document. However, the main culprit here is the prompt itself. By tweaking the prompt, you can ask the LLM to extract only keywords that are literally found in the text and not to come up with different ones.
I would advise looking at the documentation here which illustrates this with an example.
Got it, thank you for your quick response!
Hi Maarten,
For some reason, when I use check_vocab to get only words that appear in the documents, with the exact same code as in the documentation, I get different results. Here is what I get:
[[], [], ['Meta released', "LLaMA's model"]]
Is there anything that can explain that result?
@Bolive84 Could you share your full code? Without it, it is difficult to say what exactly is happening here.
Hi @MaartenGr, thanks for your reply. The code I use is the one provided in the tutorial (I am just masking my API key for security reasons):
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
# Create your LLM
openai.api_key = xxxx
prompt = """
I have the following document:
[DOCUMENT]
Based on the information above, extract the keywords that best describe the topic of the text.
Make sure to only extract keywords that appear in the text.
Use the following format separated by commas:
<keywords>
"""
llm = OpenAI()
# Load it in KeyLLM
kw_model = KeyLLM(llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents, check_vocab=True)
keywords
@Bolive84 It might just be that OpenAI tends not to extract the exact keywords that appear in the text. Could you try with and without check_vocab=True to see the difference between the outputs?
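To make that comparison easier to read, here is a minimal sketch of a literal-match check you could run on the raw keywords yourself. It is not KeyBERT's internal check_vocab logic. The function names, the sample document, and the sample keywords are all hypothetical, and the match is assumed to be a simple case-insensitive substring test.

```python
def keywords_in_text(document, keywords):
    """Keep only keywords that occur verbatim (case-insensitively) in the document."""
    text = document.lower()
    return [kw for kw in keywords if kw.strip() and kw.strip().lower() in text]


def missing_keywords(document, keywords):
    """Return the keywords that do not occur verbatim in the document."""
    kept = set(keywords_in_text(document, keywords))
    return [kw for kw in keywords if kw not in kept]


# Hypothetical document and LLM output, for illustration only.
document = "The new model was trained on publicly available data."
llm_keywords = ["publicly available data", "open-source model", "trained"]

print(keywords_in_text(document, llm_keywords))  # keywords found in the text
print(missing_keywords(document, llm_keywords))  # keywords the LLM made up
```

If most of the raw keywords end up in the second list, the empty results with check_vocab=True would be explained by the LLM paraphrasing rather than copying from the text.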
KeyLLM seems to be extracting keywords which are not even present in the document used. I am following the steps mentioned in this article - https://towardsdatascience.com/introducing-keyllm-keyword-extraction-with-llms-39924b504813
I am using the Mistral 7B model.
Output -
It seems to be extracting similar words even though they are not present in the original document. Could this be a model-specific issue?