MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

KeyLLM - Mistral token issue #204

Open sdspieg opened 4 months ago

sdspieg commented 4 months ago

Any ideas why this code doesn't work? I was trying to use sliding windows, but it seems Mistral doesn't support that. So I then tried truncating and setting a higher token limit, but that doesn't seem to work either. (Sorry for all the print statements; I'm just trying to debug this.)

import pandas as pd
from ctransformers import AutoModelForCausalLM
import time

# Start measuring overall execution time
overall_start_time = time.time()

# Load your DataFrame with the first 5 rows
print("Loading the DataFrame...")
dataframe_start_time = time.time()
df = pd.read_json('Russia/parliamint_russia_sents_paras.jsonl', lines=True)[:5]  # Adjust path as necessary
dataframe_end_time = time.time()
dataframe_loading_time = dataframe_end_time - dataframe_start_time
print(f"DataFrame Loaded in {dataframe_loading_time:.2f} seconds\n")

# Load the model
print("Loading the model...")
model_start_time = time.time()
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF", model_file="mistral-7b-v0.1.Q4_K_M.gguf", model_type="mistral", gpu_layers=50)
model_end_time = time.time()
model_loading_time = model_end_time - model_start_time
print(f"Model Loaded in {model_loading_time:.2f} seconds\n")

# Define your example and keyword prompts
example_prompt = """
<s>[INST]
I have the following document:
- Europe, ladies and gentlemen, the Community of European States, is multicultural. That is a fact, a circumstance that is to be accepted. And I think more: We are currently experiencing an ethnic-national earthquake that is changing Europe's political map more than the two world wars have done. 

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords (noun phrases of 1, 2, or 3 words) and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] Europe, ladies, gentlemen, Community of European States, multicultural earthquake, ethnic-national earthquake, map, political map, Europe's political map, world war</s>
"""

keyword_prompt_template = """
[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords (noun phrases of 1, 2, or 3 words) and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

# Token limit used to size the prompt (note: applied below as a character count)
max_new_tokens = 4096

# Initialize a list to store execution times
execution_times = []
generated_keywords = []  # To store generated keywords

# Process the first 5 rows
for i in range(5):
    # Select one row from the DataFrame
    row = df.iloc[i]

    # Extract the paragraph text
    para_text = row['para_text']

    # Truncate the paragraph so the combined prompt fits the limit (this slices by characters, not tokens)
    para_text = para_text[:max_new_tokens - len(example_prompt) - len(keyword_prompt_template)]

    # Construct the full prompt
    full_prompt = example_prompt + keyword_prompt_template.replace("[DOCUMENT]", para_text)

    # Print that processing is starting for this example
    print(f"Processing Example {i + 1}...\n")

    # Print inputs
    print(f"Input Paragraph Text {i + 1}:\n{para_text}\n")
    print(f"Full Prompt {i + 1}:\n{full_prompt}\n")

    # Measure the time it takes to generate a response
    start_time = time.time()
    response = llm(full_prompt)
    end_time = time.time()
    execution_time = end_time - start_time
    execution_times.append(execution_time)

    # Print the generated response
    print(f"Generated Response {i + 1}:\n{response}\n")
    print(f"Execution Time {i + 1}: {execution_time:.2f} seconds\n")

    # Extract keywords from the response (you may need to modify this part)
    # For now, we'll simply split the response into keywords
    keywords = response.split(", ")
    generated_keywords.append(keywords)

    # Print that processing for this example has finished
    print(f"Processing for Example {i + 1} finished\n")

# Calculate and print the total execution time
total_execution_time = sum(execution_times)
print(f"Total Execution Time for 5 rows: {total_execution_time:.2f} seconds")

# Measure overall execution time
overall_end_time = time.time()
overall_execution_time = overall_end_time - overall_start_time
print(f"Overall Execution Time: {overall_execution_time:.2f} seconds")

# Display the generated keywords for each example
for i, keywords in enumerate(generated_keywords):
    print(f"Generated Keywords for Example {i + 1}: {', '.join(keywords)}\n")

This is what I get. It also takes about two minutes per input, and I'm not sure whether that's only because of the errors or not.

Model Loaded in 5.73 seconds

Processing Example 1...

Input Paragraph Text 1:
The dead – according to the results so far – are no longer in this place for about 50 years. Because of the bone remains, it must have been about 20 to 22 years of age, namely persons of male sex. The upper and lower jaw remains indicate that they are probably not refugees or prisoners of war from Russia because their teeth were usually in worse condition, as I am told.

Full Prompt 1:

<s>[INST]
I have the following document:
- Europe, ladies and gentlemen, the Community of European States, is multicultural. That is a fact, a circumstance that is to be accepted. And I think more: We are currently experiencing an ethnic-national earthquake that is changing Europe's political map more than the two world wars have done. 

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords (noun phrases of 1, 2, or 3 words) and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] Europe, ladies, gentlemen, Community of European States, multicultural earthquake, ethnic-national earthquake, map, political map, Europe's political map, world war</s>

[INST]
I have the following document:
- The dead – according to the results so far – are no longer in this place for about 50 years. Because of the bone remains, it must have been about 20 to 22 years of age, namely persons of male sex. The upper and lower jaw remains indicate that they are probably not refugees or prisoners of war from Russia because their teeth were usually in worse condition, as I am told.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you only return the keywords (noun phrases of 1, 2, or 3 words) and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]

Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).

Any suggestions? Or are there maybe other (newer?) models that would allow for sliding windows? Thanks!

MaartenGr commented 4 months ago

According to the ctransformers documentation, I think you will need to use the context_length parameter to increase the context length.
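
For reference, a minimal sketch of what that could look like, assuming (per the ctransformers README) that context_length and max_new_tokens can be passed directly to from_pretrained as config keyword arguments; the values below are only examples:

from ctransformers import AutoModelForCausalLM

# Raise the context window above the 512-token default reported in the log above
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-v0.1-GGUF",
    model_file="mistral-7b-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    context_length=4096,  # maximum number of prompt tokens the model will accept
    max_new_tokens=256,   # cap on generated tokens; a keyword list should stay short
)

print(llm("[INST] Give me three keywords for: Europe is multicultural. [/INST]"))

Note also that the truncation in the script above slices by characters (len(...)), not tokens, so it won't reliably keep the prompt under a token limit; if truncation is still needed, counting tokens with llm.tokenize(...) should be more accurate than counting characters.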

Doflamingoos commented 3 weeks ago

@sdspieg Was a solution ever found for this? I'm having the same problem.