# Enhancement: Integrate Open-Source LLM for Movie Information Retrieval

## Description

Enhance the existing web crawler to use an open-source large language model (LLM) to fetch and display detailed movie information based on user input. The information should include:

- Movie summary
- Reviews
- Runtime
- Reasons to watch

Provide an option for the user to choose which LLM to use for the search.
## Tasks

1. **Integrate an open-source LLM API:**
   - Use an open-source LLM such as LLaMA or Mistral to fetch movie information.
   - Create a function that queries the chosen LLM with the movie name and retrieves the required details.
2. **Create a user input interface:**
   - Implement a simple terminal-based prompt for users to enter the movie name and choose the LLM.
   - Validate the input to ensure it is not empty (see the input-validation sketch after this list).
3. **Fetch initial data using the web crawler:**
   - Use the existing web crawler to fetch initial data such as the movie URL, basic info, and reviews.
   - Pass this data as context to the LLM to enhance its response.
4. **Fetch and display movie information:**
   - Use the chosen LLM to produce the movie summary, reviews, runtime, and reasons to watch.
   - Display the fetched information in a user-friendly format.
5. **Surprise enhancement (movie recommendations):**
   - Use the LLM to generate a list of similar movies based on the user's input.
   - Display the recommended movies along with the fetched information.
6. **Update `requirements.txt`:**
   - Add the `transformers` library to `requirements.txt`.
7. **Create `README.md`:**
   - Add setup and run instructions to a new `README.md` file.
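A minimal sketch of the input validation described in task 2, assuming hypothetical helpers `prompt_non_empty` and `prompt_llm_choice` (neither exists in the codebase yet); the accepted LLM names mirror the choices used in the example code below:

```python
# Hypothetical helpers for task 2: re-ask until the input is usable.
SUPPORTED_LLMS = {'llama', 'mistral'}


def prompt_non_empty(message: str) -> str:
    """Prompt repeatedly until the user enters a non-empty value."""
    while True:
        value = input(message).strip()
        if value:
            return value
        print("Input cannot be empty, please try again.")


def prompt_llm_choice() -> str:
    """Prompt for the LLM choice and validate it against the supported set."""
    while True:
        choice = prompt_non_empty("Enter the LLM to use (llama/mistral): ").lower()
        if choice in SUPPORTED_LLMS:
            return choice
        print(f"Unsupported LLM '{choice}', choose one of: {', '.join(sorted(SUPPORTED_LLMS))}.")
```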
## Implementation Details

**File:** `src/movie_info.py`

**Function to add:** `get_movie_info_from_llm(movie_name: str, initial_data: dict, llm_choice: str) -> dict`

This function queries the chosen LLM with the movie name and the initial crawler data, and returns a dictionary with the keys `summary`, `reviews`, `runtime`, `reasons_to_watch`, and `recommendations`.

**Example code:**
```python
import re

import requests
from bs4 import BeautifulSoup
from transformers import pipeline


def get_initial_movie_data(movie_name: str) -> dict:
    """Fetch initial movie data using the existing web-crawler approach."""
    # NOTE: the selectors below depend on Rotten Tomatoes' current markup
    # and may need adjusting if the site changes.
    response = requests.get(
        "https://www.rottentomatoes.com/search",
        params={"search": movie_name},
        timeout=10,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'lxml')

    first_result = soup.find('search-page-media-row')
    if first_result is None:
        raise ValueError(f"No search results found for '{movie_name}'")
    movie_url = first_result.find('a').get('href')
    if not movie_url.startswith('http'):
        movie_url = f"https://www.rottentomatoes.com{movie_url}"

    movie_page = requests.get(movie_url, timeout=10)
    movie_soup = BeautifulSoup(movie_page.content, 'lxml')

    summary_tag = movie_soup.find('div', {'class': 'movie_synopsis'})
    runtime_tag = movie_soup.find('time')
    reviews = [review.text.strip() for review in movie_soup.find_all('blockquote')]
    return {
        'url': movie_url,
        'summary': summary_tag.text.strip() if summary_tag else '',
        'reviews': reviews,
        'runtime': runtime_tag.text.strip() if runtime_tag else '',
    }


def get_movie_info_from_llm(movie_name: str, initial_data: dict, llm_choice: str) -> dict:
    """Query the chosen LLM and return a dict with the required keys."""
    if llm_choice == 'llama':
        model_name = 'meta-llama/Meta-Llama-3.1-8B'
    elif llm_choice == 'mistral':
        model_name = 'mistralai/Mistral-Large-Instruct-2407'
    else:
        raise ValueError("Unsupported LLM choice")

    generator = pipeline('text-generation', model=model_name)
    prompt = (
        f"Using the following initial data about the movie {movie_name}:\n"
        f"Summary: {initial_data['summary']}\n"
        f"Reviews: {initial_data['reviews']}\n"
        f"Runtime: {initial_data['runtime']}\n"
        "Provide a detailed summary, additional reviews, the runtime, reasons to "
        "watch, and similar movie recommendations. Label the sections 'Summary:', "
        "'Reviews:', 'Runtime:', 'Reasons to watch:' and 'Recommendations:'."
    )
    generated = generator(prompt, max_new_tokens=500, return_full_text=False)[0]['generated_text']

    def section(label: str, default=''):
        """Extract one labelled section from the generated text."""
        pattern = rf"{re.escape(label)}:\s*(.*?)(?=\n\s*[A-Z][A-Za-z ]+:|\Z)"
        match = re.search(pattern, generated, re.S | re.I)
        return match.group(1).strip() if match else default

    return {
        'summary': section('Summary', initial_data['summary']),
        'reviews': section('Reviews', initial_data['reviews']),
        'runtime': section('Runtime', initial_data['runtime']),
        'reasons_to_watch': section('Reasons to watch'),
        'recommendations': section('Recommendations'),
    }


if __name__ == "__main__":
    movie_name = input("Enter the movie name: ").strip()
    llm_choice = input("Enter the LLM to use (llama/mistral): ").strip().lower()
    if movie_name and llm_choice:
        initial_data = get_initial_movie_data(movie_name)
        movie_info = get_movie_info_from_llm(movie_name, initial_data, llm_choice)
        print(f"Summary: {movie_info['summary']}")
        print(f"Reviews: {movie_info['reviews']}")
        print(f"Runtime: {movie_info['runtime']}")
        print(f"Reasons to Watch: {movie_info['reasons_to_watch']}")
        print(f"Recommendations: {movie_info['recommendations']}")
    else:
        print("Please enter a valid movie name and LLM choice.")
```
**Update `requirements.txt`:**

```text
beautifulsoup4
requests
lxml
transformers
```
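Note that the `transformers` text-generation pipeline also needs a deep-learning backend to load these models locally; if the project does not already depend on one, PyTorch (`pip install torch`) will likely need to be added as well, and the larger checkpoints require substantial GPU memory.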
**Create `README.md`:**

# Web Crawler with LLM Integration

This project is a web crawler that fetches movie information and enhances it using a large language model (LLM) to provide detailed summaries, reviews, runtime, reasons to watch, and recommendations.
## Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/web_crawler.git
   cd web_crawler
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up model access or API keys (if required):
   - Gated Hugging Face checkpoints such as the `meta-llama` models require accepting the model license and logging in with a Hugging Face access token (`huggingface-cli login`).
   - If you use a hosted LLM API instead (for example OpenAI), set the corresponding environment variable such as `OPENAI_API_KEY`.
## Usage

1. Run the script:

   ```bash
   python src/movie_info.py
   ```

2. Enter the movie name and choose the LLM (llama/mistral) when prompted.
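For example (the movie title below is only an illustration; the prompts come from `src/movie_info.py`):

```bash
$ python src/movie_info.py
Enter the movie name: Inception
Enter the LLM to use (llama/mistral): llama
```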
## Notes

- Make sure to handle API errors and edge cases where the movie information might not be available.
- Consider adding unit tests for the new functionality (a sketch follows below).
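As a starting point for the unit-test note above, a minimal sketch using `pytest` and `unittest.mock`; it assumes the module is importable as `src.movie_info` and patches `transformers.pipeline` there so no model weights are downloaded (the test data is made up):

```python
# test_movie_info.py -- minimal sketch; file name and import path are assumptions.
from unittest.mock import patch

import pytest

from src.movie_info import get_movie_info_from_llm

INITIAL_DATA = {'summary': 'A heist in dreams.', 'reviews': ['Great!'], 'runtime': '2h 28m'}


def test_unsupported_llm_choice_raises():
    # The function should reject anything other than 'llama' or 'mistral'.
    with pytest.raises(ValueError):
        get_movie_info_from_llm("Inception", INITIAL_DATA, "gpt-42")


def test_llm_output_is_parsed_into_sections():
    fake_output = [{'generated_text': (
        "Summary: A thief steals secrets through dreams.\n"
        "Reasons to watch: Inventive premise.\n"
        "Recommendations: Interstellar, Memento"
    )}]
    # Patch the pipeline factory imported in src/movie_info.py so the test
    # never loads a real model; the fake generator ignores its arguments.
    with patch('src.movie_info.pipeline') as mock_pipeline:
        mock_pipeline.return_value = lambda *args, **kwargs: fake_output
        info = get_movie_info_from_llm("Inception", INITIAL_DATA, "llama")
    assert info['recommendations'] == "Interstellar, Memento"
    # Sections the model omitted fall back to the crawler data.
    assert info['runtime'] == '2h 28m'
```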