L-Dot / Letterboxd-list-scraper

A program that can scrape Letterboxd lists from an input URL. The output CSV or JSON contains information about the film title, release year, director, cast, personal rating, average rating and a lot more.
MIT License

Letterboxd-list-scraper

A tool for scraping Letterboxd lists from a simple URL. The output is a file with film titles, release year, director, cast, owner rating, average rating and a whole lot more (see example CSVs and JSONs in /example_output/).

Version v2.2.0 supports the scraping of:

The current scrape rate is about 1.2 films per second. Multiple lists can be scraped concurrently on separate CPU threads (the default maximum is 4 threads, and this is configurable).
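As a rough illustration of the numbers above, the sketch below estimates how long a list takes at the quoted rate and shows one common way to run several list scrapes on a bounded thread pool. It is not the project's actual implementation: `scrape_list`, `scrape_all` and `estimate_seconds` are hypothetical names, and `scrape_list` is only a placeholder for the real per-list work.

```python
from concurrent.futures import ThreadPoolExecutor

SCRAPE_RATE = 1.2  # films per second, the approximate rate quoted above
MAX_THREADS = 4    # default maximum number of threads quoted above


def estimate_seconds(film_count, rate=SCRAPE_RATE):
    """Rough time in seconds to scrape one list at the quoted rate."""
    return film_count / rate


def scrape_list(url):
    """Hypothetical placeholder for the real per-list scraping work."""
    return f"scraped {url}"


def scrape_all(urls, max_threads=MAX_THREADS):
    """Scrape every list, with at most max_threads running at once."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map() yields results in the same order as the input URLs.
        return list(pool.map(scrape_list, urls))
```

At this rate a 120-film list takes roughly 100 seconds, so `estimate_seconds` gives a quick sanity check before starting a long run.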

Getting Started

Dependencies

Requires Python 3.x, numpy, BeautifulSoup (bs4), requests, tqdm and lxml.

If any dependencies are missing, install everything in one go with pip install -r requirements.txt (ideally in a clean virtual environment).
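To check which of the listed packages are already available before installing, a small script along these lines can help. This is a minimal sketch, not part of the project: `missing_modules` is a hypothetical helper, and the module names are the import names implied by the dependency list above.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above (BeautifulSoup imports as bs4).
REQUIRED_MODULES = ["numpy", "bs4", "requests", "tqdm", "lxml"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install -r requirements.txt")
    else:
        print("All dependencies found.")
```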

Installing

Executing program

[!NOTE] Please use python -m listscraper --help for a full list of all available flags including extensive descriptions on how to use them.

[!TIP] The easiest way to scrape multiple lists is to run python -m listscraper -f <file> with a custom .txt file that contains one URL per line. Each line can take its own optional -p and -on flags. For an example of such a file, see target_lists.txt.

[!IMPORTANT] The program does not currently support scraping extremely long generic Letterboxd pages (e.g. https://letterboxd.com/films/popular/this/week/genre/documentary/, which contains ~152000 films). To work around this, use the -p flag to select a smaller range of pages.

TODO

Authors

Arno Lafontaine

Acknowledgments

Thanks to BBottoml for the inspiration for this project: https://github.com/BBottoml/Letterboxd-friend-ranker.