Adding support for scraping Genres, Runtime, Watches, and Likes

L-Dot / Letterboxd-list-scraper

A program that can scrape Letterboxd lists from an input URL. The output CSV or JSON contains information about the film title, release year, director, cast, personal rating, average rating and a lot more.

MIT License


Adding support for scraping Genres, Runtime, Watches, and Likes #3

Closed. DenJackson42 closed this 7 months ago.

DenJackson42 commented 7 months ago

The biggest addition is the ability to scrape each film's genres, runtime, number of watches, and number of likes. These are all new columns added to the film_rows lists that are turned into the resulting dataset.

I couldn't get personal ratings to work unless I gave it a list in detailed view, so I also updated main.py to assume a detailed list is being given (same URL but with /detail on the end). I also updated the README to reflect this change.

Lastly, to fix UnicodeEncodeErrors, I added a UTF-8 encoding argument in csv_writer.py.
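
For readers who hit the same error, here is a minimal sketch of that kind of fix. It assumes the CSV is written with Python's built-in `open()` and the standard `csv` module; the function name `write_rows` and the sample column names are hypothetical and not taken from csv_writer.py. Without an explicit encoding, `open()` falls back to the platform default, which on some systems cannot represent characters found in film titles or names.

```python
import csv


def write_rows(path, film_rows):
    """Hypothetical helper: write a list of rows to a CSV file as UTF-8."""
    # encoding="utf-8" avoids UnicodeEncodeError when the platform default
    # encoding cannot represent a character in the data.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(film_rows)


# Example rows (column names are illustrative only).
example_rows = [
    ["Film_title", "Director"],
    ["Amélie", "Jean-Pierre Jeunet"],
]
```

Calling `write_rows("out.csv", example_rows)` should then write the accented title without raising, regardless of the system's default encoding.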

L-Dot commented 7 months ago

Thank you so much for this! These are nice, minimal solutions to some of the problems and requests that other people brought up and that I was planning to address myself (but never got around to).

I have merged your code into the main branch. Much appreciated, kind stranger :)

DenJackson42 commented 7 months ago

You're welcome! One thing I forgot to mention is that rated-movies.csv is the list I was testing with. On second thought, I probably should have used the IMDb top 250 list like you did and uploaded just that, so there aren't two examples now. A possible next step would be deleting rated_movies.csv and updating imdb-top-250.csv. Not a huge deal, just something I forgot.

L-Dot commented 7 months ago

I just uploaded version 1.1 of the app, including a new imdb-top-250.csv!

One thing I did change was the use of /detail: the code now adds it itself (only if the provided list is not a watchlist), so the user no longer has to include it in the URL. This keeps the user side as simple as possible, which I am a fan of. The rest of your changes are still in there 😄.
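
To illustrate the /detail behavior described above, here is a rough sketch. It is not the code from version 1.1; the helper name `to_detail_url` is hypothetical, and it assumes that watchlist URLs end in `/watchlist/` and that the detailed view of a list lives at the list URL plus `/detail/`.

```python
def to_detail_url(list_url: str) -> str:
    """Hypothetical helper: return the URL the scraper should request.

    Appends /detail/ to a list URL unless it is a watchlist or already
    points at the detailed view (both URL shapes are assumptions).
    """
    base = list_url.rstrip("/")
    if base.endswith("/watchlist") or base.endswith("/detail"):
        # Leave watchlists and already-detailed URLs as they are,
        # only normalizing the trailing slash.
        return base + "/"
    return base + "/detail/"
```

With this approach the user can paste the plain list URL, with or without a trailing slash, and the same detailed page is requested either way.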