tshrinivasan opened 2 days ago
I follow three rules.
We first need to download all the film-lyrics HTML pages and store a backup of the full data on a hard drive. Then we run an offline script (for example, html2txt) to extract information such as music directors' names.
This is because the data contains a lot of information, including:
movie names
music directors' names
singers' names
song titles
lyrics.
Your suggestion of recursive mode (one movie → first song's webpage → lyrics) loses a lot of information. So download all the information first, then run the offline script.
I guess you haven't tried my script yet?
// We need to first download all the film lyrics HTML files and store a hard drive backup for the full data.//
The full download may take time as per the README.
When the site updates its content with, say, 10 new movie pages, why should we sync the entire website again?
How can we avoid re-downloading the entire content every time the site is updated?
Share your thoughts.
That site has three main blocks:
Movies list: https://www.tamil2lyrics.com/movie/
Music directors list: https://www.tamil2lyrics.com/music-directors-list/
Singers list: https://www.tamil2lyrics.com/singers-list/
The data-gathering script works from the movie name list.
ex: movie → song lyrics
URLs: https://www.tamil2lyrics.com/movie/page/1/
https://www.tamil2lyrics.com/movie/page/2/
to
https://www.tamil2lyrics.com/movie/page/271/
All films are arranged in alphabetical order, and each page lists 15 movies:
https://www.tamil2lyrics.com/movie/page/1/ == 15 movie_names
...
https://www.tamil2lyrics.com/movie/page/270/ == 15 movie_names
https://www.tamil2lyrics.com/movie/page/271/ == 11 movie_names
Therefore, if the last page's movie count is no longer 11, new movies have been added.
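A sketch of that last-page check. The link pattern below is an assumption about the listing markup, not verified against the real site, so the regex would need adjusting:

```python
import re

# Assumed link pattern for one movie entry on a listing page; the real
# markup of tamil2lyrics.com may differ, so adjust the regex as needed.
MOVIE_LINK = re.compile(r'href="https://www\.tamil2lyrics\.com/lyrics/[^"]+/"')

def count_movies(listing_html: str) -> int:
    """Count movie entries on one listing page."""
    return len(MOVIE_LINK.findall(listing_html))

def last_page_changed(listing_html: str, known_count: int = 11) -> bool:
    """True when the last page no longer holds exactly `known_count`
    movies, i.e. the site has added new entries since the last sync."""
    return count_movies(listing_html) != known_count
```

Fetching only page 271 and running this check is one cheap HTTP request instead of a full crawl.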
Because we need the lyrics for that movie only.
I have already mentioned it.
For python3 main_1.py:
The code fetches 270 listing pages, one URL request per page, and each request takes about 10 seconds.
Therefore, the total runtime required is approximately 45 minutes.
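A quick sanity check on that arithmetic (270 listing pages, one request per page at ~10 s each):

```python
pages = 270               # listing pages fetched per full scan
seconds_per_request = 10  # observed delay per request
total_minutes = pages * seconds_per_request / 60
print(total_minutes)      # 45.0
```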
The code has already been run to collect all the film URLs in movie_urls.txt.
Every weekend, we scan the 271 pages and filter for new movies. Since we already have the total movie list, we remove the duplicate movies.
seq 1 10 > old
seq 5 15 > new
grep -Fxv -f old new
The output is:
11
12
13
14
15
Then download the lyrics only for those new movies. The time taken is 45 minutes every week; that's my theoretical estimate.
As this is a long-running script, it may be disconnected at any time.
There will be updates on the site too.
To avoid repeatedly downloading the entire content: once a page is downloaded, add its URL to a text file, "downloaded_urls.txt".
Before downloading any page, first check that file. If the URL is already there, go to the next page; if not, download the page and append its URL to the bottom of the file.