Tpj-root / site_cloner


Create a record on already downloaded pages #2

Open tshrinivasan opened 2 days ago

tshrinivasan commented 2 days ago

As this is a long-running script, it may be disconnected at any time.

There will also be updates on the site over time.

To avoid repeatedly downloading the entire content, once a page has been downloaded, add its URL to a text file, "downloaded_urls.txt".

Before downloading any page, first check that file. If the URL is already there, go to the next page; if not, download the page and append its URL to the bottom of the file.
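A minimal sketch of that record-keeping, assuming a `download_page` callable that does the actual fetch (hypothetical here; the file name is the one proposed above):

```python
import os

DOWNLOADED_LIST = "downloaded_urls.txt"

def load_downloaded(path=DOWNLOADED_LIST):
    """Return the set of URLs already recorded, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_downloaded(url, path=DOWNLOADED_LIST):
    """Append a URL to the record file."""
    with open(path, "a") as f:
        f.write(url + "\n")

def fetch_if_new(url, download_page, path=DOWNLOADED_LIST):
    """Skip URLs already in the record; otherwise download and record."""
    if url in load_downloaded(path):
        return False  # already have it, go to the next page
    download_page(url)
    mark_downloaded(url, path)
    return True
```

If the script is killed mid-run, the file already contains everything fetched so far, so a restart resumes where it left off.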

Tpj-root commented 2 days ago

BD

I follow three rules.

  1. Information Gathering (Full_site_dump)
  2. Scanning (lyrics, music_directors_list, singers_list, movies_name_list)
  3. Extraction (html2txt)

We first need to download all the film-lyrics HTML files and keep a hard-drive backup of the full data. Then we apply an offline script, for example html2txt, to gather information such as music directors' names.

This is because the data contains a lot of information, including:

  * movie names
  * music directors' names
  * singers' names
  * song titles
  * lyrics
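The html2txt extraction step can be as simple as stripping tags with the standard library. A minimal sketch (real pages would need field-specific parsing to separate movie names, directors, singers, titles, and lyrics; this only flattens HTML to text):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text content, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html2txt(html):
    """Flatten an HTML string to its plain-text content."""
    parser = _TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)
```

Because the extraction runs offline against the saved dump, it can be re-run with better parsing later without touching the website again.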

Your suggestion is a recursive mode: one movie → the 1st song's webpage → lyrics. In that mode we lose a lot of information. So, download all the information first, then run the offline script.

I guess you haven't tried my script yet?

tshrinivasan commented 1 day ago

// We need to first download all the film lyrics HTML files and store a hard drive backup for the full data.//

The full download may take time as per the README.

When the site updates its content with, say, 10 new movie pages, why do we have to sync the entire website again?

How do we avoid re-downloading the entire content again and again when the site is updated?

Share your thoughts.

Tpj-root commented 1 day ago

Sync for new data: how do we find the name of a new film?

That site has three main blocks.

The first one is the Movies List

https://www.tamil2lyrics.com/movie/
https://www.tamil2lyrics.com/music-directors-list/
https://www.tamil2lyrics.com/singers-list/

The output data gathering script works based on the movie name list.

e.g.: movie → song lyrics

URLs:

https://www.tamil2lyrics.com/movie/
https://www.tamil2lyrics.com/movie/page/1/
https://www.tamil2lyrics.com/movie/page/2/

to

https://www.tamil2lyrics.com/movie/page/271/

All films are arranged in alphabetical order. Each page lists 15 movies.

https://www.tamil2lyrics.com/movie/page/1/ == 15 movie_names
.
.
https://www.tamil2lyrics.com/movie/page/270/ == 15 movie_names
https://www.tamil2lyrics.com/movie/page/271/ == 11 movie_names

Therefore, if the last page's movie count is no longer 11, new movies have been added.

But how do we find that movie?

Because we need the lyrics for that movie only.

I have already mentioned it.

For python3 main_1.py:

The script makes one URL request per page, there are about 270 pages, and each request takes 10 seconds.
Therefore, the total runtime required is approximately 45 minutes.
The code has already been run to collect all the film URLs in movie_urls.txt.
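Sticking to the thread's numbers (271 pages, ~10 s per request), the paginated crawl could be sketched like this. The regex for individual movie links is an assumption about the site's markup, not something verified here, and `fetch` is injectable so the loop can be tested offline:

```python
import re
import time
import urllib.request

BASE = "https://www.tamil2lyrics.com/movie/page/{}/"
LAST_PAGE = 271  # per the thread: pages 1..271, 15 movies each (11 on the last)

def collect_movie_urls(last_page=LAST_PAGE, delay=10, fetch=None):
    """Walk the paginated movie list and collect movie-page links.

    The link pattern below (/movie/<slug>/, excluding /movie/page/...)
    is a hypothetical guess at the markup.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read().decode("utf-8", "replace")
    urls = []
    for page in range(1, last_page + 1):
        html = fetch(BASE.format(page))
        urls.extend(re.findall(r'https://www\.tamil2lyrics\.com/movie/(?!page/)[\w-]+/', html))
        time.sleep(delay)  # be polite to the server between requests
    return sorted(set(urls))
```

The de-duplicated result can then be written to movie_urls.txt, matching the workflow described above.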

Every weekend, we scan the 271 pages and filter for new movies. We already have the total movie list, so we remove duplicate movies.

seq 1 10 > old
seq 5 15 > new

grep -Fxv -f old new
Output:
11
12
13
14
15
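Since the scripts are Python, the same fixed-string, whole-line set difference as `grep -Fxv -f old new` can be done in-process:

```python
def new_entries(old, new):
    """Return the lines of `new` that do not appear in `old`,
    preserving order -- equivalent to `grep -Fxv -f old new`."""
    seen = set(old)
    return [line for line in new if line not in seen]
```

Applied to the movie lists, `old` is the saved movie_urls.txt and `new` is the freshly scraped list; the result is exactly the movies that still need downloading.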

Then download the lyrics for those new movies only. The time cost is 45 minutes every week; that's my theoretical conclusion.