fabriziosalmi / blacklists

Hourly updated domains blacklist 🚫
https://github.com/fabriziosalmi/blacklists/releases/download/latest/blacklist.txt
GNU General Public License v3.0
133 stars 6 forks

http cache tips #39

Closed fabriziosalmi closed 1 year ago

fabriziosalmi commented 1 year ago

After reviewing the linked README, I see you want to efficiently check whether a remote file has changed before deciding to fetch it. Here's how you can improve the existing approach:

Using ETag and Last-Modified Headers

Many web servers send ETag and Last-Modified headers with a response; these act as validators for the content. If you save their values and send them back on later requests (the ETag value in If-None-Match, the Last-Modified value in If-Modified-Since), the server can answer 304 Not Modified with an empty body when nothing has changed, so the file is only downloaded when it is actually different.
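To make the mechanism concrete, here is a minimal sketch of the decision a server makes when it receives a conditional GET. It is a simplified illustration, not any particular server's code: the function name `is_not_modified` is hypothetical, the ETag list is split on commas (which ignores the rare case of commas inside a tag), and `last_modified` is assumed to be a timezone-aware UTC datetime with whole seconds.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    # If-None-Match uses weak comparison, so a leading W/ is ignored
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True if the server could answer 304 Not Modified (hypothetical helper)."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present it takes precedence over If-Modified-Since
        if if_none_match.strip() == "*":
            return True
        candidates = {_strip_weak(t) for t in if_none_match.split(",")}
        return _strip_weak(etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            # An unparseable date is ignored and the full response is sent
            return False
        return last_modified <= since

    # No validators sent: always return the full content
    return False


if __name__ == "__main__":
    modified = datetime(2023, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(is_not_modified({"If-None-Match": '"abc"'}, '"abc"', modified))
```

The client never needs to interpret the validators itself: it stores them as opaque strings and echoes them back, and the comparison happens on the server.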

In your script, this comes down to two steps:

  1. Save the ETag and Last-Modified headers (if they exist) after fetching a file.
  2. On the next run, send a request with these headers' values to see if the file has changed.

Here's a sample Python script that demonstrates this:

import requests
import os

# File URL
URL = "https://get.domainsblacklists.com/blacklist.txt"

# Headers file
HEADERS_FILE = "headers.txt"

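def get_saved_headers():
    # Build conditional request headers from the values saved on the previous run
    if not os.path.exists(HEADERS_FILE):
        return {}
    with open(HEADERS_FILE, 'r') as f:
        etag = f.readline().strip()
        last_modified = f.readline().strip()
    headers = {}
    # Only send a validator the server actually provided; skip empty values
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers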

def save_headers(response_headers):
    with open(HEADERS_FILE, 'w') as f:
        f.write(response_headers.get('ETag', '') + "\n")
        f.write(response_headers.get('Last-Modified', '') + "\n")

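def fetch_blacklist_txt():
    headers = get_saved_headers()
    # A timeout keeps the script from hanging on an unresponsive server
    response = requests.get(URL, headers=headers, timeout=30)

    # If status is 304 Not Modified, there's no need to download
    if response.status_code == 304:
        print("File hasn't changed.")
        return

    # Stop on HTTP errors so an error page never overwrites the local file
    response.raise_for_status()

    # Otherwise, save the new file (as raw bytes, to avoid re-encoding) and update headers
    with open("blacklist.txt", "wb") as file:
        file.write(response.content)

    save_headers(response.headers)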

# Rest of your script...

if __name__ == "__main__":
    fetch_blacklist_txt()
    # ... other tasks ...

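This script checks whether the remote file has changed by taking advantage of HTTP caching headers. When the file is unchanged, the server answers with 304 Not Modified and no body, so the check costs only a small request instead of a full download, which saves bandwidth and time on both ends.

This approach is commonly used for optimizing requests and is considered a best practice.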
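As a possible variation, the saved validators could be stored as JSON instead of two positional lines, so a missing header is simply absent rather than an empty line. This is only a sketch: the file name `validators.json` and the helper names `save_validators` and `load_conditional_headers` are assumptions for illustration and are not part of the original script.

```python
import json
from pathlib import Path

VALIDATORS_FILE = Path("validators.json")  # hypothetical file name

# Response header -> request header used to send it back
_CONDITIONAL = {"ETag": "If-None-Match", "Last-Modified": "If-Modified-Since"}


def save_validators(response_headers, path=VALIDATORS_FILE):
    """Persist only the validators the server actually sent."""
    validators = {}
    for name in _CONDITIONAL:
        value = response_headers.get(name)
        if value:
            validators[name] = value
    Path(path).write_text(json.dumps(validators))


def load_conditional_headers(path=VALIDATORS_FILE):
    """Build request headers from saved validators; empty dict if none are usable."""
    try:
        validators = json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    if not isinstance(validators, dict):
        return {}
    return {
        _CONDITIONAL[name]: value
        for name, value in validators.items()
        if name in _CONDITIONAL and value
    }
```

In the script above, `load_conditional_headers()` would take the place of `get_saved_headers()` and `save_validators(response.headers)` the place of `save_headers(response.headers)`.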