bitdruid / python-wayback-machine-downloader

Query and download archive.org as simple as possible.
MIT License

Hello, still the same #4

Closed lostmagicblue closed 3 months ago

lostmagicblue commented 3 months ago

(screenshot: 2024-6-4 8-41-16)

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 169, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect)
  File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 228, in download
    with open(download_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/root/downloader/waybackup_snapshots/bbs.eastsea.com.cn/static/image/common'

This one too:

Traceback (most recent call last):
  File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 169, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect)
  File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 228, in download
    with open(download_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/root/waybackup_snapshots/bbs.tianmu.com/simple'
-bash: Exception: command not found

When you have time, please take a look. Thank you.

Ghost-chu commented 3 months ago

I can also reproduce it with the same error; the workers crash one by one and pages/s seems to drop.

bitdruid commented 3 months ago

one of the problems still seems to be the url parsing.

this snapshot causes the exception: http://web.archive.org/web/20210318001217id_/http://bbs.eastsea.com.cn/static/image/common

a blank page

because the cdx returns ".../common", the parsing handles "common" as a file, but it was added as a folder before (it contains images). i think this problem only arises on the --current download, as it merges multiple versions of the folders.

Ghost-chu commented 3 months ago

one of the problems still seems to be the url parsing.

this snapshot causes the exception: http://web.archive.org/web/20210318001217id_/http://bbs.eastsea.com.cn/static/image/common

a blank page

because the cdx returns ".../common", the parsing handles "common" as a file, but it was added as a folder before (it contains images). i think this problem only arises on the --current download, as it merges multiple versions of the folders.

we could save it to a special file like _____ROOT_____

bitdruid commented 3 months ago

jup, my first thought was to just add an index.html if the folder itself already exists. 🤔

However, after some inspection: the above snapshot contains a redirect to ".../common/", and the parsing would already handle that case and create an index.html. So the problem here is that the current logic uses the snapshot url served by the cdx-server to generate the output path instead of the redirected url.

I'm going to change the output generation to fix this, but the snapshots will still be stored under the timestamp originally sent by the CDX server, as I think this is the best representation of the result served by the CDX server for the user's query.
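
roughly, the index.html fallback looks like this (a minimal sketch with made-up names, not the actual pywaybackup helpers):

import os

def resolve_output_file(base_dir: str, url_path: str) -> str:
    # map a snapshot's url path to a local file; if the path ends in "/" or
    # collides with a folder that another snapshot already created, store the
    # page as index.html inside that folder instead of failing with
    # IsADirectoryError on open()
    local = os.path.join(base_dir, url_path.strip("/"))
    if url_path.endswith("/") or os.path.isdir(local):
        return os.path.join(local, "index.html")
    return local

with that, a snapshot of ".../static/image/common" lands in ".../static/image/common/index.html" when the folder already exists.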

lostmagicblue commented 3 months ago

Thank you very much. Looking forward to your good news

bitdruid commented 3 months ago

@Ghost-chu @lostmagicblue please try the new version on pip - does it now work for you as expected? i tried the urls you provided and was able to download everything so far

lostmagicblue commented 3 months ago

It's OK now, thank you. Awesome!!! Amazing!!!

Ghost-chu commented 3 months ago

@Ghost-chu @lostmagicblue please try the new version on pip - does it now work for you as expected? i tried the urls you provided and was able to download everything so far

Thanks, I'm trying the dev branch. As it stands, the URL problem has been fixed.

However, while pulling very large sites, the program stops responding and goes into a dead loop (CPU: 100%). I'm still troubleshooting it.

bitdruid commented 3 months ago

@Ghost-chu @lostmagicblue please try the new version on pip - does it now work for you as expected? i tried the urls you provided and was able to download everything so far

Thanks, I'm trying the dev branch. As it stands, the URL problem has been fixed.

However, while pulling very large sites, the program stops responding and goes into a dead loop (CPU: 100%). I'm still troubleshooting it.

could you provide a snapshot which causes this issue?

Ghost-chu commented 3 months ago

@Ghost-chu @lostmagicblue please try the new version on pip - does it now work for you as expected? i tried the urls you provided and was able to download everything so far

Thanks, I'm trying the dev branch. As it stands, the URL problem has been fixed. However, while pulling very large sites, the program stops responding and goes into a dead loop (CPU: 100%). I'm still troubleshooting it.

could you provide a snapshot which causes this issue?

Unfortunately, the console does not output any error messages; the progress bar just stops updating. The top command shows 100% CPU usage (a single core fully occupied).

Ghost-chu commented 3 months ago

Now I've made some changes to the code to add some checks at locations where loops may occur; hopefully it will indicate where the problem is happening. I am in the process of re-pulling the data from the Wayback Machine and trying to reproduce the issue.

Another thing that I would very much like is for the program to be able to remember files that have been downloaded. The site I am working on has 640281 (current version only) files, so every time the program crashes, everything gets reset.
This obviously wastes time and network traffic, and puts more load on archive.org's servers.

bitdruid commented 3 months ago

good feature. we could use the existing csv or just put every successfully downloaded url into a temp txt
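
roughly, such a download log could look like this (a minimal sketch; the file name and helpers are made up, not pywaybackup's):

import os

SKIP_LOG = "waybackup_downloaded.txt"  # hypothetical file name

def load_skipset(path: str = SKIP_LOG) -> set:
    # read previously completed snapshot urls so a restart can skip them
    if not os.path.isfile(path):
        return set()
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}

def mark_done(url: str, path: str = SKIP_LOG) -> None:
    # append each finished download right away so a crash loses little progress
    with open(path, "a") as f:
        f.write(url + "\n")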

Ghost-chu commented 3 months ago


After running for 50 minutes, it got stuck again; here is a screenshot.

Ghost-chu commented 3 months ago
sudo -E ~/.local/bin/pystack remote 2315089 --locals --no-block
Traceback for thread 2315103 (waybackup) [] (most recent call last):
    (Python) File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
      Arguments:
        self: <Thread at 0x7fcfddd12ca0>
    (Python) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
      Arguments:
        self: <Thread at 0x7fcfddd12ca0>
    (Python) File "/usr/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      Arguments:
        self: <Thread at 0x7fcfddd12ca0>
    (Python) File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 178, in download_loop
        status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{snapshot_batch.index(snapshot)+1}/{len(snapshot_batch)}] - Worker: {worker}"

I hope this can be helpful.

And here is the modified archive.py; I just added some checks and didn't change the code logic.

import requests
import os
import gzip
import threading
import time
import http.client
from urllib.parse import urljoin
from datetime import datetime, timezone

from pywaybackup.helper import url_get_timestamp, url_split, file_move_index

from pywaybackup.SnapshotCollection import SnapshotCollection as sc

from pywaybackup.Verbosity import Verbosity as v

# GET: store page to wayback machine and response with redirect to snapshot
# POST: store page to wayback machine and response with wayback machine status-page
# tag_jobid = '<script>spn.watchJob("spn2-%s", "/_static/",6000);</script>'
# tag_result_timeout = '<p>The same snapshot had been made %s minutes ago. You can make new capture of this URL after 1 hour.</p>'
# tag_result_success = ' A snapshot was captured. Visit page: <a href="%s">%s</a>'
def save_page(url: str):
    """
    Saves a webpage to the Wayback Machine. 

    Args:
        url (str): The URL of the webpage to be saved.

    Returns:
        None: The function does not return any value. It only prints messages to the console.
    """
    v.write("\nSaving page to the Wayback Machine...")
    connection = http.client.HTTPSConnection("web.archive.org")
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    connection.request("GET", f"https://web.archive.org/save/{url}", headers=headers)
    v.write("\n-----> Request sent")
    response = connection.getresponse()
    response_status = response.status

    if response_status == 302:
        location = response.getheader("Location")
        v.write("\n-----> Response: 302 (redirect to snapshot)")
        snapshot_timestamp = datetime.strptime(url_get_timestamp(location), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S')
        current_timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
        timestamp_difference = (datetime.strptime(current_timestamp, '%Y-%m-%d %H:%M:%S') - datetime.strptime(snapshot_timestamp, '%Y-%m-%d %H:%M:%S')).seconds / 60
        timestamp_difference = int(round(timestamp_difference, 0))

        if timestamp_difference < 1:
            v.write("\n-----> New snapshot created")
        elif timestamp_difference > 1:
            v.write(f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes")
            v.write(f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}")
            v.write(f"TIMESTAMP REQUEST : {current_timestamp}")
            v.write(f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes")

        v.write(f"\nURL: {location}")

    elif response_status == 404:
        v.write("\n-----> Response: 404 (not found)")
        v.write(f"\nFAILED -> URL: {url}")
    else:
        v.write("\n-----> Response: unexpected")
        v.write(f"\nFAILED -> URL: {url}")

    connection.close()

def print_list(csv: str = None):
    v.write("")
    count = sc.count_list()
    if csv:
        csv_header(csv)
        for snapshot in sc.SNAPSHOT_COLLECTION:
            csv_write(csv, snapshot)
    if count == 0:
        v.write("\nNo snapshots found")
    else:
        __import__('pprint').pprint(sc.SNAPSHOT_COLLECTION)
        v.write(f"\n-----> {count} snapshots listed")

# create filelist
# timestamp format yyyyMMddhhmmss
def query_list(url: str, range: int, start: int, end: int, explicit: bool, mode: str):
    try:
        v.write("\nQuerying snapshots...")
        query_range = ""

        if not range:
            if start: query_range = query_range + f"&from={start}"
            if end: query_range = query_range + f"&to={end}"
        else: 
            query_range = "&from=" + str(datetime.now().year - range)

        # parse user input url and create according cdx url
        domain, subdir, filename = url_split(url)
        if domain and not subdir and not filename:
            cdx_url = f"*.{domain}/*" if not explicit else f"{domain}"
        if domain and subdir and not filename:
            cdx_url = f"{domain}/{subdir}/*"
        if domain and subdir and filename:
            cdx_url = f"{domain}/{subdir}/{filename}/*"
        if domain and not subdir and filename:
            cdx_url = f"{domain}/{filename}/*"

        v.write(f"---> {cdx_url}")
        cdxQuery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original&filter!=statuscode:200"
        cdxResult = requests.get(cdxQuery, timeout=25)
        sc.create_list(cdxResult, mode)
        v.write(f"\n-----> {sc.count_list()} snapshots found")
    except requests.exceptions.ConnectionError as e:
        v.write(f"\n-----> ERROR: could not query snapshots:\n{e}"); exit()

# example download: http://web.archive.org/web/20190815104545id_/https://www.google.com/
def download_list(output, retry, no_redirect, workers, csv: str = None):
    """
    Download a list of urls in format: [{"timestamp": "20190815104545", "url": "https://www.google.com/"}]
    """
    if sc.count_list() == 0: 
        v.write("\nNothing to download");
        return
    v.write("\nDownloading snapshots...", progress=0)
    if workers > 1:
        v.write(f"\n-----> Simultaneous downloads: {workers}")
        batch_size = sc.count_list() // workers + 1
    else:
        batch_size = sc.count_list()
    sc.create_collection()
    v.write("\n-----> Snapshots prepared")
    if csv:
        csv_header(csv)
    batch_list = [sc.SNAPSHOT_COLLECTION[i:i + batch_size] for i in range(0, len(sc.SNAPSHOT_COLLECTION), batch_size)]    
    threads = []
    worker = 0
    for batch in batch_list:
        worker += 1
        thread = threading.Thread(target=download_loop, args=(batch, output, workers, retry, no_redirect, csv))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

def download_loop(snapshot_batch, output, worker, retry, no_redirect, csv=None, attempt=1, connection=None, depth=0):
    """
    Download a list of URLs in a recursive loop. If a download fails, the function will retry the download.
    The "snapshot_collection" dictionary will be updated with the download status and file information.
    Information for each entry is written by "create_entry" and "snapshot_dict_append" functions.
    """
    if depth >= 10:
        return
    max_attempt = retry if retry > 0 else retry + 1
    failed_urls = []
    if not connection:
        connection = http.client.HTTPSConnection("web.archive.org", timeout=25)
    if attempt > max_attempt:
        connection.close()
        v.write(f"\n-----> Worker: {worker} - Failed downloads: {len(snapshot_batch)}")
        return
    for snapshot in snapshot_batch:
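        # note: snapshot_batch.index(snapshot) in the status line below rescans the
        # whole batch on every iteration - this is the line the pystack trace above stops at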
        status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{snapshot_batch.index(snapshot)+1}/{len(snapshot_batch)}] - Worker: {worker}"
        download_status = download(output, snapshot, connection, status, no_redirect, csv)
        if not download_status:
            failed_urls.append(snapshot)
        if download_status:
            v.write(progress=1)
    attempt += 1
    if failed_urls:
        if not attempt > max_attempt: 
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
        download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection, depth+1)

def download(output, snapshot_entry, connection, status_message, no_redirect=False, csv=None):
    """
    Download a single URL and save it to the specified filepath.
    If there is a redirect, the function will follow the redirect and update the download URL.
    gzip decompression is used if the response is encoded.
    According to the response status, the function will write a status message to the console and append a failed URL.
    """
    download_url = snapshot_entry["url_archive"]
    max_retries = 2
    sleep_time = 45
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
    for i in range(max_retries):
        try:
            connection.request("GET", download_url, headers=headers)
            response = connection.getresponse()
            response_data = response.read()
            response_status = response.status
            response_status_message = parse_response_code(response_status)
            sc.snapshot_entry_modify(snapshot_entry, "response", response_status)
            if not no_redirect:
                if response_status == 302:
                    status_message = f"{status_message}\n" + \
                        f"REDIRECT   -> HTTP: {response.status} - {response_status_message}\n" + \
                        f"           -> FROM: {download_url}"
                    redirect_times = 0
                    while response_status == 302:
                        redirect_times = redirect_times + 1
                        if redirect_times > 10:
                            print("Stop! Too many redirects for url" + download_url+" skipping!")
                            break
                        connection.request("GET", download_url, headers=headers)
                        response = connection.getresponse()
                        response_data = response.read()
                        response_status = response.status
                        response_status_message = parse_response_code(response_status)
                        location = response.getheader("Location")
                        if location:
                            download_url = urljoin(download_url, location)
                            status_message = f"{status_message}\n" + \
                                f"           ->   TO: {download_url}"
                            sc.snapshot_entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location))
                            sc.snapshot_entry_modify(snapshot_entry, "redirect_url", download_url)
                        else:
                            break
            if response_status == 200:
                output_file = sc.create_output(download_url, snapshot_entry["timestamp"], output)
                output_path = os.path.dirname(output_file)
                if os.path.isfile(output_path): 
                    file_move_index(output_path)
                else: 
                    os.makedirs(output_path, exist_ok=True)

                if not os.path.isfile(output_file):
                    with open(output_file, 'wb') as file:
                        if response.getheader('Content-Encoding') == 'gzip':
                            response_data = gzip.decompress(response_data)
                            file.write(response_data)
                        else:
                            file.write(response_data)
                    if os.path.isfile(output_file):
                        status_message = f"{status_message}\n" + \
                            f"SUCCESS    -> HTTP: {response_status} - {response_status_message}"
                        sc.snapshot_entry_modify(snapshot_entry, "file", output_file)
                        csv_write(csv, snapshot_entry) if csv else None

                else:
                    status_message = f"{status_message}\n" + \
                        f"EXISTING   -> HTTP: {response_status} - {response_status_message}"
                status_message = f"{status_message}\n" + \
                    f"           -> URL: {download_url}\n" + \
                    f"           -> FILE: {output_file}"
                v.write(status_message)
                return True

            else:
                status_message = f"{status_message}\n" + \
                    f"UNEXPECTED -> HTTP: {response_status} - {response_status_message}\n" + \
                    f"           -> URL: {download_url}"
                v.write(status_message)
                return True
        # exception returns false and appends the url to the failed list
        except http.client.HTTPException as e:
            status_message = f"{status_message}\n" + \
                f"EXCEPTION -> ({i+1}/{max_retries}), append to failed_urls: {download_url}\n" + \
                f"          -> {e}"
            v.write(status_message)
            return False
        # connection refused waits and retries
        except ConnectionRefusedError as e:
            status_message = f"{status_message}\n" + \
                f"REFUSED  -> ({i+1}/{max_retries}), reconnect in {sleep_time} seconds...\n" + \
                f"         -> {e}"
            v.write(status_message)
            time.sleep(sleep_time)
        except BaseException as e:
            v.write('Unable to finish request, error: '+repr(e))
            time.sleep(sleep_time)
    v.write(f"FAILED  -> download, append to failed_urls: {download_url}")
    return False

RESPONSE_CODE_DICT = {
    200: "OK",
    301: "Moved Permanently",
    302: "Found (redirect)",
    400: "Bad Request",
    403: "Forbidden",
    404: "Not Found",
    500: "Internal Server Error",
    503: "Service Unavailable"
}

def parse_response_code(response_code: int):
    """
    Parse the response code of the Wayback Machine and return a human-readable message.
    """
    if response_code in RESPONSE_CODE_DICT:
        return RESPONSE_CODE_DICT[response_code]
    return "Unknown response code"

def csv_open(csv_path: str, url: str) -> object:
    """
    Open the CSV file for writing snapshots and return the file object.
    """
    disallowed = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
    for char in disallowed:
        url = url.replace(char, '.')
    os.makedirs(os.path.abspath(csv_path), exist_ok=True)
    file = open(os.path.join(csv_path, f"waybackup_{url}.csv"), mode='w')
    return file

def csv_header(file: object):
    """
    Write the header of the CSV file.
    """
    import csv
    row = csv.DictWriter(file, sc.SNAPSHOT_COLLECTION[0].keys())
    row.writeheader()

def csv_write(file: object, snapshot: dict):
    """
    Write a snapshot to the CSV file.
    """
    import csv
    row = csv.DictWriter(file, snapshot.keys())
    row.writerow(snapshot)

def csv_close(file: object):
    """
    Close a CSV file and sort the entries by timestamp.
    """
    file.close()
    with open(file.name, 'r') as f:
        data = f.readlines()
    data[1:] = sorted(data[1:], key=lambda x: int(x.split(',')[0]))
    with open(file.name, 'w') as f:
        f.writelines(data)

Basically I added some timeout settings and 302 infinite-loop detection.

Ghost-chu commented 3 months ago

My main suspicion is thread contention; I'll try setting the workers to 1 and see if that fixes it.

bitdruid commented 3 months ago

i will have a look at this later with the url you provided. is the cpu at 100% right from the beginning?

Ghost-chu commented 3 months ago

i will have a look at this later with the url you provided. is the cpu at 100% right from the beginning?

Initially the program runs normally; after about 50 minutes it suddenly gets stuck and shows abnormal CPU usage.

Ghost-chu commented 3 months ago

Here is the command that I used:

~/.local/bin/waybackup -u http://mcbbs.net -c --csv --verbosity progress --retry 2 --workers 12 --end 20240117
Ghost-chu commented 3 months ago

Re-ran with a single worker, still got stuck:

Traceback for thread 2315809 (waybackup) [Has the GIL] (most recent call last):
    (Python) File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
        self._bootstrap_inner()
      Arguments:
        self: <Thread at 0x7f7c10e4b250>
    (Python) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
        self.run()
      Arguments:
        self: <Thread at 0x7f7c10e4b250>
    (Python) File "/usr/lib/python3.8/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      Arguments:
        self: <Thread at 0x7f7c10e4b250>
    (Python) File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 178, in download_loop

Command line: ~/.local/bin/waybackup -u http://mcbbs.net -f --csv --verbosity progress --retry 2 --workers 1 --end 20240117

Output when pressing Ctrl+C:

ghostchu@Home:~/all-time-dump$ ~/.local/bin/waybackup -u http://mcbbs.net -f --csv --verbosity progress --retry 2 --workers 1 --end 20240117

Downloading:   0%|░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 1298/1229604 [08:44<122:38:13,  2.78 snapshot/s]^CTraceback (most recent call last):
  File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
    sys.exit(main())
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
    archive.download_list(args.output, args.retry, args.no_redirect, args.workers, file)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 155, in download_list
    thread.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
Ghost-chu commented 3 months ago

(screenshot)

Ghost-chu commented 3 months ago

I think this part seems a bit strange... I don't use Python much, but I think it's missing some indentation in front of it.

    if failed_urls:
        if not attempt > max_attempt: 
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
-        download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)
+           download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)

The missing indentation causes failed_urls to be retried over and over again.
Now I will implement the fix and check the results.

bitdruid commented 3 months ago

I think this part seems a bit strange... I don't use Python much, but I think it's missing some indentation in front of it.

    if failed_urls:
        if not attempt > max_attempt: 
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
-        download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)
+           download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)

The missing indentation causes failed_urls to be retried over and over again.
Now I will implement the fix and check the results.

the retry mechanic is already a part i'm rethinking, especially whether it is really necessary. if i implement a temporary logfile to keep track of successful downloads, it could also be used to retry missing files and make the loop a bit easier to understand

Ghost-chu commented 3 months ago

I think this part seems a bit strange... I don't use Python much, but I think it's missing some indentation in front of it.

    if failed_urls:
        if not attempt > max_attempt: 
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
-        download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)
+           download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)

The missing indentation causes failed_urls to be retried over and over again. Now I will implement the fix and check the results.

the retry mechanic is already a part i'm rethinking, especially whether it is really necessary. if i implement a temporary logfile to keep track of successful downloads, it could also be used to retry missing files and make the loop a bit easier to understand

I remember requests already has a built-in retry mechanic; we could use that.

Also, I noticed the requests don't have a timeout setting, so if one gets stuck while reading something, it will hang forever.

bitdruid commented 3 months ago

requests does - but you can't use the requests lib here because you can't carry over the "bare" connection; a new connection would be created each time requests is called, which then results in a ConnectionRefusedError.
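
for reference, the built-in retry mechanic mentioned above comes from urllib3 and is wired up through an HTTPAdapter - a sketch only, since as said it does not fit the per-worker bare connection here:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# session-level retry policy with backoff; the explicit timeout also covers
# the "hangs forever while reading" case mentioned above
session = requests.Session()
retries = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
response = session.get("https://web.archive.org/web/20190815104545id_/https://www.google.com/", timeout=25)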

Ghost-chu commented 3 months ago

It has been 3 hours since the patch was applied and everything seems fine and working properly; waybackup has fetched over 90,000 snapshots from the Wayback Machine without any issues. I'm keeping an eye on it.

bitdruid commented 3 months ago

It has been 3 hours since the patch was applied and everything seems fine and working properly; waybackup has fetched over 90,000 snapshots from the Wayback Machine without any issues. I'm keeping an eye on it.

thank you for your testing :) meanwhile i have implemented your feature suggestion as file-skipping for existing downloads. i may replace the --retry functionality with a forced creation of this download log.

feel free to PR your changes into dev so i can review them later

Ghost-chu commented 3 months ago

It has been 3 hours since the patch was applied and everything seems fine and working properly; waybackup has fetched over 90,000 snapshots from the Wayback Machine without any issues. I'm keeping an eye on it.

thank you for your testing :) meanwhile i have implemented your feature suggestion as file-skipping for existing downloads. i may replace the --retry functionality with a forced creation of this download log.

feel free to PR your changes into dev so i can review them later

I need to keep testing. I'm finding that the 100% CPU usage still exists, but at least for now the app doesn't completely stop responding and still continues to download files.

Ghost-chu commented 3 months ago

It is very confusing that the high CPU situation recovers on its own after a period of time.

bitdruid commented 3 months ago

It is very confusing that the high CPU situation recovers on its own after a period of time.

i will check that out as soon as file-skipping is working as intended. and i will add a json output of example.com for testing... the cdx server does not like that many requests

Ghost-chu commented 3 months ago

Thank you for your hard work, I will test it when it's available. My patch didn't seem to work and it stopped responding again. It's like all the workers are crashing, but without any error messages. The speed gradually drops until it stops completely. (Still 100% CPU.)

Ghost-chu commented 3 months ago

py-spy output when frozen:

(py-spy screenshot)

For some reason, the GIL took up to 90%+.

Ghost-chu commented 3 months ago

IsADirectoryError came back again!

ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --csv --verbosity progress --retry 2 --skip --workers 8 --end 20240117

Downloading:   3%|██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 19817/640281 [29:39<16:48:03, 10.26 snapshot/s]Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 243, in download
    with open(output_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots/attachment.mcbbs.net/uc_server/data/avatar/001/74/40/69_avatar_big.jpg'

I'm on https://github.com/bitdruid/python-wayback-machine-downloader/commit/334cb3f5f459d1c1c43c81a825797b2a2cdb736a

Ghost-chu commented 3 months ago

Exception when pressing Ctrl+C to exit waybackup:

^CClosing files
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 261, in download
    csv_write(csvfile, snapshot_entry) if csvfile else None
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 335, in csv_write
    row.writerow(snapshot)
  File "/usr/lib/python3.8/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
ValueError: I/O operation on closed file.
Traceback (most recent call last):
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 36, in main
    archive.download_list(args.output, args.retry, args.no_redirect, args.workers, csvfile, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 154, in download_list
    thread.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
    sys.exit(main())
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 40, in main
    archive.csv_close(csvfile) if csvfile else None
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 343, in csv_close
    data = f.readlines()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 2203: invalid start byte
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 261, in download
    csv_write(csvfile, snapshot_entry) if csvfile else None
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 335, in csv_write
    row.writerow(snapshot)
  File "/usr/lib/python3.8/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
ValueError: I/O operation on closed file.

I'm on https://github.com/bitdruid/python-wayback-machine-downloader/commit/334cb3f5f459d1c1c43c81a825797b2a2cdb736a

Ghost-chu commented 3 months ago

"File name too long" exception:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 243, in download
    with open(output_file, 'wb') as file:
OSError: [Errno 36] File name too long: '/home/ghostchu/dump/waybackup_snapshots/attachment.mcbbs.net/data/myattachment/forum/202301/15/185650rku6v589vuvfuovz.png?sign=q-sign-algorithm%3Dsha1%26q-ak%3DAKIDhlX3jQnP3QXFlkrkdagVJbyAEYdqrakl%26q-sign-time%3D1687544353%3B1687546213%26q-key-time%3D1687544353%3B1687546213%26q-header-list%3Dhost%26q-url-param-list%3Dresponse-content-disposition%26q-signature%3D6af65274d7bc8e9235867c3f0f42c218d5c667da&response-content-disposition=attachment%3B filename%3D%22YtXYvV.png%22'

I'm on https://github.com/bitdruid/python-wayback-machine-downloader/commit/334cb3f5f459d1c1c43c81a825797b2a2cdb736a
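
A common workaround for this (just a sketch, not what pywaybackup does) would be to cap the file name length and append a hash of the original name so distinct urls stay distinct:

import hashlib
import os

MAX_NAME_BYTES = 200  # headroom below the usual 255-byte per-component limit

def shorten_filename(path: str) -> str:
    directory, name = os.path.split(path)
    if len(name.encode("utf-8")) <= 255:
        return path
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:12]
    # keep a recognizable prefix, truncated at byte level to stay under the limit
    prefix = name.encode("utf-8")[:MAX_NAME_BYTES].decode("utf-8", "ignore")
    return os.path.join(directory, f"{prefix}_{digest}")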

bitdruid commented 3 months ago

nearly done. i'm now trying to reproduce the encoding exception

Ghost-chu commented 3 months ago

The new dev doesn't work at all, either with or without the snapshots directory:

ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117
Traceback (most recent call last):
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
    skipfile, skipset = archive.skip_open(args.skip, args.url) if args.skip else (None, None)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 380, in skip_open
    skipfile = open(skipset_path, mode='r+')
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots'

Traceback (most recent call last):
  File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
    sys.exit(main())
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
    archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup --cdxinject
usage: waybackup [-h] [-a] [-u] (-c | -f | -s) [-l] [-e] [-o] [-r] [--start] [--end] [--skip ] [--csv ] [--cdx ] [--no-redirect]
                 [--verbosity] [--retry] [--workers] [--cdxbackup [path] | --cdxinject [path]]
waybackup: error: argument --cdxinject: not allowed with argument --cdxbackup
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup
Traceback (most recent call last):
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 380, in skip_open
    skipfile = open(skipset_path, mode='r+')
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
    sys.exit(main())
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
    archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
ghostchu@Home:~/dump$ mv waybackup_snapshots waybackup_snapshotsaaa
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup
Traceback (most recent call last):
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
    skipfile, skipset = archive.skip_open(args.skip, args.url) if args.skip else (None, None)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 383, in skip_open
    skipfile = open(default_path, mode='w')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ghostchu/dump/waybackup_snapshots/waybackup_mcbbs.net.skip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
    sys.exit(main())
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
    archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
bitdruid commented 3 months ago

i know, dev is unstable and not done

Ghost-chu commented 3 months ago

nearly done. i'm now trying to reproduce the encoding exception

I don't know if this can help, but when I was on bug-cleanup, I encountered the same kind of error, but with the ASCII codec:

Downloading:   0%|░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 2532/640281 [02:01<2:45:17, 64.30 snapshot/s]|
Exception in thread Thread-4:                                                                                                           
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 189, in download_loop
    download_status = download(output, snapshot, connection, status, no_redirect, skipset)
  File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 221, in download
    connection.request("GET", download_url, headers=headers)
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1267, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.8/http/client.py", line 1105, in putrequest
    self._output(self._encode_request(request))
  File "/usr/lib/python3.8/http/client.py", line 1185, in _encode_request
    return request.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 100-108: ordinal not in range(128)
bitdruid commented 3 months ago

which snapshot or domain was requested?

Ghost-chu commented 3 months ago

which snapshot or domain was requested?

mcbbs.net
Ghost-chu commented 3 months ago

I've noticed a design mistake in the skip feature - namely that the skipset doesn't get saved to a file unless the program exits normally, so the skip data is lost when the program crashes (which makes it meaningless for me, since I always have to Ctrl+C to exit when the program stops responding).
I think the skipset should be saved to a file after every N URLs to avoid crashing and losing progress.
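
Roughly what I mean (a minimal sketch; the interval and class name are made up):

FLUSH_EVERY = 100  # the "N" from above, value picked arbitrarily

class SkipLog:
    # append finished urls to disk every N entries instead of only on a clean exit
    def __init__(self, path: str):
        self.path = path
        self.pending = []

    def add(self, url: str):
        self.pending.append(url)
        if len(self.pending) >= FLUSH_EVERY:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # append-only writes keep the extra I/O small while surviving a crash or Ctrl+C
        with open(self.path, "a") as f:
            f.write("\n".join(self.pending) + "\n")
        self.pending.clear()

(With multiple workers, a threading.Lock around add/flush would also be needed.)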

bitdruid commented 3 months ago

I've noticed a design mistake in the skip feature - namely that the skipset doesn't get saved to a file unless the program exits normally, so the skip data is lost when the program crashes (which makes it meaningless for me, since I always have to Ctrl+C to exit when the program stops responding). I think the skipset should be saved to a file after every N URLs to avoid crashing and losing progress.

i'm currently experimenting with custom exception handling because the overwhelming tracebacks are driving me nuts. however, i'd like to keep writing the logfiles only at the end (in a finally), because this reduces I/O to an absolute minimum

bitdruid commented 3 months ago

which snapshot or domain was requested?

mcbbs.net

i encoded the url for downloading - maybe that did the trick.
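
such an encoding step could look roughly like this (a sketch, not necessarily the exact change):

from urllib.parse import quote

def ascii_safe_url(url: str) -> str:
    # percent-encode non-ascii characters so http.client can encode the request
    # line as ascii; reserved url characters stay untouched
    return quote(url, safe=":/?&=%~#+!$,;'@()*[]")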

tested now with a 175,000-result cdx query - downloaded 36,000 snapshots without an error. all issues here should be resolved. if you find another one, let me know.