Closed: lostmagicblue closed this issue 3 months ago
I can also reproduce it with the same error: the workers crash one by one and pages/s seems to drop.
one of the problems still seems to be the url-parsing.
this snapshot unexpectedly produces a blank page: http://web.archive.org/web/20210318001217id_/http://bbs.eastsea.com.cn/static/image/common
because the cdx returns "…/common", the parsing handles "common" as a file, but it was added as a folder before (it contains images). i think this problem only arises on the --current download, since it merges multiple versions of the folders.
we could save it to a special file like _____ROOT_____
yup, my first thought was to just add an index.html if the folder itself already exists. 🤔
However, after some inspection: the above snapshot contains a redirect to ".../common/", and the parsing would already handle that case and create an index.html. So the problem here is that the current logic uses the snapshot URL served by the cdx-server to generate the output, instead of the redirected URL.
I'm going to change the output generation to fix this, but the snapshots will still be stored under the timestamp originally sent by the CDX server, as I think this is the best representation of the result served by the CDX server for the user's query.
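For illustration only, a minimal sketch of how such a file-vs-folder collision could be resolved (hypothetical helper, not the actual pywaybackup code): if the target path already exists as a directory, store the content as its index.html; if a parent component exists as a file, move that file aside to `<parent>/index.html` first.

```python
import os
import shutil

def resolve_collision(path: str) -> str:
    """Return a safe file path for `path`, handling file/folder collisions.

    - If `path` already exists as a directory (e.g. ".../common" was created
      as a folder by an earlier snapshot), store the content as its index.html.
    - If the parent component of `path` exists as a *file*, move that file to
      <parent>/index.html so the directory can be created.
    """
    if os.path.isdir(path):
        return os.path.join(path, "index.html")
    parent = os.path.dirname(path)
    if os.path.isfile(parent):
        tmp = parent + ".tmp"
        shutil.move(parent, tmp)            # free the name for the directory
        os.makedirs(parent, exist_ok=True)
        shutil.move(tmp, os.path.join(parent, "index.html"))
    else:
        os.makedirs(parent, exist_ok=True)
    return path
```

This handles both directions of the conflict, which matters for --current downloads where the same URL can appear once as a page and once as a folder.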
Thank you very much. Looking forward to your good news
@Ghost-chu @lostmagicblue please try the new version on pip - does it now work for you as expected? tried your provided urls and was able to download everything so far
OK now, thank you! Awesome!!! Impressive!!!
Thanks, I'm trying the dev branch. As it stands, the URL problem has been fixed.
However, while troubleshooting pulls of oversized sites, the program stops responding and goes into a dead loop (CPU: 100%).
could you provide a snapshot which causes this issue?
Unfortunately, the console does not output any error messages; the progress bar just stops updating. The top command shows 100% CPU usage (a single core fully occupied).
Now I've made some changes to the code to add checks at the locations where loops may occur, hoping it will indicate where the problem is happening. I am re-pulling the data from the Wayback Machine and trying to reproduce the issue.
Another thing I would very much like is for the program to remember files that have already been downloaded. The site I am working on has 640281 files (current version only), so every time the program crashes, everything gets reset.
This obviously wastes time and network traffic, and puts extra load on archive.org's servers.
good feature. we could use the existing csv, or just put every successfully downloaded URL into a temporary txt file
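A minimal sketch of that idea (hypothetical names, not the eventual --skip implementation): append each successfully downloaded archive URL to a plain-text log, and load it into a set on startup so finished snapshots can be skipped after a crash.

```python
import os

def skip_load(path: str) -> set:
    """Load previously completed URLs, one per line."""
    if not os.path.isfile(path):
        return set()
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}

def skip_add(path: str, url: str, skipset: set) -> None:
    """Record a finished download; append-only, so a crash loses at most one line."""
    skipset.add(url)
    with open(path, "a") as f:
        f.write(url + "\n")
```

Before each download the worker would then check `if url in skipset: continue`, so a restarted run only fetches what is still missing.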
After running for 50 minutes, it got stuck again; here is a screenshot.
sudo -E ~/.local/bin/pystack remote 2315089 --locals --no-block
Traceback for thread 2315103 (waybackup) [] (most recent call last):
(Python) File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
Arguments:
self: <Thread at 0x7fcfddd12ca0>
(Python) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
Arguments:
self: <Thread at 0x7fcfddd12ca0>
(Python) File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
Arguments:
self: <Thread at 0x7fcfddd12ca0>
(Python) File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 178, in download_loop
status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{snapshot_batch.index(snapshot)+1}/{len(snapshot_batch)}] - Worker: {worker}"
I hope this can be helpful.
And here is the modified archive.py; I just added some checks and didn't change the code logic.
import requests
import os
import gzip
import threading
import time
import http.client
from urllib.parse import urljoin
from datetime import datetime, timezone
from pywaybackup.helper import url_get_timestamp, url_split, file_move_index
from pywaybackup.SnapshotCollection import SnapshotCollection as sc
from pywaybackup.Verbosity import Verbosity as v
# GET: store page to wayback machine and response with redirect to snapshot
# POST: store page to wayback machine and response with wayback machine status-page
# tag_jobid = '<script>spn.watchJob("spn2-%s", "/_static/",6000);</script>'
# tag_result_timeout = '<p>The same snapshot had been made %s minutes ago. You can make new capture of this URL after 1 hour.</p>'
# tag_result_success = ' A snapshot was captured. Visit page: <a href="%s">%s</a>'
def save_page(url: str):
    """
    Saves a webpage to the Wayback Machine.

    Args:
        url (str): The URL of the webpage to be saved.

    Returns:
        None: The function does not return any value. It only prints messages to the console.
    """
    v.write("\nSaving page to the Wayback Machine...")
    connection = http.client.HTTPSConnection("web.archive.org")
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    connection.request("GET", f"https://web.archive.org/save/{url}", headers=headers)
    v.write("\n-----> Request sent")
    response = connection.getresponse()
    response_status = response.status
    if response_status == 302:
        location = response.getheader("Location")
        v.write("\n-----> Response: 302 (redirect to snapshot)")
        snapshot_timestamp = datetime.strptime(url_get_timestamp(location), '%Y%m%d%H%M%S').strftime('%Y-%m-%d %H:%M:%S')
        current_timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
        timestamp_difference = (datetime.strptime(current_timestamp, '%Y-%m-%d %H:%M:%S') - datetime.strptime(snapshot_timestamp, '%Y-%m-%d %H:%M:%S')).seconds / 60
        timestamp_difference = int(round(timestamp_difference, 0))
        if timestamp_difference < 1:
            v.write("\n-----> New snapshot created")
        elif timestamp_difference > 1:
            v.write(f"\n-----> Snapshot already exists. (1 hour limit) - wait for {60 - timestamp_difference} minutes")
            v.write(f"TIMESTAMP SNAPSHOT: {snapshot_timestamp}")
            v.write(f"TIMESTAMP REQUEST : {current_timestamp}")
            v.write(f"\nLAST SNAPSHOT BACK: {timestamp_difference} minutes")
        v.write(f"\nURL: {location}")
    elif response_status == 404:
        v.write("\n-----> Response: 404 (not found)")
        v.write(f"\nFAILED -> URL: {url}")
    else:
        v.write("\n-----> Response: unexpected")
        v.write(f"\nFAILED -> URL: {url}")
    connection.close()
def print_list(csv: str = None):
    v.write("")
    count = sc.count_list()
    if csv:
        csv_header(csv)
        for snapshot in sc.SNAPSHOT_COLLECTION:
            csv_write(csv, snapshot)
    if count == 0:
        v.write("\nNo snapshots found")
    else:
        from pprint import pprint
        pprint(sc.SNAPSHOT_COLLECTION)
    v.write(f"\n-----> {count} snapshots listed")
# create filelist
# timestamp format yyyyMMddhhmmss
def query_list(url: str, range: int, start: int, end: int, explicit: bool, mode: str):
    try:
        v.write("\nQuerying snapshots...")
        query_range = ""
        if not range:
            if start: query_range = query_range + f"&from={start}"
            if end: query_range = query_range + f"&to={end}"
        else:
            query_range = "&from=" + str(datetime.now().year - range)
        # parse user input url and create according cdx url
        domain, subdir, filename = url_split(url)
        if domain and not subdir and not filename:
            cdx_url = f"*.{domain}/*" if not explicit else f"{domain}"
        if domain and subdir and not filename:
            cdx_url = f"{domain}/{subdir}/*"
        if domain and subdir and filename:
            cdx_url = f"{domain}/{subdir}/{filename}/*"
        if domain and not subdir and filename:
            cdx_url = f"{domain}/{filename}/*"
        v.write(f"---> {cdx_url}")
        cdxQuery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original&filter=statuscode:200"
        cdxResult = requests.get(cdxQuery, timeout=25)
        sc.create_list(cdxResult, mode)
        v.write(f"\n-----> {sc.count_list()} snapshots found")
    except requests.exceptions.ConnectionError as e:
        v.write(f"\n-----> ERROR: could not query snapshots:\n{e}")
        exit()
# example download: http://web.archive.org/web/20190815104545id_/https://www.google.com/
def download_list(output, retry, no_redirect, workers, csv: str = None):
    """
    Download a list of urls in format: [{"timestamp": "20190815104545", "url": "https://www.google.com/"}]
    """
    if sc.count_list() == 0:
        v.write("\nNothing to download")
        return
    v.write("\nDownloading snapshots...", progress=0)
    if workers > 1:
        v.write(f"\n-----> Simultaneous downloads: {workers}")
        batch_size = sc.count_list() // workers + 1
    else:
        batch_size = sc.count_list()
    sc.create_collection()
    v.write("\n-----> Snapshots prepared")
    if csv:
        csv_header(csv)
    batch_list = [sc.SNAPSHOT_COLLECTION[i:i + batch_size] for i in range(0, len(sc.SNAPSHOT_COLLECTION), batch_size)]
    threads = []
    worker = 0
    for batch in batch_list:
        worker += 1
        thread = threading.Thread(target=download_loop, args=(batch, output, worker, retry, no_redirect, csv))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
def download_loop(snapshot_batch, output, worker, retry, no_redirect, csv=None, attempt=1, connection=None, depth=0):
    """
    Download a list of URLs in a recursive loop. If a download fails, the function will retry the download.
    The "snapshot_collection" dictionary will be updated with the download status and file information.
    Information for each entry is written by "create_entry" and "snapshot_dict_append" functions.
    """
    if depth >= 10:
        return
    max_attempt = retry if retry > 0 else retry + 1
    failed_urls = []
    if not connection:
        connection = http.client.HTTPSConnection("web.archive.org", timeout=25)
    if attempt > max_attempt:
        connection.close()
        v.write(f"\n-----> Worker: {worker} - Failed downloads: {len(snapshot_batch)}")
        return
    for snapshot in snapshot_batch:
        status = f"\n-----> Attempt: [{attempt}/{max_attempt}] Snapshot [{snapshot_batch.index(snapshot)+1}/{len(snapshot_batch)}] - Worker: {worker}"
        download_status = download(output, snapshot, connection, status, no_redirect, csv)
        if not download_status:
            failed_urls.append(snapshot)
        if download_status:
            v.write(progress=1)
    attempt += 1
    if failed_urls:
        if not attempt > max_attempt:
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
        download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection, depth+1)
def download(output, snapshot_entry, connection, status_message, no_redirect=False, csv=None):
    """
    Download a single URL and save it to the specified filepath.
    If there is a redirect, the function will follow the redirect and update the download URL.
    gzip decompression is used if the response is encoded.
    According to the response status, the function will write a status message to the console and append a failed URL.
    """
    download_url = snapshot_entry["url_archive"]
    max_retries = 2
    sleep_time = 45
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}
    for i in range(max_retries):
        try:
            connection.request("GET", download_url, headers=headers)
            response = connection.getresponse()
            response_data = response.read()
            response_status = response.status
            response_status_message = parse_response_code(response_status)
            sc.snapshot_entry_modify(snapshot_entry, "response", response_status)
            if not no_redirect:
                if response_status == 302:
                    status_message = f"{status_message}\n" + \
                        f"REDIRECT -> HTTP: {response.status} - {response_status_message}\n" + \
                        f" -> FROM: {download_url}"
                    redirect_times = 0
                    while response_status == 302:
                        redirect_times = redirect_times + 1
                        if redirect_times > 10:
                            print("Stop! Too many redirects for url " + download_url + ", skipping!")
                            break
                        connection.request("GET", download_url, headers=headers)
                        response = connection.getresponse()
                        response_data = response.read()
                        response_status = response.status
                        response_status_message = parse_response_code(response_status)
                        location = response.getheader("Location")
                        if location:
                            download_url = urljoin(download_url, location)
                            status_message = f"{status_message}\n" + \
                                f" -> TO: {download_url}"
                            sc.snapshot_entry_modify(snapshot_entry, "redirect_timestamp", url_get_timestamp(location))
                            sc.snapshot_entry_modify(snapshot_entry, "redirect_url", download_url)
                        else:
                            break
            if response_status == 200:
                output_file = sc.create_output(download_url, snapshot_entry["timestamp"], output)
                output_path = os.path.dirname(output_file)
                if os.path.isfile(output_path):
                    file_move_index(output_path)
                else:
                    os.makedirs(output_path, exist_ok=True)
                if not os.path.isfile(output_file):
                    with open(output_file, 'wb') as file:
                        if response.getheader('Content-Encoding') == 'gzip':
                            response_data = gzip.decompress(response_data)
                            file.write(response_data)
                        else:
                            file.write(response_data)
                    if os.path.isfile(output_file):
                        status_message = f"{status_message}\n" + \
                            f"SUCCESS -> HTTP: {response_status} - {response_status_message}"
                        sc.snapshot_entry_modify(snapshot_entry, "file", output_file)
                        csv_write(csv, snapshot_entry) if csv else None
                else:
                    status_message = f"{status_message}\n" + \
                        f"EXISTING -> HTTP: {response_status} - {response_status_message}"
                status_message = f"{status_message}\n" + \
                    f" -> URL: {download_url}\n" + \
                    f" -> FILE: {output_file}"
                v.write(status_message)
                return True
            else:
                status_message = f"{status_message}\n" + \
                    f"UNEXPECTED -> HTTP: {response_status} - {response_status_message}\n" + \
                    f" -> URL: {download_url}"
                v.write(status_message)
                return True
        # exception returns false and appends the url to the failed list
        except http.client.HTTPException as e:
            status_message = f"{status_message}\n" + \
                f"EXCEPTION -> ({i+1}/{max_retries}), append to failed_urls: {download_url}\n" + \
                f" -> {e}"
            v.write(status_message)
            return False
        # connection refused waits and retries
        except ConnectionRefusedError as e:
            status_message = f"{status_message}\n" + \
                f"REFUSED -> ({i+1}/{max_retries}), reconnect in {sleep_time} seconds...\n" + \
                f" -> {e}"
            v.write(status_message)
            time.sleep(sleep_time)
        except BaseException as e:
            v.write('Unable to finish request, error: ' + repr(e))
            time.sleep(sleep_time)
    v.write(f"FAILED -> download, append to failed_urls: {download_url}")
    return False
RESPONSE_CODE_DICT = {
    200: "OK",
    301: "Moved Permanently",
    302: "Found (redirect)",
    400: "Bad Request",
    403: "Forbidden",
    404: "Not Found",
    500: "Internal Server Error",
    503: "Service Unavailable"
}

def parse_response_code(response_code: int):
    """
    Parse the response code of the Wayback Machine and return a human-readable message.
    """
    if response_code in RESPONSE_CODE_DICT:
        return RESPONSE_CODE_DICT[response_code]
    return "Unknown response code"
def csv_open(csv_path: str, url: str) -> object:
    """
    Open the CSV file for writing snapshots and return the file object.
    """
    disallowed = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
    for char in disallowed:
        url = url.replace(char, '.')
    os.makedirs(os.path.abspath(csv_path), exist_ok=True)
    file = open(os.path.join(csv_path, f"waybackup_{url}.csv"), mode='w')
    return file

def csv_header(file: object):
    """
    Write the header of the CSV file.
    """
    import csv
    row = csv.DictWriter(file, sc.SNAPSHOT_COLLECTION[0].keys())
    row.writeheader()

def csv_write(file: object, snapshot: dict):
    """
    Write a snapshot to the CSV file.
    """
    import csv
    row = csv.DictWriter(file, snapshot.keys())
    row.writerow(snapshot)

def csv_close(file: object):
    """
    Close a CSV file and sort the entries by timestamp.
    """
    file.close()
    with open(file.name, 'r') as f:
        data = f.readlines()
    data[1:] = sorted(data[1:], key=lambda x: int(x.split(',')[0]))
    with open(file.name, 'w') as f:
        f.writelines(data)
Basically, I added some timeout settings and detection for infinite 302 redirect loops.
I'm putting my point of suspicion on thread contention; I'll try setting workers to 1 and see if that fixes it.
i will have a look at this later with your provided url. is the cpu at 100% right at the beginning?
Initially the program runs normally; at about 50 minutes it suddenly gets stuck and shows abnormal CPU usage.
Here is the command I used:
~/.local/bin/waybackup -u http://mcbbs.net -c --csv --verbosity progress --retry 2 --workers 12 --end 20240117
Re-ran with a single worker, still gets stuck:
Traceback for thread 2315809 (waybackup) [Has the GIL] (most recent call last):
(Python) File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
Arguments:
self: <Thread at 0x7f7c10e4b250>
(Python) File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
Arguments:
self: <Thread at 0x7f7c10e4b250>
(Python) File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
Arguments:
self: <Thread at 0x7f7c10e4b250>
(Python) File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 178, in download_loop
Command line: ~/.local/bin/waybackup -u http://mcbbs.net -f --csv --verbosity progress --retry 2 --workers 1 --end 20240117
Output when pressing Ctrl+C:
ghostchu@Home:~/all-time-dump$ ~/.local/bin/waybackup -u http://mcbbs.net -f --csv --verbosity progress --retry 2 --workers 1 --end 20240117
Downloading: 0%|░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 1298/1229604 [08:44<122:38:13, 2.78 snapshot/s]^CTraceback (most recent call last):
File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
sys.exit(main())
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
archive.download_list(args.output, args.retry, args.no_redirect, args.workers, file)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 155, in download_list
thread.join()
File "/usr/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
I think this part seems a bit strange... I don't use Python much, but I think it's missing some indentation in front of it.

    if failed_urls:
        if not attempt > max_attempt:
            v.write(f"\n-----> Worker: {worker} - Retry Timeout: 15 seconds")
            time.sleep(15)
-       download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)
+           download_loop(failed_urls, output, worker, retry, no_redirect, csv, attempt, connection)

The missing indentation causes failed_urls to be retried over and over again.
Now I will implement the fix and check the results.
the retry mechanic is already a part I'm rethinking, especially whether it is really necessary. if I implement a temporary logfile to keep track of successful downloads, it could also be used to retry missing files, and so make the loop a bit easier to understand
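For what it's worth, the retry logic could also be written iteratively instead of recursively, which sidesteps indentation-sensitive recursive calls entirely. A sketch under simplified assumptions (`do_download` is a stand-in for the real download function; not the project's actual code):

```python
import time

def retry_downloads(snapshots, do_download, max_attempt=3, retry_wait=0):
    """Iterative retry loop: each pass retries only the snapshots that failed.

    Terminates after at most `max_attempt` passes, so there is no unbounded
    recursion and no recursive call whose indentation level matters.
    Returns the list of snapshots that still failed after all attempts.
    """
    pending = list(snapshots)
    for attempt in range(1, max_attempt + 1):
        failed = [snap for snap in pending if not do_download(snap)]
        if not failed:
            return []
        if attempt < max_attempt and retry_wait:
            time.sleep(retry_wait)  # back off before the next pass
        pending = failed
    return pending
```

The returned list maps directly onto the "failed downloads" report, and a temporary success-log (as suggested above) could seed `snapshots` on restart.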
i remember requests already has a built-in retry mechanic; we could use that.
also i noticed the requests don't have a timeout setting, so if it gets stuck while reading something, it will hang forever.
requests does, but you can't use the requests lib here because you can't carry over the "bare" connection; that results in a new connection being created each time requests is called, which then results in a ConnectionRefusedError.
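To illustrate the constraint: with `http.client` a worker can keep one "bare" keep-alive connection open across many requests, and the timeout passed at construction also bounds every blocking read on the socket, which would address the hang mentioned above. A sketch, not the actual patch:

```python
import http.client

def make_connection(host: str = "web.archive.org", timeout: int = 25) -> http.client.HTTPSConnection:
    """One reusable keep-alive connection per worker thread.

    The timeout applies to the connect *and* to every blocking socket read,
    so a stalled response raises socket.timeout instead of hanging forever.
    """
    return http.client.HTTPSConnection(host, timeout=timeout)
```

The worker would call `connection.request(...)` / `connection.getresponse()` on this one object in a loop, recreating it only after an exception, instead of opening a fresh TCP connection per snapshot.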
It has been 3 hours since the patch was applied; everything seems fine and working properly. waybackup has fetched over 90,000 snapshots from the Wayback Machine without any issues. I'm keeping an eye on it.
thank you for your testing :) meanwhile I have implemented your feature suggestion with file-skipping for existing downloads. I may replace the --retry functionality with a forced creation of this download log.
feel free to PR your changes into dev so I can review them later
I need to keep testing. I'm finding that the 100% CPU usage still exists, but at least for now the app doesn't completely stop responding and still continues to download files.
It is very confusing that the high-CPU situation recovers on its own after a period of time.
i will check that out as soon as file-skipping is working as intended. and i will add a JSON output of example.com for testing... the cdx server does not like that many requests
Thank you for your hard work, I will test it when available. My patch didn't seem to work and it stopped responding again. It's like all the workers are crashing, but without any error messages; the speed gradually drops until it stops completely. (Still 100% CPU.)
py-spy output when it froze: for some reason, the GIL took up 90%+.
The IsADirectoryError came back again!
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --csv --verbosity progress --retry 2 --skip --workers 8 --end 20240117
Downloading: 3%|██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 19817/640281 [29:39<16:48:03, 10.26 snapshot/s]Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 243, in download
with open(output_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots/attachment.mcbbs.net/uc_server/data/avatar/001/74/40/69_avatar_big.jpg'
Exception when pressing Ctrl+C to exit waybackup:
^CClosing files
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 261, in download
csv_write(csvfile, snapshot_entry) if csvfile else None
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 335, in csv_write
row.writerow(snapshot)
File "/usr/lib/python3.8/csv.py", line 154, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
ValueError: I/O operation on closed file.
Traceback (most recent call last):
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 36, in main
archive.download_list(args.output, args.retry, args.no_redirect, args.workers, csvfile, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 154, in download_list
thread.join()
File "/usr/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
sys.exit(main())
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 40, in main
archive.csv_close(csvfile) if csvfile else None
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 343, in csv_close
data = f.readlines()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 2203: invalid start byte
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 261, in download
csv_write(csvfile, snapshot_entry) if csvfile else None
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 335, in csv_write
row.writerow(snapshot)
File "/usr/lib/python3.8/csv.py", line 154, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
ValueError: I/O operation on closed file.
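The "I/O operation on closed file" tracebacks suggest worker threads are still writing to the shared CSV handle after the main thread closed it during Ctrl+C shutdown. A hedged sketch of one possible guard (a module-level lock plus a closed-check; hypothetical, not the project's fix):

```python
import csv
import threading

# serializes all CSV writes from the worker threads
csv_lock = threading.Lock()

def csv_write_safe(file, snapshot: dict) -> bool:
    """Write one snapshot row; skip silently once the file has been closed.

    Returns True if the row was written, False if shutdown is in progress.
    """
    with csv_lock:
        if file is None or file.closed:
            return False  # drop the row instead of raising ValueError
        csv.DictWriter(file, snapshot.keys()).writerow(snapshot)
        return True
```

The same lock would have to be held by `csv_close` while it closes the handle, so no worker can be mid-write when the file goes away.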
FileNameTooLong Exception:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 176, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect, csvfile, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 243, in download
with open(output_file, 'wb') as file:
OSError: [Errno 36] File name too long: '/home/ghostchu/dump/waybackup_snapshots/attachment.mcbbs.net/data/myattachment/forum/202301/15/185650rku6v589vuvfuovz.png?sign=q-sign-algorithm%3Dsha1%26q-ak%3DAKIDhlX3jQnP3QXFlkrkdagVJbyAEYdqrakl%26q-sign-time%3D1687544353%3B1687546213%26q-key-time%3D1687544353%3B1687546213%26q-header-list%3Dhost%26q-url-param-list%3Dresponse-content-disposition%26q-signature%3D6af65274d7bc8e9235867c3f0f42c218d5c667da&response-content-disposition=attachment%3B filename%3D%22YtXYvV.png%22'
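Most Linux filesystems cap a single path component at 255 bytes, and these archive URLs carry their query strings into the filename. One possible mitigation, sketched with a hypothetical helper (not part of waybackup), is to keep the extension and replace the overlong middle with a short hash so the name stays unique but fits:

```python
import hashlib
import os

MAX_NAME_BYTES = 255  # typical per-component limit on ext4/xfs

def shorten_component(name: str) -> str:
    """Shrink an overlong path component to fit the filesystem limit.

    Keeps the extension and a truncated stem, and appends a short SHA-1
    digest of the full original name to avoid collisions between URLs
    that only differ in the truncated part.
    """
    if len(name.encode("utf-8")) <= MAX_NAME_BYTES:
        return name
    stem, ext = os.path.splitext(name)
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:12]
    keep = MAX_NAME_BYTES - len(digest) - len(ext.encode("utf-8")) - 1
    truncated = stem.encode("utf-8")[:keep].decode("utf-8", "ignore")
    return truncated + "-" + digest + ext
```

Applied to each component of the output path before `open()`, this would turn the OSError above into a successfully saved (if renamed) file.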
Nearly done. Lastly, I'm trying to reproduce the encoding exception.
The new dev doesn't work at all, with or without the snapshots directory:
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117
Traceback (most recent call last):
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
skipfile, skipset = archive.skip_open(args.skip, args.url) if args.skip else (None, None)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 380, in skip_open
skipfile = open(skipset_path, mode='r+')
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots'
Traceback (most recent call last):
File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
sys.exit(main())
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup --cdxinject
usage: waybackup [-h] [-a] [-u] (-c | -f | -s) [-l] [-e] [-o] [-r] [--start] [--end] [--skip ] [--csv ] [--cdx ] [--no-redirect]
[--verbosity] [--retry] [--workers] [--cdxbackup [path] | --cdxinject [path]]
waybackup: error: argument --cdxinject: not allowed with argument --cdxbackup
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup
Traceback (most recent call last):
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 380, in skip_open
skipfile = open(skipset_path, mode='r+')
IsADirectoryError: [Errno 21] Is a directory: '/home/ghostchu/dump/waybackup_snapshots'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
sys.exit(main())
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
ghostchu@Home:~/dump$ mv waybackup_snapshots waybackup_snapshotsaaa
ghostchu@Home:~/dump$ ~/.local/bin/waybackup -u http://mcbbs.net -c --verbosity progress --workers 10 --csv --skip --end 20240117 --cdxbackup
Traceback (most recent call last):
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 32, in main
skipfile, skipset = archive.skip_open(args.skip, args.url) if args.skip else (None, None)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 383, in skip_open
skipfile = open(default_path, mode='w')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ghostchu/dump/waybackup_snapshots/waybackup_mcbbs.net.skip'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ghostchu/.local/bin/waybackup", line 8, in <module>
sys.exit(main())
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/main.py", line 39, in main
archive.skip_close(skipfile, skipset) if args.skip else None
UnboundLocalError: local variable 'skipfile' referenced before assignment
I know, the dev branch is unstable and not done yet.
Nearly done. The last thing I'm trying to do is reproduce the encoding exception.
I don't know if this helps, but while on the bug-cleanup branch I encountered the same error, only with the ASCII codec:
Downloading: 0%|░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 2532/640281 [02:01<2:45:17, 64.30 snapshot/s]
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 189, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect, skipset)
File "/home/ghostchu/.local/lib/python3.8/site-packages/pywaybackup/archive.py", line 221, in download
connection.request("GET", download_url, headers=headers)
File "/usr/lib/python3.8/http/client.py", line 1256, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1267, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.8/http/client.py", line 1105, in putrequest
self._output(self._encode_request(request))
File "/usr/lib/python3.8/http/client.py", line 1185, in _encode_request
return request.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 100-108: ordinal not in range(128)
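The traceback points at `http.client`, which encodes the whole request line as ASCII, so a snapshot URL containing raw non-ASCII characters (positions 100-108 here) fails before the request is even sent. Percent-encoding the path and query before calling `connection.request` avoids this. A minimal sketch, assuming the helper name `ascii_safe` (it is not part of pywaybackup):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def ascii_safe(url: str) -> str:
    # Percent-encode non-ASCII characters in path and query so the request
    # line survives http.client's request.encode('ascii').
    s = urlsplit(url)
    return urlunsplit((
        s.scheme,
        s.netloc,
        quote(s.path, safe="/:%"),   # keep "/" and ":" so wayback-style
        quote(s.query, safe="=&%"),  # nested URLs stay readable
        s.fragment,
    ))
```

`quote` leaves unreserved ASCII characters untouched, so already-clean URLs pass through unchanged.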
which snapshot or domain was requested?
mcbbs.net
I've noticed a design mistake in the skip feature: the skipset isn't saved to a file unless the program exits normally, so all skip data is lost when the program crashes (which makes it useless for me, since I always have to Ctrl+C out once the program stops responding).
I think the skipset should be saved to a file after every N URLs so a crash doesn't lose progress.
I'm currently experimenting with custom exception handling, because the overwhelming tracebacks are driving me nuts. However, I'd like to keep writing the logfiles in the finally block, since that keeps I/O to an absolute minimum.
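One way to reconcile low I/O with crash safety is to append skipped URLs in batches: flush to disk every N entries instead of only on exit, so a crash or Ctrl+C loses at most the last batch. An illustrative sketch, not pywaybackup's actual API:

```python
class SkipSet:
    """Append skipped URLs to disk every `flush_every` entries (illustrative)."""

    def __init__(self, path, flush_every=100):
        self.flush_every = flush_every
        self.pending = []
        self.file = open(path, "a+")
        # reload entries persisted by a previous (possibly crashed) run
        self.file.seek(0)
        self.entries = {line.strip() for line in self.file if line.strip()}

    def add(self, url):
        if url not in self.entries:
            self.entries.add(url)
            self.pending.append(url)
            if len(self.pending) >= self.flush_every:
                self.flush()

    def flush(self):
        # write the batch and push it to the OS immediately
        self.file.write("".join(u + "\n" for u in self.pending))
        self.file.flush()
        self.pending.clear()

    def close(self):
        self.flush()
        self.file.close()
```

With `flush_every=100`, a run that dies mid-download still has all but the most recent batch on disk, while the steady-state I/O cost stays at one buffered write per N snapshots.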
I encoded the URL before downloading - maybe that did the trick.
Tested now with a 175,000-entry cdx query and downloaded 36,000 snapshots without an error. All issues here should be resolved; if you find another one, let me know.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 169, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect)
File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 228, in download
with open(download_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/root/downloader/waybackup_snapshots/bbs.eastsea.com.cn/static/image/common'
This one too:
Traceback (most recent call last):
File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 169, in download_loop
download_status = download(output, snapshot, connection, status, no_redirect)
File "/usr/local/lib/python3.8/site-packages/pywaybackup/archive.py", line 228, in download
with open(download_file, 'wb') as file:
IsADirectoryError: [Errno 21] Is a directory: '/root/waybackup_snapshots/bbs.tianmu.com/simple'
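Both tracebacks fail for the reason discussed above: a snapshot URL maps to an output path that an earlier snapshot already created as a directory (the ".../common" case), and `open(path, 'wb')` on a directory raises IsADirectoryError. One possible workaround, sketched below with a hypothetical helper name (this is not the fix the maintainer actually shipped, which regenerates the output from the redirected URL):

```python
import os

def resolve_output_path(path: str) -> str:
    # If an earlier snapshot already created this path as a directory
    # (e.g. ".../common" served both as a page and as a folder of images),
    # store the page body as index.html inside it instead of trying to
    # open the directory itself for writing.
    if os.path.isdir(path):
        return os.path.join(path, "index.html")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path
```

The reverse collision (a later snapshot needs ".../common/img.png" while ".../common" already exists as a file) would still need separate handling.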
If you have time, please take a look. Thank you!