bitdruid / python-wayback-machine-downloader

Query and download archive.org as simply as possible.
MIT License
24 stars · 1 fork

Bug/Crash #22

Open grigzy28 opened 2 weeks ago

grigzy28 commented 2 weeks ago

Windows 11 OS

Just tried this and received the following error. Empty output directory.


./waybackup -d --csv -u http://wuarchive.wustl.edu/pub/ -o .\test12 -f --workers 1 --skip --delay 1

No CSV-file or content found to load skipable URLs

Querying snapshots... ---> wuarchive.wustl.edu/pub/*

!-- Exception: UNCAUGHT EXCEPTION
!-- File: ..............\Program Files\Python312\Lib\json\decoder.py
!-- Function: raw_decode
!-- Line: 355
!-- Segment: raise JSONDecodeError("Expecting value", s, err.value) from None
!-- Description: Expecting value: line 1 column 1 (char 0)

Exception log: .\test12\waybackup_error.log

Full traceback:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "waybackup.exe\__main__.py", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "..\site-packages\pywaybackup\main.py", line 22, in main
    archive.query_list(config.range, config.start, config.end, config.explicit, config.mode, config.cdxbackup, config.cdxinject)
  File "..\site-packages\pywaybackup\archive.py", line 158, in query_list
    cdxResult = json.loads(cdxResult)
                ^^^^^^^^^^^^^^^^^^^^^
  File "..............\Program Files\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "..............\Program Files\Python312\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "..............\Program Files\Python312\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
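The crash happens because `archive.py` calls `json.loads` directly on the CDX response, which was evidently empty or not JSON. A minimal guard (illustrative only, not the project's actual code) would surface the raw payload instead of a bare `JSONDecodeError`:

```python
import json

def parse_cdx(body: str):
    """Parse a CDX API response body, surfacing the raw payload on failure.

    Hypothetical helper: the real archive.py passes the response straight
    to json.loads().
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # An empty body or an HTML error page from the CDX endpoint lands
        # here instead of crashing with "Expecting value: line 1 column 1".
        raise ValueError(f"CDX returned non-JSON payload: {body[:200]!r}")
```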


waybackup_error.log

grigzy28 commented 2 weeks ago

Subsequent runs (that actually worked) got the following error...


command: ./waybackup -d --csv -u http://wuarchive.wustl.edu/pub/ -o .\test13 -f --workers 1 --skip --delay 1


-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [3880/670904] - Worker: 1 INCOMPLETEREAD -> (1/2): reconnect in 50 seconds...

!-- Exception: Worker: 1 - Exception
!-- File: ..............\Program Files\Python312\Lib\ssl.py
!-- Function: send
!-- Line: 1180
!-- Segment: return self._sslobj.write(data)
!-- Description: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Exception log: .\test13\waybackup_error.log

Full traceback:

Traceback (most recent call last):
  File "..\site-packages\pywaybackup\archive.py", line 231, in download_loop
    download_status = download(output, snapshot, connection, status_message, no_redirect)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "..\site-packages\pywaybackup\archive.py", line 271, in download
    response, response_data, response_status, response_status_message = download_response(connection, encoded_download_url, headers)
                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "..\site-packages\pywaybackup\archive.py", line 343, in download_response
    connection.request("GET", encoded_download_url, headers=headers)
  File "..............\Program Files\Python312\Lib\http\client.py", line 1336, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "..............\Program Files\Python312\Lib\http\client.py", line 1382, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "..............\Program Files\Python312\Lib\http\client.py", line 1331, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "..............\Program Files\Python312\Lib\http\client.py", line 1091, in _send_output
    self.send(msg)
  File "..............\Program Files\Python312\Lib\http\client.py", line 1055, in send
    self.sock.sendall(data)
  File "..............\Program Files\Python312\Lib\ssl.py", line 1211, in sendall
    v = self.send(byte_view[count:])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "..............\Program Files\Python312\Lib\ssl.py", line 1180, in send
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Files downloaded: 2655 Not downloaded: 668249

waybackup_error.log waybackup_http.wuarchive.wustl.edu.pub.csv

the following is from the test12 run that worked after the very first error in the first post:

waybackup_error.log waybackup_http.wuarchive.wustl.edu.pub.csv

grigzy28 commented 2 weeks ago

actually, upon inspection, it appears that both runs ended at the same spot; the files are identical

bitdruid commented 2 weeks ago

hm, ssl.SSLZeroReturnError doesn't seem like a problem within the code... can you get the exact snapshot URL which causes this error? then I could dive a bit into investigating.

and also maybe try updating via pip to 1.5.0

grigzy28 commented 2 weeks ago

Ok, looked to see if I can find the URL but it's not in the csv nor the window.

Also, why does it download the files twice? In the CSV it shows that each file is downloaded twice with a status of 200 OK...

And will try the newest version that you just put out.

grigzy28 commented 2 weeks ago

I also just noticed that the delay function isn't applied to the failed 404/301 URLs; it appears to only work with the 200-status ones.


Not Working Delay:

-----> Attempt: [1/1] Snapshot [801/670904] - Worker: 1 UNEXPECTED -> HTTP : 301 - Moved Permanently -> URL : https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme FAILED -> : append to failedurls: https://web.archive.org/web/20101007093523id/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme

-----> Attempt: [1/1] Snapshot [802/670904] - Worker: 1 UNEXPECTED -> HTTP : 301 - Moved Permanently -> URL : https://web.archive.org/web/20101007093523id_/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme FAILED -> : append to failedurls: https://web.archive.org/web/20101007093523id/http://wuarchive.wustl.edu:80/pub/aminet/pix/mwb/samwb6.2.readme


Working Delay:

-----> Attempt: [1/1] Snapshot [1137/670904] - Worker: 1 SUCCESS -> HTTP : 200 - OK -> URL : https://web.archive.org/web/20100214085623id_/http://wuarchive.wustl.edu:80/pub/fedora10/media.repo -> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test14\wuarchive.wustl.edu\20100214085623\pub\fedora10\media.repo

-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [1138/670904] - Worker: 1 EXISTING -> HTTP : 200 - OK -> URL : https://web.archive.org/web/20100214085623id_/http://wuarchive.wustl.edu:80/pub/fedora10/media.repo -> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test14\wuarchive.wustl.edu\20100214085623\pub\fedora10\media.repo

-----> Worker: 1 - Delay: 1 seconds


bitdruid commented 2 weeks ago

to the delay: currently the logic is that there's a 15-second timeout anyway before a retry. that's why I left the delay only for successful downloads. do you think it would be better to include it for any status?
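Applying the delay to every status rather than only to successful downloads could be sketched like this (illustrative names, not the project's actual API):

```python
import time

# Hypothetical sketch: sleep after every attempt, regardless of HTTP
# status, instead of only after successful (200) downloads. This also
# throttles the immediate 301/404 responses that currently skip the delay.
def download_all(snapshots, download_one, delay=1):
    statuses = []
    for snap in snapshots:
        statuses.append(download_one(snap))  # may return 200, 301, 404, ...
        time.sleep(delay)  # pause even after an immediate failure response
    return statuses
```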

for the duplicate downloads:

check the cdx response manually:

https://web.archive.org/cdx/search/cdx?output=json&url=wuarchive.wustl.edu/pub/*&fl=timestamp,digest,mimetype,statuscode,original&limit=5&filter!=statuscode:200

so for timestamp 19980123002752 there are 2 digests (archive thinks they are not the same).

for timestamp 19970101083806, however, the digests are the same. so this seems to be a problem with the CDX response. funnily, the param showDupeCount=true advised by archive.org to remove duplicates from the result does not work...

https://web.archive.org/cdx/search/cdx?output=json&url=wuarchive.wustl.edu/pub/*&fl=timestamp,digest,mimetype,statuscode,original&limit=5&showDupeCount=true&filter!=statuscode:200

bitdruid commented 2 weeks ago

so i added a filter:

if a snapshot has the same TIMESTAMP & URL, duplicates are removed.

however i don't know why the cdx server responds with duplicates...
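A sketch of such a filter (field order assumed to match the CDX queries above: timestamp, digest, mimetype, statuscode, original; not the actual patch):

```python
def dedupe_snapshots(rows):
    """Drop CDX rows that repeat an earlier (timestamp, original URL) pair."""
    seen = set()
    unique = []
    for row in rows:  # row: [timestamp, digest, mimetype, statuscode, original]
        key = (row[0], row[4])  # same TIMESTAMP & URL -> duplicate
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```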

grigzy28 commented 2 weeks ago

to the delay. currently the logic is, that there is a 15 seconds timeout anyway for a retry. thats why i left the delay only for successful downloads. you think it would be better to include it into any status?

When I was watching it to get the URL that was failing for you earlier, it wasn't pausing the 15 seconds: there wasn't a timeout on those 301/404 codes, they were immediate responses, not timeouts. I didn't really mean to inform you about the duplicates; that just happened. Honestly, I didn't even know I had pasted that, because I was showing the delay function. :)

Thanks, will try the latest commit now.

grigzy28 commented 2 weeks ago

Also, I added --debug to the command but it still didn't give the URL of the original problem noted initially about the SSL/TLS EOF error... Going to see if the latest commit happens to have corrected it.

bitdruid commented 2 weeks ago

The --debug command was removed in 1.5.0, just so you know

grigzy28 commented 2 weeks ago

Thanks. I also just updated the install; however, it isn't putting waybackup.exe in the Scripts folder like it used to. Is that something in your installer or something else?

bitdruid commented 2 weeks ago

sorry, i'm not on windows, but when i was debugging on win i just created a virtual env and installed it inside that via pip

grigzy28 commented 2 weeks ago

that's what I did/do, but for some reason it's not creating waybackup.exe this time

oh... I just found out it moved from the appdata folder to the Program Files Scripts folder... strange

grigzy28 commented 2 weeks ago

Okay, running 1.5.1, I get this now


PS C:\users\shawn\appdata\roaming\python\Python312\Scripts> ./waybackup.exe --csv -u http://wuarchive.wustl.edu/pub/ -o .\test15 -f --workers 1 --skip --delay 1

No CSV-file or content found to load skipable URLs

Querying snapshots... -----> wuarchive.wustl.edu/pub/* -----> Downloading CDX result: 12.6MB [03:00, 69.7kB/s]

!-- Exception: UNCAUGHT EXCEPTION
!-- File: ..\site-packages\requests\models.py
!-- Function: generate
!-- Line: 818
!-- Segment: raise ChunkedEncodingError(e)
!-- Description: ('Connection broken: IncompleteRead(7451 bytes read, 741 more expected)', IncompleteRead(7451 bytes read, 741 more expected))

Exception log: .\test15\waybackup_error.log

waybackup_error.log waybackup_http.wuarchive.wustl.edu.pub.cdx.txt

grigzy28 commented 2 weeks ago

Updated Python from 3.12.4 to 3.12.5 and it started working correctly so far. I think my internet may have been slow as well for that error above. Will keep you updated when this last test finishes.

grigzy28 commented 2 weeks ago

Okay, here's the results, same TLS/SSL issue but attached are the data files.


-----> Attempt: [1/1] Snapshot [1936/335537] - Worker: 1 SUCCESS -> HTTP : 200 - OK -> URL : https://web.archive.org/web/20081011175454id_/http://wuarchive.wustl.edu/pub/aminet/comm/xeno/frqsta11.readme -> FILE : C:\users\shawn\appdata\roaming\python\Python312\Scripts\test16\wuarchive.wustl.edu\20081011175454\pub\aminet\comm\xeno\frqsta11.readme

-----> Worker: 1 - Delay: 1 seconds

-----> Attempt: [1/1] Snapshot [1937/335537] - Worker: 1 INCOMPLETEREAD -> (1/2): reconnect in 50 seconds...

!-- Exception: Worker: 1 - Exception
!-- File: ..............\Program Files\Python312\Lib\ssl.py
!-- Function: send
!-- Line: 1180
!-- Segment: return self._sslobj.write(data)
!-- Description: TLS/SSL connection has been closed (EOF) (_ssl.c:2406)

Exception log: .\test16\waybackup_error.log

Files downloaded: 1312 Not downloaded: 334225


test16.zip

bitdruid commented 2 weeks ago

And the reconnect does not work? The exception http.client.IncompleteRead should be a subclass of the already-caught http.client.HTTPException.

i tried in a win vm and had no issues so far. downloading without any problems
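The class relationships can be checked directly. Notably, the `ssl.SSLZeroReturnError` from the logs descends from `OSError`, not from `HTTPException`, which may be why it slips past that handler:

```python
import http.client
import ssl

# IncompleteRead IS caught by an `except http.client.HTTPException` handler:
print(issubclass(http.client.IncompleteRead, http.client.HTTPException))  # True

# ...but the SSL error from the logs is NOT an HTTPException:
print(issubclass(ssl.SSLZeroReturnError, http.client.HTTPException))  # False

# It descends from OSError (via ssl.SSLError), so a separate except
# clause is needed to catch it.
print(issubclass(ssl.SSLZeroReturnError, OSError))  # True
```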

grigzy28 commented 2 weeks ago

I just restarted it on the test16 folder with the csv and it tried again at file 1937 to download and it's doing the same thing. I disabled all AV just in case that was causing a connection issue. Not really sure what's going on.

bitdruid commented 2 weeks ago

strange. but okay, give me some time. i decided to redesign the whole retry logic to include such strange exceptions and create a new connection when they occur. this MAY solve one or two issues...

however, retry is not working as intended since i implemented the queue
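The redesigned retry described above might look roughly like this (a sketch with hypothetical names, not the actual patch): catch SSL-level and connection errors alongside `HTTPException`, then retry the request, rebuilding the connection inside `do_request` each time.

```python
import http.client
import ssl
import time

# Hypothetical sketch of a broader retry wrapper. `do_request` is assumed
# to create a fresh connection on each call, so a retry after a TLS EOF
# never reuses the dead socket.
def request_with_retry(do_request, retries=2, backoff=0):
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return do_request()
        except (http.client.HTTPException, ssl.SSLError, ConnectionError) as e:
            last_exc = e
            time.sleep(backoff)  # e.g. the "reconnect in 50 seconds..." pause
    raise last_exc  # all attempts exhausted
```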

bitdruid commented 2 weeks ago

i patched dev. you could build it from dev and have a try if your exception gets catched and retried properly. still BETA of course :)