Some of the improvements I've made to this for own use but I like to share it

Ghost-chu commented 5 months ago

This PR should definitely not be merged straight away (it is missing tests, does not follow the coding style, and still contains some leftover debugging code), but I hope it will help improve the project.

The main improvements are as follows:

When a worker crashes, the batch of snapshots it was responsible for is lost and cannot be taken over by other workers. Solved by refactoring to a FIFO queue.
When the program gets stuck, an unclean exit can leave the skipfile unsaved. Solved by saving it periodically.
Checking the skipset inside the download method was very slow (due to things like Python's GIL and progress bar updates), so the skipset is now checked before the workers start. (Before: 129232 skips handled in 27 minutes. After: 475784 skips handled in under 30 seconds.)
While checking the skipset, files that already exist in the local file system are now also detected and added to the skipset. In addition to improving performance, this lets you quickly recover the skips if the skipfile gets corrupted.
Improved the progress bar code to allow fast-forwarding multiple steps at once.
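
To illustrate the first point, here is a minimal sketch of the shared-queue idea. It is not the code from this PR: `run_workers`, `download`, and `max_attempts` are hypothetical names, and it assumes a failed download surfaces as an exception. Because no snapshot is pre-assigned to a worker, anything that fails goes back onto the queue and can be picked up by whichever worker is free.

```python
import queue
import threading


def run_workers(snapshots, download, num_workers=4, max_attempts=3):
    """Drain one shared FIFO queue with several worker threads.

    `download` is a hypothetical callable that raises on failure.
    Returns (done, failed) lists of snapshots.
    """
    tasks = queue.Queue()  # queue.Queue is FIFO and thread-safe
    for snapshot in snapshots:
        tasks.put((snapshot, 1))  # (item, attempt number)

    done, failed = [], []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                snapshot, attempt = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                download(snapshot)
            except Exception:
                if attempt < max_attempts:
                    # Hand the snapshot back so any worker can retry it.
                    tasks.put((snapshot, attempt + 1))
                else:
                    with lock:
                        failed.append(snapshot)
            else:
                with lock:
                    done.append(snapshot)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done, failed
```

A worker that re-queues an item keeps looping, so a re-queued snapshot is never stranded even if the other workers have already exited.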

bitdruid commented 5 months ago

please review my implementations as i wanted to do less deep changes in the code-logic

Ghost-chu commented 5 months ago

please review my implementations as i wanted to do less deep changes in the code-logic

Why not process the skipset before the workers start processing snapshots? Processing the skipset in download generates a lot of method calls and a lot of v.write calls, and Python's text output performance is not good enough for that.

Considering that in my patch, processing it earlier resulted in a time savings of over 30 minutes, it seems well worth it.
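
Roughly what I mean, as a sketch rather than the actual patch: `split_pending` is a hypothetical helper, and it assumes each snapshot can be represented as a `(url, relative_path)` pair and that the skipset is a plain `set` of URLs. Everything is filtered in one pass before any worker starts, and files already on disk are added to the skipset along the way.

```python
from pathlib import Path


def split_pending(snapshots, skipset, output_dir):
    """Filter snapshots once, before any worker starts.

    snapshots  -- iterable of (url, relative_path) pairs (assumed shape)
    skipset    -- set of URLs already handled; updated in place
    output_dir -- directory where downloads are stored

    Returns (pending, skipped_count).
    """
    output_dir = Path(output_dir)
    pending = []
    skipped = 0
    for url, rel_path in snapshots:
        if url in skipset:
            skipped += 1
        elif (output_dir / rel_path).is_file():
            # Already on disk: remember it so a lost or corrupted
            # skipfile can be rebuilt from the file system.
            skipset.add(url)
            skipped += 1
        else:
            pending.append((url, rel_path))
    return pending, skipped
```

The workers then only ever see `pending`, and the progress output can be advanced by `skipped_count` in a single step instead of once per skipped snapshot.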

bitdruid commented 5 months ago

please review my implementations as i wanted to do less deep changes in the code-logic

Why not process the skipset before the workers start processing snapshots? Processing the skipset in download generates a lot of method calls and a lot of v.write calls, and Python's text output performance is not good enough for that.

Considering that in my patch, processing it earlier resulted in a time savings of over 30 minutes, it seems well worth it.

forgot about that one sorry. you are right

Ghost-chu commented 5 months ago

please review my implementations as i wanted to do less deep changes in the code-logic

Why not process the skipset before the workers start processing snapshots? Processing the skipset in download generates a lot of method calls and a lot of v.write calls, and Python's text output performance is not good enough for that. Considering that in my patch, processing it earlier resulted in a time savings of over 30 minutes, it seems well worth it.

forgot about that one sorry. you are right

Hi! A friendly reminder: don't forget to fast-forward the progress bar, otherwise it will look weird. Verbosity's write function only allowed me to increase the progress by 1, so I changed it to save a lot of function calls (by fast-forwarding all at once).

Based on my testing, it performs terribly if the progress bar is updated in a loop.
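
A minimal sketch of the difference, using a stand-in class rather than the project's Verbosity code: `Progress` and `advance` are hypothetical names, and the `renders` counter exists only to make the cost visible. Each call redraws the bar once, so advancing by N in a single call costs one write instead of N.

```python
import sys


class Progress:
    """Tiny stand-in progress counter (not the project's Verbosity class)."""

    def __init__(self, total, stream=sys.stderr):
        self.total = total
        self.current = 0
        self.renders = 0  # how many times the bar was redrawn
        self.stream = stream

    def advance(self, steps=1):
        """Move forward by `steps` and redraw exactly once."""
        self.current = min(self.total, self.current + steps)
        self.renders += 1
        self.stream.write(f"\r{self.current}/{self.total}")
        self.stream.flush()
```

With this shape, `bar.advance(skipped_count)` after the skipset check produces a single redraw, while a loop of `bar.advance()` calls produces one redraw per skipped snapshot.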

bitdruid / python-wayback-machine-downloader

Some of the improvements I've made to this for own use but I like to share it #10