cebtenzzre / tumblr-utils

A fork of tumblr-utils with Python 3 support, bug fixes, and lots of features I found useful.
GNU General Public License v3.0

Program stalls/can't download entire blog #8

Open ddescent opened 1 year ago

ddescent commented 1 year ago

Downloaded 3 days ago, have been trying since then and don't know what I'm doing wrong. I know pretty much nothing about Python or any coding language so this is all pretty new to me.

I've tried all of these variations of the command:

tumblr_backup.py -i --save-video --save-audio --tag-index blog-name
tumblr_backup.py --save-video --save-audio --tag-index blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year blog-name
tumblr_backup.py -i --save-video --save-audio --tag-index -p year-month blog-name
tumblr_backup.py --save-video --save-audio --tag-index -p year-month blog-name

The first two would work at first, but eventually result in a stall, with a message like "downloading 7000 to 7050" that never moved again. I saw people saying this would be fixed with the -p option, so I tried that. It worked for most of my blog (2016 to 2020), but I got the same stall once I tried 2021. So then I tried adding the month. After some frustration with the program telling me "Stopping backup: Incremental backup complete, 0 posts backed up", I took out the -i option and that seemed to work.

But now I am stuck again, this time on the message "Waiting for worker threads to finish." I don't know what's causing these stalls or how to fix them. I had seen some people saying it could be caused by the fancy/colored text offered in more recent Tumblr updates, but the post that seemed to stall one of my "year-month" attempts didn't have any of that; it was just an image.

cebtenzzre commented 1 year ago

If you can apply this patch, either by hand or with GNU patch (copy it to a text file, including the whitespace at the end, and run patch -Np1 -i /path/to/saved/patch in the same directory as tumblr_backup.py), it will tell me which threads are getting stuck and where, instead of just stopping at "Waiting for worker threads to finish".

This assumes 10 seconds should be enough for everything to finish, but if you're more patient you could try changing the number on the timeout = time.time() + 10 line to maybe 20 or 30 for a more accurate result.

diff --git a/tumblr_backup.py b/tumblr_backup.py
index d9fb4ea..292fbc7 100755
--- a/tumblr_backup.py
+++ b/tumblr_backup.py
@@ -1520,7 +1520,7 @@ class ThreadPool:
         self.queue = LockedQueue(threading.RLock(), max_queue)
         self.quit = threading.Event()
         self.abort = threading.Event()
-        self.threads = [threading.Thread(target=self.handler) for _ in range(thread_count)]
+        self.threads = [threading.Thread(target=self.handler, daemon=True) for _ in range(thread_count)]
         for t in self.threads:
             t.start()

@@ -1540,9 +1540,16 @@ class ThreadPool:
     def cancel(self):
         self.abort.set()
         no_internet.destroy()
+
+        import traceback
+        timeout = time.time() + 10
         for i, t in enumerate(self.threads, start=1):
             logger.status('Stopping threads {}{}\r'.format(' ' * i, '.' * (len(self.threads) - i)))
-            t.join()
+            t.join(max(1, timeout - time.time()))
+        for t in self.threads:
+            if t.is_alive():
+                print(t, 'is stuck')
+                traceback.print_stack(sys._current_frames()[t.ident])

         logger.info('Backup canceled.\n')
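
In case it helps to see the idea outside the diff, here is a minimal standalone sketch of the same diagnostic, separate from tumblr_backup.py (worker() and the 10-second budget are made up for the example): start daemon threads, join each one against a shared deadline, then dump the stack of anything still alive.

# Standalone illustration of the diagnostic used in the patch above;
# worker() is a stand-in for a request that never returns.
import sys
import threading
import time
import traceback

def worker():
    time.sleep(60)  # simulates a stuck network call

threads = [threading.Thread(target=worker, daemon=True) for _ in range(3)]
for t in threads:
    t.start()

deadline = time.time() + 10
for t in threads:
    t.join(max(1, deadline - time.time()))
for t in threads:
    if t.is_alive():
        print(t, 'is stuck')
        traceback.print_stack(sys._current_frames()[t.ident])
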
ddescent commented 1 year ago

Thank you for your response! Unfortunately, I haven't been able to recreate the issue because a new one has come up. The program will now get stuck with the message "DNS probe finished: No internet. Waiting...o finish", which is confusing because it says this despite my computer being connected to the internet and being able to load websites. (Sorry that it said I marked this as completed; I am apparently bad with websites too and accidentally marked that, haha)

cebtenzzre commented 1 year ago

Hm, that's weird. That would imply that your computer is somehow unable to reach Google DNS (8.8.8.8), which the script checks if a web request failed in case you don't have internet. Can you ping 8.8.8.8 ok? What about dig google.com @8.8.8.8 (Linux/macOS) or nslookup google.com 8.8.8.8 (Windows)?

ddescent commented 1 year ago

I didn't have any issues pinging/connecting to 8.8.8.8 with those commands.

cebtenzzre commented 1 year ago

For now you can bypass the check by adding a line to is_dns_working in util.py, like this:

 util.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/util.py b/util.py
index 3bbd5c3..dfef1dc 100644
--- a/util.py
+++ b/util.py
@@ -97,6 +97,7 @@ DNS_QUERY = b'\xf1\xe1\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x06google\x03com\

 def is_dns_working(timeout=None):
+    return True
     try:
         with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
             if timeout is not None:

I haven't decided what to do about this yet. I suppose having a way to specify an alternate DNS server or disable the feature entirely might be useful if Google DNS isn't available. I can't think of any reason why dig or nslookup would succeed but the check in the script would fail, unless your internet connection is so slow that it takes more than 5 seconds to get a reply - maybe an option to change the timeout would help?
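
If it helps to picture it, here is a rough sketch (not committed code) of what a configurable is_dns_working could look like, reusing the existing DNS_QUERY constant from util.py; the server and timeout defaults below are just illustrative:

# Rough sketch only, not committed code: is_dns_working with a configurable
# DNS server and timeout. DNS_QUERY is the existing constant in util.py.
import socket

def is_dns_working(timeout=5, server='8.8.8.8'):
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(DNS_QUERY, (server, 53))  # port 53 = standard DNS
            sock.recv(512)  # any reply at all means the server is reachable
        return True
    except OSError:
        return False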

cebtenzzre commented 1 year ago

I just pushed 91d872a5b4e397206de30ea87caec4757c3374a0 which provides a --skip-dns-check option you can use to work around that issue. Let me know if you run into anything else.
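
For example (reusing the options from your original report, assuming they are combined in the usual way), the invocation would become:

tumblr_backup.py --skip-dns-check -i --save-video --save-audio --tag-index blog-name
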

Demirath commented 6 months ago

I might be able to add more context, as it seems to be specific posts which throw the DNS error for me. A specific .[post id].html.[string] file will refuse to download after the error is thrown. When I wait for all the other queued files to finish (so I can tell which one it is), get the post id, delete my reblog from Tumblr, and rerun, it continues until it hits the next one. I'm unsure what the posts have in common, but this one threw the error twice, once in a 2022 reblog and once in a 2021 reblog: https://www.tumblr.com/bunjywunjy/669018562974957568/petermorwood-caitlynlynch-the1920sinpictures

cebtenzzre commented 6 months ago

I might be able to add more context, as it seems to be specific posts which throw the DNS error for me.

This is known - the script only attempts to check for a working internet connection when some network request fails. I had assumed that basically everyone with a working internet connection should be able to send a DNS query to Google, but apparently this is not true - some people are simply unable to e.g. dig google.com @8.8.8.8 (Linux/Mac) or nslookup google.com 8.8.8.8 (Windows), despite otherwise having functioning internet access.

I think the only reason this DNS request would (falsely) fail would be if your internet connection is aggressively firewalled, e.g. because you are using a VPN client that tries to prevent leaks of DNS traffic onto the public internet. Does that apply to you?

I suppose this should be changed to a simple HTTP request - perhaps a HEAD request to Tumblr's homepage.
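
Something along these lines, purely as a sketch of the idea rather than code that exists in the repo (the URL and timeout are placeholders):

# Sketch of the proposed replacement check: a HEAD request to Tumblr's
# homepage instead of a raw DNS query to 8.8.8.8. Not the project's code.
import http.client

def is_internet_working(timeout=5):
    conn = http.client.HTTPSConnection('www.tumblr.com', timeout=timeout)
    try:
        conn.request('HEAD', '/')
        conn.getresponse()
        return True  # any HTTP response at all means the network is up
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()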

Demirath commented 6 months ago

No, as far as I know my internet connection is completely VPN-free.

crispin-cas9 commented 6 months ago

I'm currently having the same problem as OP originally had when I try to back up my blog - it stalls at around 7700/51000. No DNS error messages on my end though. I assume it must be getting stuck on a particular post. Any thoughts on how I could try to bypass it? Would the same fixes suggested earlier in the thread be worth trying?

hibiscera commented 6 months ago

Also seconding having the same problem as OP: my backup is getting consistently stuck at 25200/33725, all four times I've tried to back up the blog! I also tried by year and immediately got the stall once I tried 2012.

aureliawisenri commented 6 months ago

Also having the same problem: on two of my sub-1k post sideblogs, everything was fine, but when I moved on to the first of my more moderately sized sideblogs, it started consistently stalling at 2250 to 2299 (of 4449 expected).