bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

Tag list not being generated #183

Closed rominatrix closed 5 years ago

rominatrix commented 5 years ago

After almost two days, tumblr_backup.py finished generating the backup of my 87k-post blog (resulting in a 141 GB directory). When I tried opening my tag list index, the index.html file that should be created inside the "tags" directory wasn't there. I've also noticed that a remarkably small number of tags (only 104) were saved in the "tags" directory.

Before making this huge backup, I had first tested it with the --no-reblog flag (resulting in a 16 GB directory), and the tag list index was generated properly.

Now, I'm assuming (though maybe that's not the case) that the problem has to do with the fact that a lot of my tags contain non-ASCII characters. For example, the last tag directory that was saved looks like this:

%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21

I didn't save STDERR to a file, so it ended up on my screen instead; the last error shown there looks like this:

IOError: [Errno 2] No such file or directory: 'D:\\backup\\posts\\everything\\tags\\%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21%21\\archive\\2014-08-p1.html'

(btw, those tags are correct, it's just a bunch of !!!!)
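(For context: a minimal Python sketch, not code from tumblr_backup.py, of how that directory name comes about; the numbers are only illustrative:)

    # A tag made of 58 '!' characters percent-encodes to a 174-character
    # directory name (each '!' becomes '%21').
    from urllib.parse import quote  # Python 3; on Python 2 the equivalent is urllib.quote

    tag = "!" * 58
    encoded = quote(tag, safe="")
    print(len(encoded))  # 174

    # The failing path from the traceback, rebuilt from its parts:
    path = "D:\\backup\\posts\\everything\\tags\\" + encoded + "\\archive\\2014-08-p1.html"
    print(len(path))     # 230 characters in total

That single component (174 chars) is under NTFS's 255-characters-per-component limit, and the whole path (230 chars) is under the classic 260-character Win32 MAX_PATH, so the encoding alone doesn't obviously explain the failure; see the length discussion further down.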

I'm not sure what the problem could be. I have no way of removing those tags, because across 87k posts there are a lot of tags like that, and I don't know what is causing this. I have saved all the JSON files, if that helps. I'd be willing to run the script again just to regenerate the tag list, if that's possible (maybe by commenting out the part of the code that saves posts?).

Thanks in advance.

cebtenzzre commented 5 years ago

Perhaps create a fresh backup with the patch from this comment applied.

rominatrix commented 5 years ago

Perhaps create a fresh backup with the patch from this comment applied.

I've just checked, and I had already applied that patch before running it. (screenshot: 2018-12-14_115414)

cebtenzzre commented 5 years ago

Hm, then maybe the patch is at fault. @ecm-pushbx do you see anything obvious that could be causing this?

ecm-pushbx commented 5 years ago

@rominatrix Please create a small test blog containing entries with the kinds of tags that are causing your problems.

rominatrix commented 5 years ago

@ecm-pushbx I have probably thousands of different tags, mostly just keysmashing, so I have no idea which one is causing this error. I can generate a small test blog with different types of non-ASCII tags, but I still could not replicate it. I'm not sure if I can debug it; maybe it would be easier if I could just skip the parts of the script that download images etc. and leave only the tags part? But I would need some help with that.

ecm-pushbx commented 5 years ago

It might be due to the tag length combined with the length of the directory path you're storing into. There are some maximums there, around 255 bytes I think.
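(A small illustrative check of those limits; check_windows_path is a hypothetical helper, not part of tumblr-utils. The usual numbers on Windows are 260 characters for the full path under classic Win32 APIs and 255 characters per NTFS path component:)

    # Illustrative sketch: check a candidate path against the usual
    # Windows limits.
    import ntpath

    MAX_PATH = 260        # classic Win32 limit for the full path
    MAX_COMPONENT = 255   # NTFS limit per directory or file name

    def check_windows_path(path):
        """Return the reasons (if any) this path may fail on Windows."""
        problems = []
        if len(path) >= MAX_PATH:
            problems.append("full path is %d chars (limit %d)"
                            % (len(path), MAX_PATH))
        drive, rest = ntpath.splitdrive(path)
        for part in rest.split("\\"):
            if len(part) > MAX_COMPONENT:
                problems.append("component %r... is %d chars (limit %d)"
                                % (part[:15], len(part), MAX_COMPONENT))
        return problems

Fed the path from the traceback above, this reports no problems (230 chars total, 174 for the tag component), so if length is the cause it would have to involve a longer tag or a deeper backup directory.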

I'm not sure if I can debug it; maybe it would be easier if I could just skip the parts of the script that download images etc. and leave only the tags part? But I would need some help with that.

You'll have to ask @bbolli or someone else more familiar with the operation of the script. I don't know either, I'm just hacking around some.

bbolli commented 5 years ago

You can always regenerate the complete indices by backing up just one post with -n1 --tag-index. To build the indices (both of them), the posts are read from disk and the tags are parsed from the file contents; otherwise, an incremental backup wouldn't create the whole index.
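(As a command line, that suggestion would look roughly like the following; "blogname" stands in for the blog's name, and only the options bbolli names above are used:)

    # Back up just one post (-n1) and rebuild both indices from the posts
    # already on disk (--tag-index); "blogname" is a placeholder.
    python tumblr_backup.py -n1 --tag-index blogname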

rominatrix commented 5 years ago

@bbolli thank you, I will try that!

elendraug commented 5 years ago

@bbolli thank you, I will try that!

Did it work? I'm dealing with a similar problem, and unfortunately I'm not familiar enough to apply the patch on my own. I started running a backup overnight with a version of tumblr-utils I downloaded last week (a mistake on my part).

If I've already successfully downloaded the blog locally, would I need to download the newest version of the code and start over entirely (downloading from Tumblr fresh), or do I need to move the downloaded contents into the new version's folder and then re-run it to regenerate the indices as described above? (Or maybe move the new code into the existing folder instead?) I'm hoping I can solve this locally rather than re-download 50 GB of content. I'm just hesitant to restart the process without feeling confident about what I'm doing on the command line.

Thank you everyone for all your hard work on this.

bbolli commented 5 years ago

Probably fixed by #140