TumblThreeApp / TumblThree

A Tumblr and Twitter Blog Backup Application
https://TumblThreeApp.github.io
MIT License

Doesn't dl images in the reblog of an answer post #533

Open rduwjjnh opened 4 months ago

rduwjjnh commented 4 months ago

Here's an example post: https://www.tumblr.com/fruitegg/659014555856437248/

My settings: [screenshot]

I found the 659014555856437248 post only in answers.txt, but its content is the same as that of the original answer post (614541567356698624), and only the image from the 614541567356698624 post is downloaded. [screenshot]

It seems to work correctly with the reblogs of other types of posts.

Expected behavior: I expect it to parse the content of the reblog of an answer post and download all images.

thomas694 commented 4 months ago

That's by design. Files that have already been downloaded are not downloaded again within the same blog (or globally, if that setting is enabled). You notice the skipped image download because you build unique filenames (e.g. with %i). If you enable "dump crawler data", you'll see that both posts are actually processed/downloaded, but duplicate media is skipped.
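To illustrate the skip-duplicates behavior described above, here is a minimal Python sketch. It is not TumblThree code: the `DownloadLog` class, the `plan_downloads` function, the post IDs, and the URL are all hypothetical, and it simply assumes duplicates are detected by media URL.

```python
class DownloadLog:
    """Remembers which media URLs were already fetched (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def should_download(self, url):
        # Return True only the first time a given URL is offered.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def plan_downloads(posts):
    """posts: iterable of (post_id, media_url) pairs, in crawl order.

    Returns the pairs that would actually be downloaded; a later post
    that references an already-seen URL produces no file of its own.
    """
    log = DownloadLog()
    return [(post_id, url) for post_id, url in posts if log.should_download(url)]


# Hypothetical data: an original post and its reblog share one image.
shared = "https://example.invalid/media/a.png"
plan = plan_downloads([("111", shared), ("222", shared)])
```

With per-post filenames, only the first post in crawl order gets a file here, which is why a search for the second post's ID can come up empty even though that post was processed.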

rduwjjnh commented 4 months ago

Yes, both posts are processed, but the content added in the reblog is skipped completely; that is what I showed in the second screenshot. These are the JSON files I get when dumping crawler data is enabled. The content of both is exactly the same, that of the original post, whereas the reblog should contain the new text and images:

Original post: 614541567356698624.json
Reblog: 659014555856437248.json

In comparison, if I take a reblog of another type of post, for example https://www.tumblr.com/fruitegg/685938465659060224/, the JSON of the reblog contains the new text and images, and all images are downloaded, as I expect:

Original post: 685828894320984064.json
Reblog: 685938465659060224.json

I understand that it skips duplicates, but those images weren't downloaded even once. If I search 659014555856437248, I only find that one image that's in the original answer post, and if I search 614541567356698624, there are no images with that post id. There are also no occurrences of the original links (64.media.tumblr.com/*) of the images from the reblog in any text files.

thomas694 commented 4 months ago

Ok, now I see it. Comparing the JSONs with what the browser shows reveals the problem: for the reblogged answer, Tumblr renders more on its own HTML page than it returns in its data structure.

JSON

```

working full time doing backgrounds for an animation studio. mildly amusing given that i can count on maybe one hand the number of backgrounds ive completed in personal drawings

beyond that ive been doing embroidery and listening to a lot of united states chemical safety board videos. recently ive also gotten into eurobeat.

psd, progress gif on patreon

```

HTML

```

fruitegg:

working full time doing backgrounds for an animation studio. mildly amusing given that i can count on maybe one hand the number of backgrounds ive completed in personal drawings

beyond that ive been doing embroidery and listening to a lot of united states chemical safety board videos. recently ive also gotten into eurobeat.

image

psd, progress gif on patreon

prequel to dis pic

image
image
```

In this particular case, the images cannot be downloaded because we parse the data structure and not the HTML page. Maybe they'll fix this error one day.
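
One way to check for this kind of gap is to compare the image URLs in the rendered page with those in the dumped data. The following is a minimal Python sketch, not TumblThree code: `ImgCollector`, `image_urls`, `missing_from_api`, and the sample fragments are hypothetical, and it assumes the dumped JSON carries the post body as an HTML fragment.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def image_urls(fragment):
    """Return all image URLs found in an HTML fragment."""
    collector = ImgCollector()
    collector.feed(fragment)
    collector.close()
    return collector.urls


def missing_from_api(api_body_html, rendered_page_html):
    """Images visible on the rendered page but absent from the API body."""
    api_urls = set(image_urls(api_body_html))
    return [u for u in image_urls(rendered_page_html) if u not in api_urls]


# Hypothetical fragments standing in for the dumped JSON body and the page.
api_body = '<p>answer text</p><img src="https://example.invalid/a.png">'
page = (
    '<p>answer text</p><img src="https://example.invalid/a.png">'
    '<p>added in reblog</p><img src="https://example.invalid/b.png"/>'
)
missing = missing_from_api(api_body, page)
```

Any URL returned by `missing_from_api` is one that a crawler reading only the data structure would never see, which matches the behavior reported in this issue.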