downthemall / anticontainer

DownThemAll! AntiContainer (Extension to a Firefox, Seamonkey extension)
Mozilla Public License 2.0

Tumblr's Video Functionality #106

Closed MegaScience closed 8 years ago

MegaScience commented 8 years ago

Tumblr now hosts videos natively (I couldn't pin down when this was added; Google keeps returning people's posts instead of site news), so I wanted to expand the site's plugin to support them. This still isn't perfect, but it's what I have at the moment. If the page's meta elements mark it as a video page, the video URL is deduced from baseURL and the URL of the video frame image.

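(function() {
    try {
        // og:image is the photo itself, or the preview frame on video pages.
        var url = responseText.match(/<meta property="og:image" content="(.+?)"/)[1];
        // og:type looks like "tumblr-feed:video"; keep only the part after the colon.
        var type = responseText.match(/<meta property="og:type" content="tumblr-feed:(.+?)"/)[1];
        // The description is optional, so this match may be null.
        var name = responseText.match(/<meta property="og:description" content="(.*?)"/);
        if(type == "video") {
            // Build the video URL from the post ID in baseURL and the ID in the frame URL.
            url = "https://www.tumblr.com/video_file/" + baseURL.match(/\/post\/([0-9]+)/)[1] + "/tumblr_" + url.match(/\/tumblr_([a-zA-Z0-9]+)_frame/)[1];
        }
        if (name && name[1].length) {
            // Take the extension from the URL (query string removed), if it has one.
            var ext = url.replace(/\?.*$/, "").match(/\.[\w\d+]+$/);
            // Replace anything outside a-z, A-Z, 0-9, "_" and "-", then cap at 100 characters.
            name = name[1].replace(/[^a-zA-Z0-9_\-]/gi, "_").substring(0, 100) + ((ext && ext[0]) || (type == "video" ? ".mp4" : ".jpg"));
        }
        setURL(url, name);
    }
    finally {
        finish();
    }
})();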

I also added a filter for the filename, since page descriptions can contain invalid characters, which result in File Access errors. I'm not satisfied with this part in its current form - it only allows a-z, A-Z, 0-9, underscores, and hyphens, and doesn't properly decode HTML entities - but it works. It also trims the filename to 100 characters for safety's sake, even though filenames can technically be longer.

Of course, you can toss out my example and make your own iteration. The main thing I'm after is the ability to retrieve videos from video pages. I'd still like the filenames to be filtered of invalid characters as well, just better. There's no native JavaScript function for dealing with HTML entities, so I'd have to make the code far larger to handle them properly; it might be better as a native command in AntiContainer.
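As a rough illustration of what a more thorough filter might look like, here is a minimal sketch. `decodeEntities` and `safeFileName` are hypothetical names, not existing AntiContainer commands; the entity table covers only a handful of common named entities, and the set of rejected characters is an assumption based on what Windows file systems disallow.

```javascript
// Hypothetical helpers; neither exists in AntiContainer.
// Only a few common named entities are covered; the rest are left as-is.
var NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: "\"", apos: "'", nbsp: "\u00a0" };

function decodeEntities(text) {
    return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function(whole, body) {
        if (body.charAt(0) === "#") {
            // Numeric reference, either decimal (&#233;) or hexadecimal (&#xE9;).
            var hex = body.charAt(1).toLowerCase() === "x";
            var code = parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            // Leave out-of-range code points untouched rather than throw.
            return code <= 0x10FFFF ? String.fromCodePoint(code) : whole;
        }
        // Named reference; names are case-sensitive, so look them up exactly.
        return NAMED_ENTITIES.hasOwnProperty(body) ? NAMED_ENTITIES[body] : whole;
    });
}

function safeFileName(description, ext) {
    var name = decodeEntities(description)
        // Assumed set of characters to reject: \ / : * ? " < > | and control codes.
        .replace(/[\\\/:*?"<>|\x00-\x1f]/g, "_")
        .replace(/\s+/g, " ")
        .trim()
        .substring(0, 100);
    // Return null for an empty name so the caller can fall back to the default.
    return name ? name + ext : null;
}
```

Unlike the allow-list in the snippet above, this keeps accented letters and other characters that file systems generally accept, at the cost of a longer resolver.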

Edit: I did find that filenames are being filtered while looking through the code, although the default filter isn't comprehensive enough to deal with the description text it is taking. I've actually removed the naming portion of the code in my version, since this makes filenames unpredictable and could result in duplicate files with different names.

MegaScience commented 8 years ago

I've been working on a complete, improved version. The code has gotten quite big, sadly, but it now covers various possible scenarios for post pages. For instance, this iteration can evaluate video pages and call setURL with the correct path. I've also encountered mixed-content pages (type of video, but with additional images), so I send any images through queueDownload and call setURL for the video. I would use queueDownload for the video as well, but queueDownload's nameSuggestion parameter is currently broken, and it is required for videos because their URL does not contain an extension. (This issue is already reported separately.)

Tumblr pages have a script element containing JSON which can define paths for all images in the related post. It does not define the path to the video for video posts, so I compose it from the page URL and the URL of the provided video frame. I also have a fallback in case the object is ever changed or removed: it will then use the meta element's og:image, as in the original behavior. Any errors should be properly logged, so if this stops working, you can get at least a general idea of the issue by enabling DTA's logging.
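To make the composition step concrete, here is a minimal standalone sketch. `composeVideoURL` is a hypothetical helper, and the two sample URLs are made up; they are only shaped like what the regular expressions expect and are not taken from a real post.

```javascript
// Hypothetical helper mirroring the composition described above.
function composeVideoURL(pageURL, frameURL) {
    // Numeric post ID from the page URL, e.g. ".../post/123456789/...".
    var post = pageURL.match(/\/post\/([0-9]+)/);
    // Media ID from the preview frame URL, e.g. ".../tumblr_<id>_frame1.jpg".
    var frame = frameURL.match(/\/tumblr_([a-zA-Z0-9]+)_frame/);
    if (!post || !frame) {
        return null; // Either URL did not have the expected shape.
    }
    var name = "tumblr_" + frame[1];
    return {
        url: "https://www.tumblr.com/video_file/" + post[1] + "/" + name,
        // The video URL has no extension, so one is suggested for the file name.
        name: name + ".mp4"
    };
}

// Made-up sample values for illustration only.
var result = composeVideoURL(
    "http://example.tumblr.com/post/123456789/some-slug",
    "https://media.tumblr.com/tumblr_abc123XYZ_frame1.jpg"
);
// result.url  is "https://www.tumblr.com/video_file/123456789/tumblr_abc123XYZ"
// result.name is "tumblr_abc123XYZ.mp4"
```

Returning null instead of indexing the match directly lets the caller decide whether to log an error or fall back to og:image.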

I also removed the renaming portion, as it was problematic on various levels. The text it was pulling the name from could be extremely long and contain special characters. Filtering and trimming in the plugin would make the names unique to the downloader, so duplicates could form if the file were also downloaded directly, or if the plugin were later updated. It is easier to keep the actual filenames for consistency's sake.

(function() {
    try {
        // ogI: og:image meta (photo, or preview frame on video pages).
        // type: og:type with any "prefix:" removed, e.g. "video".
        // obj: contents of the JSON-LD script element, if the page has one.
        var ogI = responseText.match(/<meta property="og:image" content="(.+?)"/i),
            type = responseText.match(/<meta property="og:type" content="(?:.+?:)?(.+?)"/i)[1],
            obj = responseText.match(/<script.*?type="application\/ld\+json">(.+?)<\/script>/i),
            url = null,
            name = null;
        if(!!obj && !!obj[1]) {
            obj = JSON.parse(obj[1]);
            // image is either a single URL string or an object holding an "@list" array.
            var obT = typeof obj.image;
            if(obT === "string") {
                queueDownload(obj.image);
            }
            else if(obT === "object") {
                for(var i of obj.image["@list"])
                    queueDownload(i);
            }
            if(type === "video") {
                // Compose the video URL from the post ID and the frame's media ID.
                name = "tumblr_" + ogI[1].match(/\/tumblr_([a-zA-Z0-9]+)_frame/)[1];
                url = "https://www.tumblr.com/video_file/" + baseURL.match(/\/post\/([0-9]+)/)[1] + "/" + name;
                name = name + ".mp4";
            }
            else {
                // Non-video posts end up here even if images were queued above.
                throw new Error("Media not located in object.");
            }
        }
        else if(!!ogI && !!ogI[1]) {
            // Fallback: no JSON object found, so use og:image as before.
            url = ogI[1];
        }
        else {
            throw new Error("Media not located in page.");
        }
        setURL(url, name);
    }
    catch (e) { log(e.message); }
    finally { finish(); }
})();
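The image handling could also be factored into a small helper that tolerates a missing or null image property. This is only a sketch: `imagesFromLD` is a hypothetical name, the sample markup in the comments is invented, and the `[\s\S]` pattern (which lets the JSON span several lines) is a deliberate change from the `.+?` used above.

```javascript
// Hypothetical helper: returns every image URL found in the JSON-LD block, or [].
function imagesFromLD(html) {
    // [\s\S] also matches newlines, in case the JSON is pretty-printed.
    var m = html.match(/<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]+?)<\/script>/i);
    if (!m) {
        return [];
    }
    var data;
    try {
        data = JSON.parse(m[1]);
    }
    catch (e) {
        return []; // Malformed JSON: treat it as "no images".
    }
    var image = data && data.image;
    if (typeof image === "string") {
        return [image]; // Single image given as a plain URL.
    }
    if (image && Array.isArray(image["@list"])) {
        return image["@list"].slice(); // Several images wrapped in "@list".
    }
    return []; // image missing, null, or of an unexpected shape.
}

// Invented sample markup:
// imagesFromLD('<script type="application/ld+json">{"image":"https://example.com/a.jpg"}</script>')
// returns ["https://example.com/a.jpg"]
```

Inside the resolver, each returned URL would then be passed to queueDownload in a loop, which avoids the TypeError that `obj.image["@list"]` raises when image is null.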

And here is the entire plugin:

{
  "resolve": "(function() {\n\ttry {\n\t\tvar ogI = responseText.match(/<meta property=\"og:image\" content=\"(.+?)\"/i),\n\t\t\ttype = responseText.match(/<meta property=\"og:type\" content=\"(?:.+?:)?(.+?)\"/i)[1],\n\t\t\tobj = responseText.match(/<script.*?type=\"application\\/ld\\+json\">(.+?)<\\/script>/i),\n\t\t\turl = null,\n\t\t\tname = null;\n\t\tif(!!obj && !!obj[1]) {\n\t\t\tobj = JSON.parse(obj[1]);\n\t\t\tvar obT = typeof obj.image;\n\t\t\tif(obT === \"string\") {\n\t\t\t\tqueueDownload(obj.image);\n\t\t\t}\n\t\t\telse if(obT === \"object\") {\n\t\t\t\tfor(var i of obj.image[\"@list\"])\n\t\t\t\t\tqueueDownload(i);\n\t\t\t}\n\t\t\tif(type === \"video\") {\n\t\t\t\tname = \"tumblr_\" + ogI[1].match(/\\/tumblr_([a-zA-Z0-9]+)_frame/)[1];\n\t\t\t\turl = \"https://www.tumblr.com/video_file/\" + baseURL.match(/\\/post\\/([0-9]+)/)[1] + \"/\" + name;\n\t\t\t\tname = name + \".mp4\";\n\t\t\t}\n\t\t\telse {\n\t\t\t\tthrow new Error(\"Media not located in object.\");\n\t\t\t}\n\t\t}\n\t\telse if(!!ogI && !!ogI[1]) {\n\t\t\turl = ogI[1];\n\t\t}\n\t\telse {\n\t\t\tthrow new Error(\"Media not located in page.\");\n\t\t}\n\t\tsetURL(url, name);\n\t}\n\tcatch (e) { log(e.message); }\n\tfinally { finish(); }\n})();",
  "keepReferrer": true,
  "prefix": "tumblr.com post (Custom)",
  "author": "MegaScience",
  "ns": "tumblr.com",
  "type": "sandbox",
  "match": "^http:\/\/([a-zA-Z0-9\\-]+\\.)?tumblr\\.com\/post\/"
}
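For anyone adapting this, the escaped "resolve" string does not have to be written by hand. A minimal sketch, assuming the resolver source is available as a string: JSON.stringify produces the `\n`, `\t`, and `\"` escapes. The `resolveSource` value below is a shortened stand-in, not the real resolver.

```javascript
// Shortened stand-in for the resolver source shown above.
var resolveSource = "(function() {\n\ttry {\n\t\tsetURL(baseURL);\n\t}\n\tfinally { finish(); }\n})();";

var plugin = {
    resolve: resolveSource,
    keepReferrer: true,
    prefix: "tumblr.com post (Custom)",
    author: "MegaScience",
    ns: "tumblr.com",
    type: "sandbox",
    // Same pattern as in the plugin above, written as a JavaScript string.
    match: "^http://([a-zA-Z0-9\\-]+\\.)?tumblr\\.com/post/"
};

// Two-space indentation, matching the layout of the plugin file above.
var text = JSON.stringify(plugin, null, 2);
```

Parsing `text` back with JSON.parse yields the original source unchanged, which is a quick way to confirm that nothing was lost in escaping.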

nmaier commented 8 years ago

Can you do a proper pull request when ready?