dhowe / AdNauseam

AdNauseam: Fight back against advertising surveillance
GNU General Public License v3.0

Add option to "include images" when exporting ads #639

Open dhowe opened 7 years ago

dhowe commented 7 years ago

Should download to a folder (or zip) in the same location as the json file, with the same name as the json file, minus the extension

Related to #252 (though it's not crucial in this case that we grab images from the cache)

leoneckert commented 7 years ago

Working on this right now. Downloading into a folder is not possible (for security reasons, Chrome doesn't let an extension choose where to store things). Trying out zips now, also because it's nicer to have a single download popping up for ALL the images, rather than 100+ individual downloads.

dhowe commented 7 years ago

there are many image downloaders available which must do the same: https://chrome.google.com/webstore/search/image%20downloader?hl=en

leoneckert commented 7 years ago

This seems almost solved - though I spent the last 2 hours on callback complications that I still could not fix.

For the above reasons I found exporting the images as a zip to be the right solution. I used the jsZip library (it seems to be widely used), including it in lib/ and linking to it in background.html (of course I didn't open a PR for that yet, just experimenting).

The library makes it easy to build a zip containing an image folder and to download it with another library, FileSaver.js (I'm sure there are alternatives for this one), but it requires the images to be converted to base64 first. I found and tested two ways to do this: 1) using XMLHttpRequest, then transforming the response with Uint8Array, String.fromCharCode and btoa() - this seemed to throw stack errors when working with too many images; 2) the method that seems more common, which works with a canvas:

    function getBase64(url) {
        var image = new Image();
        image.onload = function () {
            var canvas = document.createElement('canvas');
            canvas.width = this.naturalWidth;   // or 'width' if you want a special/scaled size
            canvas.height = this.naturalHeight; // or 'height' if you want a special/scaled size
            canvas.getContext('2d').drawImage(this, 0, 0);
            var rawData = canvas.toDataURL('image/png').replace(/^data:image\/(png|jpg);base64,/, '');

            return rawData; // note: this returns from the onload handler, not from getBase64 itself
        };
        image.src = url;
    }
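
For completeness, the XMLHttpRequest method from 1) looked roughly like this (a sketch from memory, untested; the stack errors came from a variant that used String.fromCharCode.apply on the whole byte array instead of a loop):

    function getBase64ViaXHR(url, callback) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url, true);
        xhr.responseType = 'arraybuffer';
        xhr.onload = function () {
            var bytes = new Uint8Array(xhr.response);
            var binary = '';
            for (var i = 0; i < bytes.length; i++)
                binary += String.fromCharCode(bytes[i]);
            callback(btoa(binary)); // base64 string, ready for jsZip's {base64: true}
        };
        xhr.send();
    }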

After that, jsZip lets you create a folder (which I gave the same name as the json file, minus the '.json' extension) and add files like this:

var zip = new JSZip();
var dir = zip.folder([foldername]);

next, and that's where callbacks were my big trouble, I wanted to loop over the image sources and add them to the folder:

imgURLs.forEach(function (src) { // imgURLs: the collected image urls
    dir.file([imagename], getBase64(src), {base64: true});
});

after this the zip gets downloaded this way:

zip.generateAsync({type: "blob"})
    .then(function (content) {
        saveAs(content, "example.zip");
    });

I tried every variation of this I could think of - different returns and/or callback logics etc. - but it always came down to the base64 not being created quickly enough: the program jumps ahead and stores either zero-byte images in the zip, or no images at all.
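
One thing I haven't properly tried yet: wrapping the conversion in a Promise and only zipping once all of them resolve. A rough, untested sketch (assuming the imgURLs array of sources, and zip/dir/saveAs from above):

    function getBase64Async(url) {
        return new Promise(function (resolve, reject) {
            var image = new Image();
            image.onload = function () {
                var canvas = document.createElement('canvas');
                canvas.width = this.naturalWidth;
                canvas.height = this.naturalHeight;
                canvas.getContext('2d').drawImage(this, 0, 0);
                resolve(canvas.toDataURL('image/png')
                    .replace(/^data:image\/(png|jpg);base64,/, ''));
            };
            image.onerror = reject;
            image.src = url;
        });
    }

    // only generate the zip once every image has been converted
    Promise.all(imgURLs.map(getBase64Async)).then(function (images) {
        images.forEach(function (data, i) {
            dir.file('image_' + i + '.png', data, {base64: true});
        });
        return zip.generateAsync({type: 'blob'});
    }).then(function (content) {
        saveAs(content, 'example.zip');
    });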

I wonder if maybe we could get around this problem with the chrome.storage we already talked about: save the images there at a lower pace as they come in, and then, if the user wishes to download/export the actual image files, pull them all from there.

If there is any ultimate JavaScript callback-problem workaround that you guys know of, please let me know and I can give that a try, too.

dhowe commented 7 years ago

I don't understand. Is the problem that your code will not wait until the zip is created before downloading it? Also, once the zip is created, downloading should be simple (not sure why we need another library); we already do this with the json files (it should be no different for a zip)

leoneckert commented 7 years ago

1) I create the zip, 2) create a folder inside it, 3) then, for each url, I create base64 imageData and append it to the folder

dhowe commented 7 years ago

I don't see where you are actually downloading the image from the remote site (or are you running this code from vault.js, where the images already exist)? If the latter, then the problem should be no different from any of the many image-downloader extensions, each of which downloads images from the current web page into a folder (usually without first creating a zip and/or converting to base64)...

But generally speaking, if you need help, you should create a topic branch, push it to github, then include a link (or at least the branch name) so that others can examine... otherwise we are only guessing

leoneckert commented 7 years ago

Okay - in the getBase64(url) function I posted above, the pixels of the image are actually downloaded from remote (by creating an image element and setting its src) and drawn onto a canvas in order to convert them to base64. The other method I tried, using XMLHttpRequest and manipulating the response, requests the data in a more obvious manner - but both download from remote, all happening in core.js.

From my research, downloading into a folder is not possible unless that folder is created in JS and then downloaded as one zip. Downloading images individually into whatever the default Downloads folder is becomes a real mess once more than 5 or so img ads have been collected.

I can give the method you mention a try and download them directly from the vault DOM - if that doesn't work, I'll create a topic branch as you say.

leoneckert commented 7 years ago

+ I'll go through one of these image extensions. They seem to render images in their interface first too, before offering the download-as-zip option - not doing the remote load AND local download in one step - so doing it from the vault DOM seems good.

leoneckert commented 7 years ago

Update: figured it out! The code from the extension above helped a lot in solving that callback problem, using a timeout to wait for the base64 string to be created. Sending a PR shortly so you can test.

leoneckert commented 7 years ago

So the above PR is still in development. The basis of it should be the right approach, but it still produces broken zip files when there are many photos - which I think is not due to their number, but due to messy urls or names that have to be sanitised. I can get back to this on the weekend.

leoneckert commented 7 years ago

This is not meant to be merged of course, just to have a look and/or test.

dhowe commented 7 years ago

Ok. I will leave you to it then, unless there is something specific you want me to look at... The code here looks like it might have potential as well

please use the regular chrome.downloads API for the actual zip download

leoneckert commented 7 years ago

This branch https://github.com/leoneckert/AdNauseam/tree/zipExportImages2 works*. If you could test it and tell me which enhancements to implement, that would be great. One clear thing: it should be optional (as the title of this ticket says). For the time being, I just used this line for testing: if (true) saveVaultImages(filename); to make it always download both the json and the images as a zip.

I spent a lot of time on this again today, figuring out the following:

I started with the base64 creation techniques from this extension (which itself works great, downloading multiple images in a zip), together with the jszip library I mentioned above and the vAPI download function we use in ADN. Looking at the extension solved all the callback problems I had, by waiting with a setInterval before calling a specific function. However, I still ran into problems with the number of images in a zip - whatever I tried, 18 images seemed to be the most I could fit without Chrome throwing this "Network error":

[screenshot: Chrome download failing with "Network error"]
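
The waiting pattern borrowed from that extension is roughly this (a sketch; files is the array of converted images the branch collects, imgURLs the list of urls, zipAndDownload a hypothetical stand-in for the next step):

    // poll until every image has produced its base64 data, then move on
    var timer = setInterval(function () {
        if (files.length === imgURLs.length) {
            clearInterval(timer);
            zipAndDownload(files);
        }
    }, 100);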

I then tried to use the exact zipping code from the extension (instead of jszip), but that didn't help. I also tried creating multiple folders in the zip, and creating multiple zips (the latter worked, but shouldn't be an option).

Finally I looked at the permissions the extension asks for, which include "downloads". I added that to our manifest and looked it up in the Chrome docs, learning that it "gives access to use the chrome.downloads api". This api uses a different download function than ours (which I know you told me to use). It looks like this:

chrome.downloads.download({
   url: "data:application/zip;base64,"+content,
   filename: zipName+".zip" // Optional
});

instead of

vAPI.download({
   'url': "data:application/zip;base64,"+content,
   'filename': zipName+"_"+zipcount
});

... with both the manifest permission and the first download technique above, it worked. Not with vAPI.download. Is that a problem? I am not able to tell what it is about vAPI.download that throws the error; my only vague guess, based on forum posts, is a missing "content-length" header in the download.
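
For reference, the manifest change is just the extra entry in the permissions list (excerpt; everything else in our manifest.json stays as it is):

    "permissions": [ "downloads" ]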

* please let me know if I'm still getting the share-a-branch-instead-of-sending-PRs workflow wrong.

dhowe commented 7 years ago

@cqx931 can you check the ticket above in this branch: https://github.com/leoneckert/AdNauseam/tree/zipExportImages2 ?

The goal is to optionally download all images with each ad export (note: we don't have a UI for this, but want the feature to work before we start integrating it as an option)

cqx931 commented 7 years ago

@leoneckert nice work! Though I had some trouble with this line of code during my test: https://github.com/leoneckert/AdNauseam/blob/zipExportImages2/src/js/adn/core.js#L1656

Data like the following never triggers image.onload(), so the files array ends up shorter than the list of images and, as a result, no images get downloaded at all.

"smh.com.au::http://www.smh.com.au/": { "id": 59, "attempts": 0, "visitedTs": 0, "attemptedTs": 0, "contentData": { "src": "http://www.smh.com.au/", "width": -1, "height": -1 }, "contentType": "img", "title": "Pending", "resolvedTargetUrl": null, "foundTs": 1479551247605, "targetUrl": "http://optimized-by.rubiconproject.com/t/7856/77544/365260-2.3253545.3289557?url=http%3A%2F%2Fitunes.apple.com%2Fus%2Fapp%2Fstuff-co-nz%2Fid383545689%3Fmt%3D8", "pageTitle": "http://www.smh.com.au/", "pageUrl": "http://www.smh.com.au/", "errors": null, "current": true, "pageDomain": "smh.com.au", "version": "2.3.87", "targetDomain": "rubiconproject.com" }

When I go to smh.com.au now, I don't see this data being collected (might it be caused by a specific kind of ad?). But to solve the downloading problem, I think the most appropriate approach is to verify that the image url is valid when collecting the ad data - cases like these are not valid ads, and it makes no sense to gather the data anyway.

Something like this:

    function checkImageURL(url) {
        return url.match(/\.(jpeg|jpg|gif|png)$/) != null;
    }

PS: in the latest commit, manifest.json refers to option_ui.html, but the actual options file hasn't been renamed and the option_ui.html testing page hasn't been created yet. Anyway, this is minor; I just noticed it when I tried to install the extension.

dhowe commented 7 years ago

So what issues remain, besides the UI itself?

cqx931 commented 7 years ago

The above issue - the download fails if there is an invalid img url. And if there are more than about 200 imgs, I see "Failed, System error" when downloading.

dhowe commented 7 years ago

Over 200k? And they also fail without a file extension?

We should be able to try/catch all these errors

leoneckert commented 7 years ago

Good catch(es) @cqx931! I guess you mean to implement the checkImageURL at a lower level - when the ads are collected, not within the downloading function, correct? Otherwise I could do the check in this loop:

    var imgURLs = []; // extract image urls
    adlist().forEach(function (ad) {
      if (ad.contentType === 'img')
        imgURLs.push(ad.contentData.src);
    });

Do you mean that if there are more than 200 images it doesn't work? A simple solution would be to download in batches of 200 (rough sketch at the end of this comment), though hopefully there is a more elegant way...

EDIT: I just checked, and with 300 images it gave me the "Failed, System busy" error, too. Cutting down to under 200, it worked. So yes, that's a problem.
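
For the record, the batching idea could be as simple as this (hypothetical sketch; splitBatches does not exist in the branch):

    // split the collected urls into batches of at most 200
    function splitBatches(urls, size) {
        var batches = [];
        for (var i = 0; i < urls.length; i += size)
            batches.push(urls.slice(i, i + size));
        return batches;
    }

    splitBatches(imgURLs, 200).forEach(function (batch, n) {
        // build and download one zip per batch here,
        // e.g. name_part1.zip, name_part2.zip, ...
    });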

cqx931 commented 7 years ago

The contentType of the case I mentioned is still img, but some failure while loading the image causes the mismatch between the file count and the img count.

Daniel mentioned that we don't want to check the image suffix, as a lot of the ad images don't have one. So you might want to catch the failing img.onload error and factor that in when you check whether loading has finished in the if statement.

leoneckert commented 7 years ago

@cqx931 sounds good, I can work on that right now!

leoneckert commented 7 years ago

I just updated the branch with a catch for images that don't load. It turned out try/catch did not work for that; what works is an img.onerror handler like this:

(function (url) {
    var parts = url.split("/");
    var filename = parts[parts.length - 1];
    // for now:
    filename = "image_" + i + ".jpg";
    var img = new Image();
    img.onload = function () {
        var a = document.createElement('a');
        a.href = this.src;

        // Add to file array [{name, data}]
        files.push({name: filename, data: getBase64Image(img)});
    };
    img.onerror = function () {
        // the image failed to load: remove its url from the list,
        // so the "is everything loaded yet?" check still adds up
        var index = imgURLs.indexOf(url);
        if (index > -1) {
            imgURLs.splice(index, 1);
        }
    };
    img.src = url;
})(url);
cqx931 commented 7 years ago

@leoneckert I have checked your latest update, and your solution works quite well for the image onload problem. The download function now works as long as there are fewer than approx. 300 images, with a total file size of less than approx. 20MB.

I have looked into this problem a little and tried to get more information about a failing download by getting its download id and inspecting the download object, though the information available is still very limited. Comparing the object for a failing download against one for a successful download:

You can see that Chrome failed to get the correct file size and mime information for the failing download. And I see no error message.
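
For anyone reproducing this, I checked the download objects roughly like this (a sketch using the standard chrome.downloads API; dataUrl and zipName stand in for the values from the earlier snippets):

    chrome.downloads.download({
        url: dataUrl,
        filename: zipName + ".zip"
    }, function (downloadId) {
        chrome.downloads.search({id: downloadId}, function (results) {
            // fileSize, mime, state and error are the interesting fields
            console.log(results[0]);
        });
    });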

cqx931 commented 7 years ago

https://github.com/dhowe/AdNauseam/pull/733 With the latest update, the max image output has been pushed to approx. 725 images (70MB).

If it goes beyond that, AdNauseam just crashes... I have created a new ticket for this issue: https://github.com/dhowe/AdNauseam/issues/734

We should show some loading/progress bar to the user in the downloading interface; the waiting time is quite long when there are many images. A progress bar for the checking process should be easy, but after the checking process it still takes a while from vAPI.download until the real download file shows up. I'm not sure whether there is a way to tell the user how long they need to wait for that (one rough idea below).
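
One possibility (an untested sketch): listen for Chrome resolving the download's target filename via chrome.downloads.onChanged, and only remove the waiting indicator then:

    chrome.downloads.onChanged.addListener(function (delta) {
        // delta.filename appears once Chrome has actually accepted the download
        if (delta.filename) {
            hideWaitingIndicator(); // hypothetical UI hook
        }
    });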