dhowe / AdNauseam

AdNauseam: Fight back against advertising surveillance

Automate large-scale ad collection by topic, profile, demographic #612

Closed: dhowe closed this issue 8 years ago

dhowe commented 8 years ago

Requirements:

  1. Collect ads for a specific profile/interest-set/demographic
  2. Efficient harvesting (automated and in parallel)
  3. Automated image downloading (named with page/profile/timestamp?)

Tasks:

  1. Script to read an AdNauseam JSON file and download the images in parallel (Python most likely; named with page/profile/timestamp? see the sketch after this list)
  2. Batch image conversion (later)
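
A possible shape for task 1, as a minimal sketch rather than AdNauseam's own tooling: since the exact export schema isn't pinned down in this thread, the script walks the whole JSON and keeps any string that looks like an image URL, then downloads them in parallel. Filenames use index + host here; the page/profile/timestamp naming can be layered on once the schema is settled.

```python
"""Hypothetical downloader for task 1: walk an AdNauseam JSON export,
collect anything that looks like an image URL, fetch in parallel."""
import json
import re
import sys
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlretrieve

IMG_RE = re.compile(r'^https?://\S+\.(png|jpe?g|gif|webp)', re.I)

def walk(node, found):
    # Recursively collect image-like URLs from any JSON shape,
    # so no particular export schema is assumed.
    if isinstance(node, dict):
        for v in node.values():
            walk(v, found)
    elif isinstance(node, list):
        for v in node:
            walk(v, found)
    elif isinstance(node, str) and IMG_RE.match(node):
        found.add(node)

def fetch(item):
    # Name files index_host.ext; swap in page/profile/timestamp
    # once we settle on what the export actually records.
    i, url = item
    ext = IMG_RE.match(url).group(1).lower()
    name = '%04d_%s.%s' % (i, urlparse(url).netloc, ext)
    try:
        urlretrieve(url, name)
        return url, True
    except Exception:
        return url, False

def main(path):
    with open(path) as f:
        urls = set()
        walk(json.load(f), urls)
    print('found %d image urls' % len(urls))
    with ThreadPoolExecutor(max_workers=10) as pool:  # 10 downloads in flight
        for url, ok in pool.map(fetch, enumerate(sorted(urls))):
            if not ok:
                print('failed:', url)

if __name__ == '__main__':
    main(sys.argv[1])
```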

Existing options:

References:

cqx931 commented 8 years ago

STEP 1 Getting a set of websites according to topic/demographic

(I will call this approach #1.) From the topic/demographic we need to get to a set of websites. For some of these cases, we may be able to use 'top * sites' lists, like 'Top 100 Sports Sites'. Once we have a set of sites, we can visit them and extract links to other sites (a normal crawling procedure). Ideally this process would continue indefinitely, with more and more ads being found over time (some research projects take a similar approach).

I had not thought of your idea of using ad-blocking lists (approach #2), but it is a good one. As I say above, the key is getting ads for a specific topic. Maybe we can find some topic-specific ad lists? Language is of course one very limited way of doing this…

Another way would be (#3) to start with a bunch of keywords, then search them on search engines and simply load the pages that the search engine returns (and possibly follow links from these pages as well).

A few thoughts about this:

  1. Topic is easier than demographic, as a topic maps to keywords directly, while a demographic is a set of possible interest points. We can start with topics and find a way to combine them into a demographic profile later (personally I find the demographic profile more interesting...).

1. Top site list + crawling: In the crawling process, are we going to follow all the links to other websites, or is the range of links constrained to the topic area? One idea I have about crawling is to use this method to go through more content pages of a website, as many ads are on the content pages rather than the homepage. This process is also closer to how a real user browses. By doing this we could increase the number of ads found on one website (other approaches could be repeating the process, or running the same test set on another day), so a shorter website list would be needed for the same number of ads.

2. Ad-blocking list: My idea for using this goes more in the direction of getting a long list of websites with visual ads... I didn't find any topic-categorized ad-blocking lists, but one interesting thing we could do with these sites is to fetch each website's meta info (or feed it to Google and take the first search result). After that, we could analyze these short descriptions and match them to specific categories by keyword (a quick sketch follows). But clearly more effort and time are needed for this method.
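
A rough sketch of that meta-info idea; the category keyword lists are invented placeholders, and the regex-based HTML scraping is deliberately crude:

```python
"""Sketch: categorize a site by its <meta name="description"> text.
The categories/keywords below are invented placeholders."""
import re
import urllib.request

CATEGORIES = {
    'sports':  ['sport', 'football', 'nba', 'soccer'],
    'finance': ['finance', 'loan', 'credit', 'mortgage'],
}

# Crude regex scrape of the description tag; good enough for a sketch,
# though it misses tags whose content attribute comes before name.
META_RE = re.compile(
    r'<meta[^>]+name=["\']description["\'][^>]+content=["\']([^"\']+)', re.I)

def categorize(url):
    try:
        head = urllib.request.urlopen(url, timeout=10).read(65536)
    except Exception:
        return None
    m = META_RE.search(head.decode('utf-8', 'ignore'))
    if not m:
        return None
    desc = m.group(1).lower()
    for cat, words in CATEGORIES.items():
        if any(w in desc for w in words):
            return cat
    return None

print(categorize('http://www.100topsportsites.com/'))
```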

cqx931 commented 8 years ago

Time needed for the first 1000 ads: In sessbench, I set the wait time to 10 s to collect the first 1000 ads. I went through approx. 800 sites, so the running time was 8000 seconds, roughly two and a quarter hours. But sessbench sometimes gets stuck on specific sites (I will try to find a few sample cases later), and then I need to manually reset the session, so the total time was closer to 3 hours.

As I was not sure how many sites I would need to go through to get 1000 ads, I grouped them into units of 100 and fed them to sessbench manually this time. But if we solve the above problem and make sure sessbench runs smoothly, we can just copy-paste the whole list...

dhowe commented 8 years ago

> 1. Top site list + crawling: In the crawling process, are we going to follow all the links to other websites, or is the range of links constrained to the topic area?

The idea is that by starting with a topic, say 'sports' sites for example, and picking random links on those pages, we would stay, to some degree, within sports-related sites. Of course, sometimes this will not happen, but that's fine as long as it happens a high enough percentage of the time. Also, remember that we will be tracked during these sessions, so assuming a clean profile at the start, beginning with sports means our profile will lead us to more sports-related content, especially if we click ads some percentage of the time...
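
A rough sketch of this random-walk idea in plain Python (a real run would go through the instrumented browser so ads load and the tracking profile builds; this only demonstrates the link-picking):

```python
"""Sketch of the random-walk idea: start from a topic seed page and
repeatedly follow one random outgoing link."""
import random
import re
import urllib.request

# Absolute http(s) links only; relative links are ignored in this sketch.
LINK_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.I)

def random_walk(seed, steps=20):
    url = seed
    for _ in range(steps):
        print(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode('utf-8', 'ignore')
        except Exception:
            url = seed  # dead end: restart from the seed page
            continue
        links = LINK_RE.findall(html)
        url = random.choice(links) if links else seed

random_walk('http://www.100topsportsites.com/')
```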

cqx931 commented 8 years ago

http://www.100topsportsites.com/

cqx931 commented 8 years ago

I just checked the three existing tools you mentioned, and I think we could use the extract_links function from OpenWPM to get a list of links as you mentioned above.

Though one thing I am not sure about is the ad clicking you described. When I did the 1000-ads test using a waiting time of 10 seconds, only around 30 ads had been clicked by the time I finished the whole collection. If we want ad clicking to influence the profile, the waiting time should be longer, so that AdNauseam has time to click the ads. Another question: when AdNauseam clicks an ad, the default settings prevent the user from being tracked and block the leak of any privacy-related information. So if you want a process like the one you described above, does that mean I need to uncheck all of these options so that the profile can be tracked?

(screenshot: AdNauseam's ad-clicking and privacy settings)

cqx931 commented 8 years ago

And I don't really understand what you mean by 'parallel' harvesting. Does it mean downloading the ad images at the same time as they are found? Where is this downloading process going to happen? In the background in AdNauseam as part of the automated mode, or separately, in Python?

> 2) Plan for efficient harvesting (automated and in parallel)
> 3) Script to read an adn JSON file, and download all the images (in parallel), probably python

cqx931 commented 8 years ago

My plan for next step is:

  1. Use OpenWPM to crawl through the top 100 sports sites
  2. Use the extract_links function from OpenWPM to get a list of websites
  3. See what the results look like (how many links, and how well they relate to the topic).

dhowe commented 8 years ago

> I just checked the three existing tools you mentioned, and I think we could use the extract_links function from OpenWPM to get a list of links as you mentioned above.

Following links is trivial -- any crawler can do this...

> Another question: when AdNauseam clicks an ad, the default settings prevent the user from being tracked and block the leak of any privacy-related information. So if you want a process like the one you described above, does that mean I need to uncheck all of these options so that the profile can be tracked?

Yes, we need to think of this as separate from how AdNauseam usually works (in fact, AdNauseam may not be the best tool to use for harvesting ads). But if we do use it, then we want to disable all the privacy protections, so that the profile is reinforced as it runs.

> And I don't really understand what you mean by 'parallel' harvesting.

I mean that we should be able to send multiple requests at once, rather than waiting for each to finish before starting the next (see the NUM_BROWSERS setting in OpenWPM here). In the case of sessbench, if we have 100 URLs, there is no reason not to start all 100 requests at the same time, or perhaps 10 at a time. If we are crawling a page in Python and find 10 links, we can visit all 10 simultaneously. This is how crawlers usually work. The only counter-argument, in our case, is when we want to simulate ads for a real user profile; then we need to mimic user behavior, which generally means loading one page at a time.
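
For illustration, parallel fetching in plain Python with a pool of 10 workers (OpenWPM's NUM_BROWSERS achieves the same thing with full browser instances; the URL list here is a placeholder):

```python
"""Sketch of parallel harvesting in plain Python: fetch a batch of URLs
10 at a time instead of sequentially."""
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    try:
        return url, len(urllib.request.urlopen(url, timeout=10).read())
    except Exception:
        return url, None  # errors shouldn't stall the whole batch

urls = ['http://example.com/page%d' % i for i in range(100)]  # placeholder

with ThreadPoolExecutor(max_workers=10) as pool:  # 10 requests in flight
    for url, size in pool.map(fetch, urls):
        print(url, size)
```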

A separate question is about downloading the images (which adn currently does not do, though I think it probably should, at least as an option). So we can add some code to adn to enable this, or we can simply write a script using bash/jq/python or whatever to parse an exported JSON file and then download all the images (this is the quicker solution, but it doesn't help with ads going stale in the vault in adn).

Let's discuss before you start (I'm not sure that's the right plan) -- meantime, take a look at the AdScape paper...

cqx931 commented 8 years ago

Using the script that I pushed to AdCollector, I tried two slightly different approaches, one per category, as an experiment.

1. Keywords: finance, business, credit card, mortgage, loan. For this I used the approach we discussed at school: first doing a few Google searches with the above keywords, then going through the URL list I fetched from EasyList. I could see some slight influence of the Google searches on the ads gathered, but it is not very obvious when presented as an image collage.

2. Keywords: game, entertainment, movie, tv, video. In this experiment, I filtered the URL list from EasyList by these keywords (sketched below) and combined it with the top-100 site lists found online to create a topic-specific URL list to go through. This time the images are obviously more related to the topic, but the number of images is only half of what I got in the first experiment. The topic should have an influence on this; at the same time, the chance of getting visual ads from top-100 site lists is also lower.

For both approaches I ran 8 parallel sessions of 100 URLs each; the total running time was around 1 hour.
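
The keyword filter from experiment 2 is essentially this (the input file name and keyword list are placeholders):

```python
"""Sketch of the experiment-2 keyword filter: keep only URLs from the
EasyList-derived list that mention one of the topic keywords."""
KEYWORDS = ['game', 'entertainment', 'movie', 'tv', 'video']

with open('easylist_urls.txt') as f:  # hypothetical URL-per-line list
    urls = [line.strip() for line in f if line.strip()]

matched = [u for u in urls if any(k in u.lower() for k in KEYWORDS)]
print('%d of %d urls match' % (len(matched), len(urls)))
```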

Problems to solve: 1. Sometimes a website still gets stuck, so the run can't get through the whole test.

As the next step, I can work on the script to read and combine several JSON files, as well as download the images (a merge sketch follows). Meanwhile, I can also try some analysis of the results in the JSON files.
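
A minimal sketch of the combining step, assuming the exports are plain JSON files whose top level is either a list or a dict (usage: `python merge_exports.py a.json b.json > combined.json`):

```python
"""Sketch: combine several AdNauseam JSON exports into one. Assumes the
top level is either a list (concatenate) or a dict (merge, with later
files winning on key clashes)."""
import json
import sys

merged = None
for path in sys.argv[1:]:
    with open(path) as f:
        data = json.load(f)
    if merged is None:
        merged = data
    elif isinstance(merged, list) and isinstance(data, list):
        merged += data
    elif isinstance(merged, dict) and isinstance(data, dict):
        merged.update(data)  # later exports overwrite on clashes

json.dump(merged, sys.stdout, indent=2)
```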

Another possible direction is to browse more pages within each website, to increase the number of ads collected from one website.

dhowe commented 8 years ago

Questions:

Next tries:

  1. with a google login
  2. following links from the original pages (see example)

Scripts:

cqx931 commented 8 years ago

1. Browser settings: all default; Flash enabled, 3rd-party cookies allowed.
2. Browser gets 'stuck': Selenium stops on certain pages and doesn't move on to the next test. Chrome sometimes pops up "Do you want to leave this page?", which could be one reason why it gets stuck.
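
One possible (untested here) Selenium-side workaround: clear the page's onbeforeunload handler before navigating, and accept any dialog that still appears:

```python
"""Untested workaround sketch for the "leave this page" dialog."""
from selenium import webdriver
from selenium.common.exceptions import (NoAlertPresentException,
                                        WebDriverException)

driver = webdriver.Chrome()

def safe_get(url):
    try:
        # "Do you want to leave this page?" comes from onbeforeunload.
        driver.execute_script("window.onbeforeunload = null;")
    except WebDriverException:
        pass
    driver.get(url)
    try:
        driver.switch_to.alert.accept()  # clear any stray dialog
    except NoAlertPresentException:
        pass
```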

cqx931 commented 8 years ago

Here are the two profiles I tested with. Both locations are set to HK, so these two could be used for the "Asian male" model. Do we have any preference of topics for the two models for the exhibition?

(two screenshots: the test profiles' ad-interest settings)

dhowe commented 8 years ago

Can you post the full list of topics somewhere...? I think these selections are a bit too generic (I work with models that have very specific interests)

cqx931 commented 8 years ago

Here is the full list: https://support.google.com/adwords/answer/156178?hl=en

dhowe commented 8 years ago

This issue was moved to dhowe/AdCollector#5