Automate large-scale ad collection by topic, profile, demographic

dhowe commented 7 years ago

From @dhowe on November 10, 2016 17:35

Requirements:

Collect ads for a specific profile/interest-set/demographic
Efficient harvesting (automated and in parallel)
Automated image downloading (named with page/profile/timestamp?)

Tasks:

Script to read an AdNauseam JSON file, and download the images (in parallel, python most likely, named with page/profile/timestamp?)
Batch image conversion (later)
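
A possible naming scheme for the downloaded images, as a minimal sketch. The function name `ad_image_filename`, the `profile__page__timestamp` layout, and the fallback extension are all assumptions for illustration, not something decided in this issue:

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlparse


def _safe(text):
    """Replace anything that is awkward in a filename with a dash."""
    return re.sub(r"[^A-Za-z0-9.-]+", "-", text).strip("-")


def ad_image_filename(image_url, page_url, profile, found_at):
    """Build a name like 'profile__page-host__20161110T173500Z.jpg' (hypothetical layout)."""
    host = urlparse(page_url).netloc or "unknown-page"
    # Keep the image's own extension when it has a known one; fall back to .img.
    match = re.search(r"\.(png|jpe?g|gif|webp|svg)$", urlparse(image_url).path, re.I)
    ext = match.group(1).lower() if match else "img"
    # Use a UTC timestamp so names sort chronologically across machines.
    stamp = found_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{_safe(profile)}__{_safe(host)}__{stamp}.{ext}"
```

Putting the profile first means a plain directory listing groups images by profile, then by page.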

Existing options:

https://github.com/citp/OpenWPM
http://webxray.org/
https://github.com/ACAHNN/adscape (*apparently dead)

References:

Barford, P. and I. Canadi and D. Krushevskaja and Q. Ma and S. Muthukrishnan. "Adscape: Harvesting and Analyzing Online Display Ads", in Proceedings of the World Wide Web Conference (WWW '14), Seoul, South Korea, April, 2014.
Carrascosa, Juan Miguel, et al. "I Always Feel Like Somebody's Watching Me. Measuring Online Behavioural Advertising." arXiv preprint arXiv:1411.5281 (2014).
Castelluccia, C. and M. Kaafar, and M. Tran. "Betrayed by Your Ads!", in Privacy Enhancing Technologies, p1-17. Springer, 2012.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated experiments on ad privacy settings." Proceedings on Privacy Enhancing Technologies 2015.1 (2015): 92-112.
Guha, S. and B. Cheng, and P. Francis. "Challenges in Measuring Online Advertising Systems", in Proceedings of the ACM SIGCOMM Internet Measurement Conference, p81-87. ACM, 2010.
Metwalley, Hassan, et al. "The online tracking horde: a view from passive measurements." International Workshop on Traffic Monitoring and Analysis. Springer International Publishing, 2015.
Metwalley, Hassan, Stefano Traverso, and Marco Mellia. "Online Trackers Demystified from Passive Measurements."
Roesner, F. and T. Kohno, and D. Wetherall. "Detecting and Defending Against Third-party Tracking on the Web", in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI), p12, Berkeley, CA, USA, 2012.

Copied from original issue: dhowe/AdNauseam#612

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 2:39

STEP 1 Getting a set of websites according to topic/demographic

(I will call this approach #1). From here we need to get to a set of websites. For some of these cases, we may be able to use ‘top * sites’, like ‘Top 100 Sports Sites’. Once we have a set of sites, then we can visit them and extract links to other sites (a normal crawling procedure). Ideally this process would continue indefinitely with more and more ads being found over time (there are some research projects that take a similar approach).

I had not thought of your idea of using ad-blocking lists (approach #2), but it is a good one. As I say above, the key is getting ads for a specific topic. Maybe we can find some topic-specific ad lists? Language is of course one very limited way of doing this…

Another way would be (#3) to start with a bunch of keywords, then search them on search engines and simply load the pages that the search engine returns (and possibly follow links from these pages as well).

A few thoughts about this:

Topic is easier than demographic: a topic maps directly to keywords, while a demographic is a set of possible points of interest. We can start with topics and find a way to combine them into a demographic profile later (personally I find the demographic profile more interesting...).

1. Top site list + crawling: In the crawling process, are we going to follow all links to other websites, or are the links constrained to the topic area? One idea I have is to use crawling to go through more content pages within a website, since many ads are on content pages rather than the homepage. This is also closer to how a real user browses. Doing this would increase the number of ads found per website (other approaches would be to repeat the process, or to run the same test set on another day), so a shorter website list would be needed for the same number of ads.

2. Ad-blocking list: My idea here is more about getting a long list of websites with visual ads... I didn't find any topic-categorized ad-blocking lists, but one interesting thing we could do with these sites is to get each website's meta info, or feed the site to Google and look at the first search result. After that, we could analyze this short info and match the sites to specific categories by keyword. But this method apparently needs more time and effort.
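
The content-page idea in point 1 could look roughly like this: given a page's HTML, keep only links that stay on the same host. This is a sketch using only the Python standard library; `LinkParser`, `same_site_links`, and the `limit` default are hypothetical names and values, and it is not the OpenWPM API:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(html, page_url, limit=20):
    """Return up to `limit` unique absolute links on the same host as page_url."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href).split("#")[0]  # drop fragments
        if absolute == page_url or absolute in seen:
            continue
        if urlparse(absolute).netloc != host:
            continue  # skip links that leave the site
        seen.add(absolute)
        links.append(absolute)
        if len(links) >= limit:
            break
    return links
```

Comparing hosts exactly treats `www.example.com` and `example.com` as different sites; whether that is the right rule for ad collection is an open choice.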

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 2:53

Time needed for the first 1000 ads: In sessbench, I set the wait time to 10s to collect the first 1000 ads. As I went through approx. 800 sites, the running time was 8000 seconds, roughly 2 hours and 15 minutes. But sessbench sometimes gets stuck on specific sites (I will try to find a few sample cases for this later), and then I need to manually reset the session. So the total time was closer to 3 hours.

As I was not sure how many sites I needed to go through to get 1000 ads, I grouped the sites into units of 100 and fed them manually to sessbench this time. But if we solve the above problem and make sure sessbench runs smoothly, we can just copy and paste the whole list...
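
The 100-unit grouping could be scripted instead of done by hand. A minimal sketch (the function name `chunk` is made up here, and how each group is handed to sessbench is left out):

```python
def chunk(urls, size=100):
    """Split a list of URLs into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

With about 800 sites this yields 8 groups, so a stuck session only costs a rerun of one group rather than the whole list.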

dhowe commented 7 years ago

1. Top site list + crawling: In the crawling process, are we going to follow all links to other websites, or are the links constrained to the topic area?

The idea is that by starting with a topic, say 'sports' sites or examples, and picking random links on those pages, we would stay, to some degree, within sports-related sites. Of course, sometimes this will not happen, but that is fine as long as it happens some high enough percentage of the time. Also, remember that we will be tracked during these sessions, so assuming a clean profile at start, by starting with sports, our profile will lead us to more sports-related content, especially if we click ads some percentage of the time...

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 9:23

http://www.100topsportsites.com/

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 10:49

I just checked the three existing tools you mentioned, and I think we could use the extract_links function from OpenWPM to get a list of links as you mentioned above.

Though one thing I am not sure about is the ad clicking you described. When I did the 1000-ad test with a waiting time of 10 sec, only around 30 ads had been clicked by the time I finished the whole collection. If we do want ad clicking to influence the profile, the waiting time should be longer, so that adnauseam gets some time to click the ads. Another question is: when adnauseam clicks an ad, the default settings prevent the user from being tracked and prevent the leak of any privacy-related information. So if you want a process like what you described above, does that mean that I need to uncheck all of these settings, so that we let the profile be tracked?

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 10:55

And I don't really understand what you mean by 'parallel' harvesting. Does it mean downloading the ad images at the same time as they are found? Where is this downloading going to happen? In the background in AdNauseam as part of the automated mode? Or is it separate and in python?

2) Plan for efficient harvesting (automated and in parallel)
3) Script to read an adn JSON file, and download all the images (in parallel), probably python

dhowe commented 7 years ago

From @cqx931 on November 11, 2016 11:5

My plan for next step is:

Use OpenWPM to crawl through top 100 sports site
Use the extract_links function from OpenWPM to get a list of websites
See what the result looks like (how many links, how well they relate to the topic).

dhowe commented 7 years ago

I just checked the three existing tools you mentioned, and I think we could use the extract_links function from OpenWPM to get a list of links as you mentioned above.

Following links is trivial -- any crawler can do this...

Another question is: when adnauseam clicks an ad, the default settings prevent the user from being tracked and prevent the leak of any privacy-related information. So if you want a process like what you described above, does that mean that I need to uncheck all of these settings, so that we let the profile be tracked?

Yes, we need to think of this as separate from how AdNauseam usually works (in fact, AdNauseam may not be the best tool to use for harvesting ads). But if we do use it, then we want to disable all the privacy protections, so that the profile is reinforced as it runs.

And I don't really understand what you mean by 'parallel' harvesting.

I mean that we should be able to send multiple requests at once, rather than waiting for each to finish before doing the next (see the NUM_BROWSERS setting in OpenWPM). For the case of session-bench, if we have 100 URLs, there is no reason not to start all 100 requests at the same time, or perhaps 10 at a time. If we are crawling a page in python, and find 10 links, then we can visit all 10 simultaneously. This is how crawlers usually work. The only counter-argument, for our case, is when we want to simulate ads for a real user profile. In this case, we need to mimic user behavior, which generally means loading one page at a time.

A separate question is about downloading the images (which adn currently does not do, though I think it probably should, at least as an option). So we can add some code to adn to enable this, or we can simply write a script using bash/jq/python or whatever to parse an exported JSON file, then download all the images (this is the quicker solution, but doesn't help with ads going stale in the vault in adn).

Let's discuss before you start (I'm not sure that's the right plan) -- meantime, take a look at the adscape paper...
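
To illustrate "10 at a time", here is a rough sketch using Python's `concurrent.futures`. The names `visit_all` and `visit` are placeholders (`visit` stands for whatever loads a page, which is not defined here), and this is not how OpenWPM or session-bench implement it:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def visit_all(urls, visit, max_workers=10):
    """Call visit(url) for every URL, at most max_workers at a time.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one failing page does not stop the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(visit, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[url] = exc
    return results
```

Setting `max_workers=1` gives the one-page-at-a-time behavior needed when simulating a real user profile.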

dhowe commented 7 years ago

From @cqx931 on November 20, 2016 3:55

Using the script that I pushed to AdCollector, I tried two slightly different approaches on two categories as an experiment:

1. Keywords: Finance, Business, credit card, mortgage, loan. For this I used the approach we discussed in school: first doing a few Google searches with the above keywords, then going through the url list I fetched from Easylist. I could see some slight influence of the Google searches on the ads gathered, but this is not very obvious when they are presented as an image collage.

2. Keywords: Game, Entertainment, Movie, tv, video. In this experiment, I filtered the url list from Easylist with these keywords and combined it with top-100 site lists found online to create a specific url list to go through. This time, the images are obviously more related to the topic, but the number of images is only half of what I got from the first experiment. The topic should have an influence on this; at the same time, the chance of getting visual ads from top-100 site lists is also lower.
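
The keyword filtering in the second experiment can be sketched as a plain substring match over the url list. This is only an illustration of the idea, not the script pushed to AdCollector, and `filter_by_keywords` is a made-up name:

```python
def filter_by_keywords(urls, keywords):
    """Keep URLs that contain at least one keyword (case-insensitive substring match)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [url for url in urls if any(k in url.lower() for k in lowered)]
```

Substring matching is crude: a short keyword like "tv" will also match unrelated urls that happen to contain those letters, so the result still needs a manual look.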

For both approaches I ran 8 parallel sessions with 100 urls each. The running time was around 1 hour.

Problem to solve: 1. Sometimes a website still gets stuck and the session can't get through the whole test.

As the next step, I can work on the script to read and combine several JSON files as well as download the images. Meanwhile I can also try to do some analysis of the result from the JSON file.
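
A rough sketch of the combining step. It ASSUMES each exported file is a JSON array of ad objects, with the image URL stored at `contentData.src`; both are guesses that should be checked against a real export before use:

```python
import json


def load_ads(path):
    """Load one exported file; ASSUMES it contains a JSON array of ad objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def image_url(ad):
    """ASSUMED layout: the image URL lives at ad['contentData']['src']."""
    return (ad.get("contentData") or {}).get("src")


def combine_ads(ad_lists):
    """Merge several lists of ads, dropping ads without an image URL and duplicates."""
    seen, combined = set(), []
    for ads in ad_lists:
        for ad in ads:
            url = image_url(ad)
            if url and url not in seen:
                seen.add(url)
                combined.append(ad)
    return combined
```

Usage would be something like `combine_ads(load_ads(p) for p in paths)`. Deduplicating on the image URL keeps only the first ad seen for each image.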

Another possible direction is to browse more pages within each website to increase the number of ads collected from one website.

dhowe commented 7 years ago

Questions:

browser settings (flash, cookies, etc.) -- is it all default?
more info about how the script gets 'stuck'

Next tries:

with a google login
following links from original pages (see example)

Scripts:

Use jq to a) combine JSON, and b) generate a list of image URLs (in a text file, one per line)
Script that accepts the text file above, and does parallel image collection (python?)
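
The second script could look roughly like this, assuming the text file holds one image URL per line. The function names, the numbered output filenames, and the default of 8 workers are illustrative choices only:

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def read_url_list(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download(url, out_dir, index, timeout=30):
    """Fetch one image; return the saved path, or None if the fetch failed."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".img"
    target = os.path.join(out_dir, f"{index:05d}{ext}")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError):  # network/HTTP errors, timeouts, malformed URLs
        return None
    with open(target, "wb") as out:
        out.write(data)
    return target


def download_all(list_path, out_dir, max_workers=8):
    """Download every URL in the list file in parallel; results keep input order."""
    os.makedirs(out_dir, exist_ok=True)
    urls = read_url_list(list_path)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, url, out_dir, i) for i, url in enumerate(urls)]
        return [future.result() for future in futures]
```

Failed downloads come back as `None` rather than raising, so stale ad URLs do not abort the run.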

dhowe commented 7 years ago

From @cqx931 on November 20, 2016 10:13

1. Browser settings: All default: flash enabled, 3rd-party cookies allowed.
2. Browser gets 'stuck': Selenium stops on certain pages and doesn't go on to the next test. Chrome sometimes pops up "do you want to leave the page", which could be one reason why it gets stuck.

dhowe commented 7 years ago

From @cqx931 on November 21, 2016 4:16

These are the two profiles I have tested with so far. Both locations are set to HK, so these two could be used for the "asian male" model. Do we have any preferred topics for the two models for the exhibition?

dhowe commented 7 years ago

Can you post the full list of topics somewhere... ? I think these selections are a bit too generic (I work with models that have very specific interests)

dhowe commented 7 years ago

From @cqx931 on November 21, 2016 5:41

Here is the full list: https://support.google.com/adwords/answer/156178?hl=en

dhowe commented 7 years ago

Male I 40 (HK): Porn, Babies/Nursery, Fine Art
Male II, 28 (HK): Dating, Games, Exotic Travel, Extreme Sports
Male III 24 (HK): Credit Cards, Loans, Cars
Female I, 53 (US): Right-wing Politics (Alt-right, Trump, Brexit, Refugees, Immigration, Nationalism)
Female II, 32 (Europe/Netherlands): Literature, Film, Marriage, Pregnant, Education
Female III, 50 (Europe/Netherlands): Job Search, Academia, Environment, Credit-Cards

dhowe commented 7 years ago

From @cqx931 on November 21, 2016 10:18

Here are two detailed preference settings and url lists:

Female, 32 (Europe/Netherlands)

http://www.100topsportsites.com/?cat=Women

Literature
category::Books & Literature
category::Books & Literature>Book Retailers
category::Books & Literature>Children's Literature
category::Books & Literature>E-Books
category::Books & Literature>Literary Classics
category::Books & Literature>Poetry

links: http://www.100topsportsites.com/?cat=Libraries

Film
category::Arts & Entertainment>Events & Listings>Film Festivals
category::Arts & Entertainment>Movies>Classic Films
category::Arts & Entertainment>Movies>Classic Films>Silent Films
category::Arts & Entertainment>Movies>Cult & Indie Films
category::Arts & Entertainment>Movies>Documentary Films
category::Arts & Entertainment>Movies>Drama Films
category::Arts & Entertainment>Movies>Family Films
category::Arts & Entertainment>Movies>Movie Reference>Movie Reviews & Previews
category::Arts & Entertainment>Movies>Musical Films
category::Arts & Entertainment>Movies>Romance Films

http://www.100topsportsites.com/?cat=Movies http://www.100topsportsites.com/?cat=Movie%20Reviews

Marriage
category::People & Society>Family & Relationships>Marriage
category::People & Society>Family & Relationships>Troubled Relationships

No Particularly good idea of the website list for this one. One guess: https://www.buzzfeed.com/erinlarosa/22-websites-that-make-wedding-planning-so-much-easier?utm_term=.shzRDGw3y#.hdX4MyDma

Pregnant
category::People & Society>Family & Relationships>Family>Parenting>Pregnancy & Maternity

list: http://www.kidsinthehouse.com/forum-topic/top-100-parenting-blogs-websites http://www.100topsportsites.com/?cat=Parenting

Education
category::Jobs & Education>Education>Vocational & Continuing Education
category::Jobs & Education>Education>Vocational & Continuing Education>Computer Education
category::Jobs & Education>Education>Early Childhood Education
category::Jobs & Education>Education>Homeschooling

http://www.100topsportsites.com/?cat=Education

Male I 40 (HK): Porn, Babies/Nursery, Fine Art

Porn
106 site list filtered from Easylist with the keyword “porn”
http://mypornbible.com/

Babies/Nursery
category::People & Society>Family & Relationships>Family>Baby & Pet Names
category::People & Society>Family & Relationships>Family>Parenting
category::People & Society>Family & Relationships>Family>Parenting>Babies & Toddlers
category::People & Society>Family & Relationships>Family>Parenting>Babies & Toddlers>Baby Care & Hygiene
category::People & Society>Family & Relationships>Family>Parenting>Child Care

list: http://www.kidsinthehouse.com/forum-topic/top-100-parenting-blogs-websites http://www.100topsportsites.com/?cat=Parenting

Fine Art
category::Arts & Entertainment>Visual Art & Design>Painting
category::Arts & Entertainment>Visual Art & Design>Photographic & Digital Arts
category::Arts & Entertainment>Performing Arts

list: http://knowledgelover.com/interesting-websites-for-art-lovers/ http://www.100topsportsites.com/?cat=Art

dhowe commented 7 years ago

From @cqx931 on November 21, 2016 14:19

Female III, 50 (Europe/Netherlands): Job Search, Academia, Environment, Credit-Cards

This profile looks more like a 20-something undergraduate/graduate student to me?

I am imagining an older academic who has recently lost their job

cqx931 commented 7 years ago

Could you send me the details about proxy before the weekend? I think I need to start testing on the proxy during the weekend.

dhowe commented 7 years ago

Well, you can specify a proxy via nightwatch, but all you really need to do is setup chrome (or globally for the OS) to use a proxy and then nightwatch will use it as well. Or do you mean info on a specific proxy?

cqx931 commented 7 years ago

So by clicking within each website from a narrowed-down list (narrowed using the previous ad collection, which went through the whole website list of the corresponding genre), I was able to collect the following numbers of ads:

Female II, 32 (Europe/Netherlands): Literature: 110, Film: 141, Marriage: 100, Pregnant/Parenting: 150, Education: 65. Total: 566

Male I 40 (HK): Porn: 151, Babies/Nursery: 132, Fine Art: 87. Total: 370

Within each category the ads are very closely related to the topic (70-90%). I can show you the images tomorrow if you will be at school.

For the male profile, could we add one or two more categories or do you prefer to keep it more concentrated? It's harder to get the same number of ads with fewer categories.

How many ads could we use from your collection of ads from ad servers? To keep increasing the number of ads, I can also try the following approaches:

1. Improve the website lists.
2. Run the same collection process again later and combine the image collections.
3. Increase the number of links visited per site. Currently it is 20. The gain from increasing this varies among sites. For sites with a wide variety of ads (e.g. nytimes) this is very useful.
4. Run profiles on news sites (if the final ad collection doesn't need to be so closely related to the topics).

*PS: Currently the two profiles overlap a bit in their lists of parenting websites. Some of the images are the same, but since the profiles are from different countries and of different genders, there are also images that are unique to each profile.

dhowe commented 7 years ago

Sounds great. Indeed, let's go over it tomorrow...

janetthecoder commented 6 years ago

Hi Dhowe I'd really like to interact with you. I wanted to conduct a few experiments with how websites profile and advertise to people.

Could we interact somehow? I would love to hear your views.

dhowe / AdCollector

Automate large-scale ad collection by topic, profile, demographic #5
