Closed andreladocruz closed 3 years ago
@andreladocruz thanks for getting in touch with this data, very informative.
Think we will need to do some research on this before adding any new user agents. FB_IAB
stands for Facebook In-App Browser,
so we don't want to mistakenly flag genuine visits from the in-app browser.
We have a large site with lots of user-agent data and have just taken a quick look. A lot of the agents we have tracked with the FB_IAB
string also send extra headers like this:
{"HTTP_X_FB_HTTP_ENGINE":"Liger","HTTP_X_FB_NET_HNI":"23410","HTTP_X_FB_SIM_HNI":"23410"}
Think it is something to do with Facebook prefetching websites.
See here for some more details http://inchoo.net/dev-talk/mitigating-facebook-x-fb-http-engine-liger/
@MaxGiting, what do you think? Didn't we look into this recently? Isn't this what triggered you to start working on #218?
Perhaps it's time to look into that again and get the prefetch checking stuff implemented?
@JayBizzle,
Thanks for the reply.
In my app, if the method isCrawler() returns false, I check if the user agent contains FBAV.
I found some docs about this:
https://buildtoship.com/filtering-facebook-search-spiders-bots-and-other-automated-requests-fb_iab/
If you filter the XLS with FB_IAB, you will find that the iOS devices do not show up.
That's why I used FBAV and, as explained in the link above, the number of requests dropped.
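The fallback check described above might look roughly like this. This is a sketch in Python purely for illustration (the library itself is PHP), and the function name and sample user-agent strings are made up for the example:

```python
# Hypothetical fallback check: when the crawler detector says "not a bot",
# look for Facebook app tokens in the user agent. FBAV appears on both
# Android and iOS, while FB_IAB is reportedly absent on iOS devices.
def is_facebook_app(user_agent: str) -> bool:
    return "FBAV" in user_agent or "FB_IAB" in user_agent

android_ua = ("Mozilla/5.0 (Linux; Android 7.0; SM-G930V) "
              "AppleWebKit/537.36 [FB_IAB/FB4A;FBAV/147.0.0.25.91;]")
ios_ua = ("Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) "
          "AppleWebKit/604.1.38 [FBAN/FBIOS;FBAV/147.0.0.46.81;]")

print(is_facebook_app(android_ua))  # True (matches FB_IAB and FBAV)
print(is_facebook_app(ios_ua))      # True (matches FBAV only)
```

Matching on FBAV catches both platforms, which is exactly why it can be over-aggressive: it matches real humans browsing inside the Facebook app, not just automated requests.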
Let's keep digging into it =)
Guys,
Just to let you know that filtering by FBAV was too aggressive.
I just switched to FB_IAB as explained in the article.
Let's keep watching.
I have seen FB use the header HTTP_X_PURPOSE with a value of preview. I have also seen other requests with a value of prefetch for HTTP_PURPOSE.
I think it is time to review #218. I think it will catch all FB preview/prefetching and possibly some other stuff as well.
Folks,
Filtering by FB_IAB is too aggressive too. =(
@MaxGiting, do you have any info about the header HTTP_X_PURPOSE?
I want to give it a try here.
@MaxGiting,
Just found this article from Facebook:
https://www.facebook.com/business/help/1514372351922333
To help mitigate this, Facebook has adopted the standard industry practice of including a header 'X-Purpose:preview' in these requests so that publishers and third-party, tag-based measurement solutions can distinguish between prefetched clicks and normal requests or clicks.
With that, I think we must filter out this kind of traffic =)
Another documentation about this:
http://inchoo.net/dev-talk/mitigating-facebook-x-fb-http-engine-liger/
Perhaps it's worth introducing a method called isPreviewCrawler() that checks for the existence of either an HTTP_PURPOSE or HTTP_X_PURPOSE header with a value of "preview", so that users can decide what to do in those instances themselves. One use case I quite like the idea of is being able to serve a "teaser" image/page to Facebook (to encourage a user to click through) :-)
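A minimal sketch of what such a method could check, assuming CGI-style HTTP_* server variables. Python here is purely illustrative (the real library is PHP, and no method with this name exists in it yet):

```python
# Hypothetical isPreviewCrawler(): true when either purpose header
# announces a "preview" request, as Facebook's prefetcher sends.
def is_preview_crawler(server: dict) -> bool:
    for key in ("HTTP_PURPOSE", "HTTP_X_PURPOSE"):
        if server.get(key, "").lower() == "preview":
            return True
    return False

print(is_preview_crawler({"HTTP_X_PURPOSE": "preview"}))       # True
print(is_preview_crawler({"HTTP_USER_AGENT": "Mozilla/5.0"}))  # False
```

Returning a boolean rather than acting on the header leaves the decision (block, serve a teaser page, skip analytics) to the caller.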
That obviously doesn't solve the philosophical question of what to do about FB_IAB - my two cents' worth is that since it represents a browser, why not promote it to one in Exclusions.php?
Thoughts?
Thanks for your input @gplumb
This is definitely still on our radar, but we just haven't had any time to put any effort into it yet.
I don't mind the idea of adding new methods, but I kinda wish from the start we had grouped bots into types so we could have done stuff like ->isSearchBot(), ->isScraper(), or things like that.
Unsure how tricky it would be now to go back through all regexes and categorise them.
How about we move FB_IAB into Exclusions.php for now (since it's a browser and should be ignored), and we can see about adding the ->isPreview(), ->isScraper(), etc. methods in another commit.
Other than ->isPreview(), what other categories of bot/crawler/agent would you suggest?
I just got one more thing...
Facebook ads robots...
These robots are used to validate new ads.
Looking into my logs I found some IPv6 addresses like this:
"2a03:2880:1030:5fc6:face:b00c:0:8000"
Love the vanity IP :-) Do you have any headers in your logs as well?
@gplumb, yes... as all my traffic passes through Cloudflare, some of the headers are from them.
{
  "data": {
    "cf-connecting-ip": ["2a03:2880:30:3fe9:face:b00c:0:8000"],
    "cookie": [""],
    "referer": ["https:\/\/l.facebook.com\/l.php?u=https%3A%2F%2Fclkdmg.site%2Fcampaign%2F8a3afac7-2a33-4c68-a23e-c5c7eb1e08f8%3Futm_campaign%3DTOP001%26utm_content%3DAD01&h=ATNikccUgGOCS_5h8UdFHOqY9vIshk_-7W-LJtvaJBE5PWSCKHzuRknbKdF8Rg9frdftNspzfCUzuTwGtPL-NJPjJpagobMlUdZxB6LlbKxgb_wy8j3TbX0aerZqPA"],
    "accept-language": ["en-US"],
    "accept": ["text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,image\/apng,*\/*;q=0.8"],
    "user-agent": ["Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/58.0.3029.110 Safari\/537.36 Edge\/16.16299"],
    "upgrade-insecure-requests": ["1"],
    "cf-visitor": ["{\"scheme\":\"https\"}"],
    "cf-ray": ["3f48d3bd0bd779a9-SEA"],
    "cf-ipcountry": ["US"],
    "accept-encoding": ["gzip"],
    "connection": ["upgrade"],
    "x-nginx-proxy": ["true"],
    "host": ["clkdmg.site"],
    "x-forwarded-proto": ["https"],
    "x-forwarded-for": ["2a03:2880:30:3fe9:face:b00c:0:8000, 162.158.106.44"],
    "x-real-ip": ["162.158.106.44"],
    "content-length": [""],
    "content-type": [""]
  }
}
This is one example.
Thanks for sharing this!
It's interesting that this bot uses a seemingly legitimate "user-agent" header. Any work to mark this as a bot will have to operate using the "x-forwarded-for" header (as this is more common in routers than "x-real-ip" - although YMMV).
There are two ways to do this - the "right" way - and the "better" way :-)
The hacky way is to add "x-forwarded-for" to Headers.php and add "2a03:2880:30:3fe9:face:b00c:0:8000" to Crawlers.php. You could add the IPv4 address, but the IPv6 address looks more conclusive to me (since there are plenty of those to go around, it's unlikely that this vanity address will be re-used by anyone else). Of course, this approach won't stop any malicious bot from pretending to be this one, but that's a different story.
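As a sketch of this hacky approach (Python for illustration only; the 2a03:2880::/32 prefix is an assumption inferred from the addresses posted in this thread, not an official Facebook range, so verify it before relying on it):

```python
import ipaddress

# Assumed Facebook IPv6 range, inferred from the face:b00c vanity
# addresses seen above -- check against Facebook's published ranges.
FB_NET = ipaddress.ip_network("2a03:2880::/32")

def from_facebook(x_forwarded_for: str) -> bool:
    # The first hop of X-Forwarded-For is the original client address.
    first_hop = x_forwarded_for.split(",")[0].strip()
    try:
        return ipaddress.ip_address(first_hop) in FB_NET
    except ValueError:
        # Not a parseable IP address at all.
        return False

print(from_facebook("2a03:2880:30:3fe9:face:b00c:0:8000, 162.158.106.44"))  # True
print(from_facebook("203.0.113.7"))                                         # False
```

Matching the whole prefix rather than one literal address survives Facebook rotating individual bot IPs, at the cost of trusting a header that any client can forge.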
The "proper" way would be to classify all the bots (see higher up in this thread), but that's going to take a bit longer to get done.
@JayBizzle Any thoughts on bot classification? I'd be happy to help you re-classify the existing crawlers...
Not put much thought into the different types of classification, as it is not a feature we require in the main app we use this library in.
We are open to suggestions/example PRs of how this could be incorporated! 👍
Hi friends,
First, sorry for not following the right process to add a new crawler. I'm a bit of a newbie in this world.
I just found that Facebook is using a new structure to crawl ads links.
I attached an XLS file with all the user agents I got here:
facebook-ads-crawlers.xlsx
Can someone help me to add it to the crawler list? =)