contao / core

Contao 3 → see contao/contao for Contao 4
GNU Lesser General Public License v3.0

The ability to disable displaying features if a visitor is a Search engine bot #532

Closed ghost closed 12 years ago

ghost commented 12 years ago

For example: the built-in PDF generator and the custom one I wrote. It would be nice to not generate the link if the visitor is a bot, to conserve server load. In my case there is no need for Google to generate a PDF version of the site page.

Also, from that it would be nice if the core had a function like IsBot to use in extensions:


function isBot()
{
  $botlist = array("Teoma", "alexa", "froogle", "inktomi", "looksmart", "URL_Spider_SQL", "Firefly", "NationalDirectory", "Ask Jeeves", "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot", "crawler", "www.galaxy.com", "Googlebot", "Scooter", "Slurp", "appie", "FAST", "WebBug", "Spade", "ZyBorg", "rabaz");

  // ereg() is deprecated and $HTTP_USER_AGENT depends on register_globals;
  // use a case-insensitive stripos() check on $_SERVER['HTTP_USER_AGENT'] instead
  $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  foreach ($botlist as $bot)
  {
    if (stripos($userAgent, $bot) !== false)
    {
      return true;
    }
  }

  return false;
}

--- Originally created by bumforaliving on February 19th, 2009, at 04:40pm (ID 532)

leofeyer commented 12 years ago

Generating different output depending on whether the visitor is a search engine is known as cloaking, and it is one of the worst things you can do - unless you are trying to get kicked out of the Google index. There is no way we will add this to the core!

--- Originally created on February 19th, 2009, at 05:57pm

ghost commented 12 years ago

Wow, in no way did I mean it to do that! I can see how such a feature could be abused, though.

The PDF generator consumes a lot of CPU time and I expect to have a TON of news items as time goes on. I have the PDF link on the news viewer. Am I being over concerned or can this cause serious server lag when a bot is on the site?

--- Originally created by bumforaliving on February 20th, 2009, at 03:53pm

leofeyer commented 12 years ago

Why don't you just modify the template and add a rel="nofollow" to the PDF links?
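The suggested template change could look something like this (a sketch only; the exact template file and variables depend on the Contao version and module, and the href and $this->id placeholders here are assumptions, not the actual template code):

```html
<!-- e.g. in a news reader template: add rel="nofollow" to the PDF link -->
<a href="<?php echo $this->link; ?>?html2pdf=<?php echo $this->id; ?>" rel="nofollow">Download as PDF</a>
```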

--- Originally created on February 20th, 2009, at 04:35pm

ghost commented 12 years ago

Because that would be too simple of a way to do it? =)

Did not know of that. Researching now.

Thanks Leo!

--- Originally created by bumforaliving on February 20th, 2009, at 05:10pm

ghost commented 12 years ago

From what I am finding, it appears that rel="nofollow" does not stop the search engines from crawling links.

From here: http://en.wikipedia.org/wiki/Nofollow

"The nofollow attribute value is not meant for blocking access to content, or for preventing content to be indexed by search engines. The proper methods for blocking search engine spiders to access content on a website or for preventing them to include the content of a page in their index are the Robots Exclusion Standard (robots.txt) for blocking access and on-page Meta Elements that are designed to specify on an individual page level what a search engine spider should or should not do with the content of the crawled page."

Is this something the bots do anyway?

Meta elements, as I understand it, would apply nofollow to all links on a page. And with robots.txt, since wildcards are not supported, I'm not sure how I could block (or reformat the URL so that I could block) something like:

http://tl/news-reader/items/test.165.html?html2pdf=45
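For what it's worth, Google's crawler (and several other major bots) do support "*" and "$" wildcards in robots.txt as an extension to the original Robots Exclusion Standard, so a URL pattern like the one above could be excluded with something along these lines (a sketch; whether a given bot honors it depends on that bot's wildcard support):

```
User-agent: Googlebot
Disallow: /*?html2pdf=
```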

--- Originally created by bumforaliving on February 20th, 2009, at 05:40pm

leofeyer commented 12 years ago

--- Originally closed on February 19th, 2009, at 05:57pm