borowiak / pwa-technologies

Automatically exported from code.google.com/p/pwa-technologies

use the Google Search API to quickly obtain lists of PT sites for seeds #90

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Problems with using TLD lists for seeds:

- Second-level domains are used for many purposes other than hosting
websites, such as email servers or restricted-access services. It is
also not rare for them to be used for spam.

- There is a time gap between registering a domain and actually creating
a site for it. Many domains are registered and never come to reference a
site, or just present an "Under construction" page. This delays the
crawl and causes low-quality pages to be stored, wasting web archive
resources.

- The list of domains only includes second-level names, that is,
site.pt. However, many sites are referenced by deeper name levels, such
as subsite.site.pt, and cannot be obtained from the domain registry. For
instance, we crawl a large blog platform with the domain blogs.sapo.pt
(third-level domain) that hosts thousands of blogs at deeper levels, such
as blabla.blogs.sapo.pt. None of these blogs can be obtained from the
domain list, and they are also not linked from the second-level domain
sapo.pt.
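
To make the third point concrete, here is a minimal Python sketch. It assumes the registry list contains only names registered directly under .pt, and the helper names (`label_depth`, `hosts_missing_from_registry`) are hypothetical, not part of the crawler:

```python
def label_depth(host: str) -> int:
    """Count the dot-separated labels in a hostname, e.g. 'site.pt' has 2."""
    return len(host.strip(".").lower().split("."))


def hosts_missing_from_registry(hosts, registry_domains):
    """Return the hosts that a registry list of domains cannot supply as seeds."""
    registry = {d.lower() for d in registry_domains}
    return sorted(h.lower() for h in hosts if h.lower() not in registry)


# Hostnames taken from the example in this issue.
crawled = ["sapo.pt", "blogs.sapo.pt", "blabla.blogs.sapo.pt"]
registry = ["sapo.pt"]  # assumption: the registry lists second-level names only

missing = hosts_missing_from_registry(crawled, registry)
print(missing)
```

Any host with a label depth above 2 falls into the missing set, which is why the blog hosts need another seed source.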

Complementary action:

Use the Google Search API to quickly obtain lists of sites from .PT:

https://www.google.pt/?gfe_rd=cr&ei=rlVbVP2pPOKkiAad34GwCg#q=site:.pt

Google does part of the selection work for us by presenting an ordered
list of the most relevant .PT sites and removing most web spam sites.
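
Whatever search provider is used, the result URLs still have to be reduced to a list of seed hosts. A rough sketch of that step follows; the `seed_hosts` helper and the sample URLs are hypothetical, and real results may need extra normalisation:

```python
from urllib.parse import urlsplit


def seed_hosts(result_urls, suffix=".pt"):
    """Collect distinct hostnames ending in `suffix`, keeping result order."""
    hosts = []
    for url in result_urls:
        # urlsplit lowercases the hostname; it is None for malformed URLs.
        host = urlsplit(url).hostname or ""
        if host.endswith(suffix) and host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical search results, most relevant first.
results = [
    "http://blabla.blogs.sapo.pt/post/1",
    "http://blabla.blogs.sapo.pt/post/2",
    "https://www.example.com/",
    "http://Site.PT/",
]
seeds = seed_hosts(results)
print(seeds)
```

Keeping the original order preserves the relevance ranking that the search engine already computed.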

Original issue reported on code.google.com by danielco...@gmail.com on 6 Nov 2014 at 11:31

GoogleCodeExporter commented 9 years ago
The Google Search API has been deprecated since 2010. September 29, 2014 was
its last day of operation.

To replace this service, Google launched the Custom Search Engine.
It is possible to search the whole web with it, but it has some limitations and
the retrieved results are not exactly the same as those of the Google search engine.

The free tier of the API only lets us retrieve 100 results per query and make
100 queries/day. The problem is that the query site:.pt returns more than 100
results, and Google does not let us retrieve results at an index beyond 100
(because of the 100-result limit).
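
A sketch of what paging under that cap looks like. It only builds the request URLs and sends nothing. The endpoint and the `key`, `cx`, `q`, `start` and `num` parameters reflect my understanding of the Custom Search JSON API, and the page size of 10 is an assumption, so check both against the current documentation:

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"  # assumed endpoint
PAGE_SIZE = 10      # assumed maximum number of results per request
MAX_RESULTS = 100   # the result cap described above


def page_urls(query, api_key, engine_id):
    """Build one request URL per result page, stopping at the 100-result cap."""
    urls = []
    for start in range(1, MAX_RESULTS + 1, PAGE_SIZE):
        params = {
            "key": api_key,     # placeholder credential
            "cx": engine_id,    # placeholder search engine id
            "q": query,
            "start": start,     # 1-based index of the first result on the page
            "num": PAGE_SIZE,
        }
        urls.append(f"{ENDPOINT}?{urlencode(params)}")
    return urls


urls = page_urls("site:.pt", "API_KEY", "ENGINE_ID")
print(len(urls), "requests to reach the cap")
```

If every page counts as one query against the daily quota, a single search term consumes 10 of the 100 free queries, and anything past result 100 stays out of reach.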

At the moment, this free API is good for getting relevant sites on a specific
topic, but not for such a broad scope.

We can try other search providers.

Original comment by daniel.b...@gmail.com on 18 Dec 2014 at 11:15

GoogleCodeExporter commented 9 years ago
Not possible with the current Google API limitations.

Original comment by daniel.b...@gmail.com on 18 Dec 2014 at 1:46