Web Crawler
-----
1. Deep indexing and shallow indexing (point 11 is similar)
a) Deep indexing aims to fetch more pages, including pages that sit deeper in
the site structure. During deep indexing the robots are able to follow a large
number of links in order to reach all of the content. This kind of indexing is
performed less often than shallow indexing.
b) Shallow indexing aims to visit the most popular and most frequently updated
pages, or the documents with the most backlinks. This may be the home page, a
news page, or a page with a popular tool. With shallow indexing the search
engine tries to keep current versions of popular documents in its index. The
robot does not follow many links and does not check the less important parts
of the site. This type of indexing happens more often than deep indexing.
(A small configuration sketch contrasting the two modes follows below.)
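A minimal Java sketch of the idea that the two modes typically differ only in
configuration: crawl depth, how many outlinks are followed per page, and how
often the crawl is repeated. All names and numbers are illustrative and not
taken from any particular crawler.

// Hypothetical configuration sketch: deep vs. shallow crawl profiles.
public class CrawlProfile {
    final int maxDepth;          // how deep into the site structure to go
    final int maxLinksPerPage;   // how many outlinks to follow from each page
    final int revisitDays;       // how often the crawl is repeated

    CrawlProfile(int maxDepth, int maxLinksPerPage, int revisitDays) {
        this.maxDepth = maxDepth;
        this.maxLinksPerPage = maxLinksPerPage;
        this.revisitDays = revisitDays;
    }

    // Deep crawl: follow many links, reach deep pages, run rarely.
    static CrawlProfile deep()    { return new CrawlProfile(10, Integer.MAX_VALUE, 30); }

    // Shallow crawl: only the most popular / most linked pages, run often.
    static CrawlProfile shallow() { return new CrawlProfile(2, 20, 1); }
}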
-----
2. The crawl tool – a front end to the lower-level tools:
a)
1. Create a new WebDB (admin db -create)
2. Inject the seed URLs into the WebDB (inject)
3. Generate a list of page URLs to fetch from the WebDB (generate)
4. Fetch the content of the pages at those URLs (fetch)
5. Update the WebDB with links from the fetched pages (updatedb)
6. Repeat steps 3-5 until the requested depth is reached.
7. Update the segments from the WebDB (updatesegs)
8. Index the fetched pages (index)
9. Eliminate duplicates (content and URLs) from the index (dedup)
10. Merge the indexes into one large index for searching (merge)
(A sketch of driving this loop from Java follows below.)
b) Multithreading (another source):
http://mtz.fc.pl/crawler/crawler1.jpg
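A sketch of driving the generate/fetch/updatedb cycle (steps 3-6) from Java by
shelling out to the Nutch command-line tools. The sub-command names mirror the
list above, but the exact bin/nutch arguments and the paths ("db", "segments",
"seeds") are placeholders that differ between Nutch versions, so treat this as
an illustration of the loop, not a working script.

import java.io.IOException;

// Illustrative only: repeats generate -> fetch -> updatedb up to a given
// depth, as in steps 3-6 above. Check all arguments against your Nutch version.
public class CrawlLoop {
    static void run(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        int depth = 3;                                        // step 6: requested crawl depth
        run("bin/nutch", "inject", "db", "seeds");            // step 2 (placeholder args)
        for (int i = 0; i < depth; i++) {
            run("bin/nutch", "generate", "db", "segments");   // step 3
            run("bin/nutch", "fetch", "segments");            // step 4
            run("bin/nutch", "updatedb", "db", "segments");   // step 5
        }
        run("bin/nutch", "index", "segments");                // step 8
        run("bin/nutch", "dedup", "indexes");                 // step 9
        run("bin/nutch", "merge", "index", "indexes");        // step 10
    }
}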
-----
3. A simple example crawler from java.sun.com, with a description:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/WebCrawler.java
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
(Honestly, I tried to compile the code in several ways, but without success.)
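Since the Sun example would not compile for me, here is a minimal standalone
sketch of the article's core idea: fetch one page over HTTP and list the href
links found in it, using only the standard library. The regex-based link
extraction is deliberately simplistic.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: download a page and print the href targets found in it.
public class FetchAndExtract {
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL(args.length > 0 ? args[0] : "http://example.com/");
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        for (String link : extractLinks(html.toString())) {
            System.out.println(link);
        }
    }
}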
-----
4. Heritrix is a crawler that indexes, archives, and analyzes what is
available on the Internet.
https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
By default, Heritrix stores page sources in ARC files (not related to the
compression format of the same name). It can be configured to store content
the way the Wget crawler does, using the URL as the directory name and the
page resources as the file names.
Heritrix comes with several command-line tools:
htmlextractor - displays the links Heritrix would extract for a given URL
hoppath.pl - recreates the hop path (path of links) to the specified URL from a
completed crawl
manifest_bundle.pl - bundles up all resources referenced by a crawl manifest
file into an uncompressed or compressed tar ball
cmdline-jmxclient - enables command-line control of Heritrix
arcreader - extracts contents of ARC files (see above)
-----
5. Nutch (roughly a Heritrix with an integrated search engine)
http://nutch.apache.org/
Features:
Fetching and parsing are done separately by default, which reduces the risk of
an error corrupting the fetch or parse stage of a crawl with Nutch.
Plugins have been overhauled as a direct result of removing the legacy Lucene
dependency for indexing and search.
The set of plugins for processing various document types shipped with Nutch
has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft
Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are all now
parsed by the Tika plugin. The only parser plugins shipped with Nutch now are
Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
MapReduce
Distributed filesystem (via Hadoop)
Link-graph database
NTLM authentication
-----
6. WebSPHINX (latest release v0.5, July 8, 2002)
http://www.cs.cmu.edu/~rcm/websphinx/
It offers two parts:
- Workbench - a graphical user interface that lets you configure and control
the crawler. It is used for: presenting pages as a graph, saving pages to disk
for offline use, and collecting all page text that matches a given query; you
can also write your own code (Java or JavaScript) so that the crawler meets
your requirements.
- Class library - supports writing crawlers through the following features
(see the sketch after this list):
Multithreaded Web page retrieval in a simple application framework
An object model that explicitly represents pages and links
Support for reusable page content classifiers
Tolerant HTML parsing
Support for the robot exclusion standard
Pattern matching, including regular expressions, Unix shell wildcards, and HTML
tag expressions. Regular expressions are provided by the Apache jakarta-regexp
regular expression library.
Common HTML transformations, such as concatenating pages, saving pages to
disk, and renaming links
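The WebSPHINX documentation describes custom crawlers as subclasses of
websphinx.Crawler that override visit() and shouldVisit(). The outline below
follows that pattern, but I have not verified the exact method names against
the v0.5 jar, so treat every signature here as an assumption.

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Outline only: the subclass-and-override pattern is taken from the WebSPHINX
// docs, but visit, shouldVisit, setRoot and run are quoted from memory and
// not verified against the actual release.
public class TitlePrinter extends Crawler {
    public boolean shouldVisit(Link link) {
        return true;   // a real crawler would restrict this, e.g. to one host
    }

    public void visit(Page page) {
        System.out.println(page.getURL() + "  " + page.getTitle());
    }

    public static void main(String[] args) throws Exception {
        TitlePrinter crawler = new TitlePrinter();
        crawler.setRoot(new Link("http://example.com/"));   // seed URL (placeholder)
        crawler.run();
    }
}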
-----
7. Ex-Crawler (released on 2010-06-10)
Features:
http://ex-crawler.sourceforge.net/joomla/index.php/en/home
A crawler plus search engine: it has its own server, stores data in MySQL,
PostgreSQL, or MSSQL databases, supports (custom) plugins, and comes with a
graphical interface and templates.
-----
8. Crawler4j (Feb 2012)
http://code.google.com/p/crawler4j/
Supports multithreading; an extensive crawler that is nevertheless simple to
use. It can collect image links, URLs, statistics, and so on. (A basic usage
sketch follows below.)
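A sketch based on the basic example shown on the crawler4j project page:
subclass WebCrawler, override shouldVisit() and visit(), and start a
CrawlController with a number of threads. Class and method names are quoted
from memory of the 2012-era API, so check them against the project site.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch of the crawler4j "basic crawler" pattern; names unverified.
public class MyCrawler extends WebCrawler {
    public boolean shouldVisit(WebURL url) {
        // Stay on one (placeholder) site.
        return url.getURL().toLowerCase().startsWith("http://example.com/");
    }

    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");                 // intermediate data folder
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://example.com/");
        controller.start(MyCrawler.class, 5);                       // 5 crawler threads
    }
}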
-----
9. More open source crawlers:
http://java-source.net/open-source/crawlers
-----
10. Example crawler pseudocode:
Ask the user to specify the starting URL and the file type the crawler should
look for.
Add the URL to the empty list of URLs to search.
While the list of URLs to search is not empty
{
    Take the first URL from the list of URLs.
    Mark this URL as already searched.
    If the URL protocol is not HTTP then
        skip it and go back to While.
    If a robots.txt file exists on the site and it includes a "Disallow"
    statement for this URL then
        skip it and go back to While.
    Open the URL.
    If the opened URL is not an HTML file then
        skip it and go back to While.
    Step through the HTML file.
    While the HTML text contains another link
    {
        If a robots.txt file exists on the linked site and it includes a
        "Disallow" statement then
            skip this link and continue with the next one.
        If the linked URL is an HTML file then
            If the URL is not marked as already searched then
                add it to the list of URLs to search.
        Else if the file is of the type the user requested then
            add it to the list of files found.
    }
}
Another example of pseudocode (a Java rendering follows after it):
Get the user's input: the starting URL and the desired
file type. Add the URL to the currently empty list of
URLs to search. While the list of URLs to search is
not empty,
{
Get the first URL in the list.
Move the URL to the list of URLs already searched.
Check the URL to make sure its protocol is HTTP
(if not, break out of the loop, back to "While").
See whether there's a robots.txt file at this site
that includes a "Disallow" statement.
(If so, break out of the loop, back to "While".)
Try to "open" the URL (that is, retrieve
that document From the Web).
If it's not an HTML file, break out of the loop,
back to "While."
Step through the HTML file. While the HTML text
contains another link,
{
Validate the link's URL and make sure robots are
allowed (just as in the outer loop).
If it's an HTML file,
If the URL isn't present in either the to-search
list or the already-searched list, add it to
the to-search list.
Else if it's the type of the file the user
requested,
Add it to the list of files found.
}
}
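To make the pseudocode concrete, here is a compact Java sketch of the same
outer/inner loop using only the standard library. The robots.txt handling is
deliberately naive (it refuses a site only on a blanket "Disallow: /") and the
link extraction is regex-based, so treat it as an illustration of the control
flow rather than a usable crawler.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String start = args.length > 0 ? args[0] : "http://example.com/";
        String wantedType = args.length > 1 ? args[1] : ".pdf";   // file type the user asked for

        Deque<String> toSearch = new ArrayDeque<>();              // list of URLs to search
        Set<String> searched = new HashSet<>();                   // URLs already searched
        List<String> found = new ArrayList<>();                   // files of the requested type
        toSearch.add(start);

        while (!toSearch.isEmpty()) {
            String url = toSearch.removeFirst();                  // take the first URL in the list
            if (!searched.add(url)) continue;                     // move it to the searched set
            if (!url.startsWith("http://") && !url.startsWith("https://")) continue;
            if (!robotsAllow(url)) continue;                      // "Disallow" -> back to While

            String html = download(url);                          // try to open the URL
            if (html == null) continue;                           // not HTML, or fetch failed

            Matcher m = HREF.matcher(html);                       // step through the HTML file
            while (m.find()) {
                String link = m.group(1);
                if (!link.startsWith("http")) continue;           // skip relative links in this sketch
                if (link.endsWith(wantedType)) {
                    found.add(link);                              // the type the user requested
                } else if (!searched.contains(link) && !toSearch.contains(link)) {
                    toSearch.add(link);                           // new page to visit later
                }
            }
        }
        found.forEach(System.out::println);
    }

    // Returns the page body if the server reports an HTML content type, else null.
    private static String download(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            String type = conn.getContentType();
            if (type == null || !type.contains("text/html")) return null;
            try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
                return s.useDelimiter("\\A").hasNext() ? s.next() : "";
            }
        } catch (IOException e) {
            return null;
        }
    }

    // Very rough robots.txt check: refuse the site only on a blanket "Disallow: /".
    private static boolean robotsAllow(String url) {
        try {
            URL u = new URL(url);
            URL robotsUrl = new URL(u.getProtocol(), u.getHost(), "/robots.txt");
            try (Scanner s = new Scanner(robotsUrl.openStream(), StandardCharsets.UTF_8.name())) {
                while (s.hasNextLine()) {
                    if (s.nextLine().replace(" ", "").equalsIgnoreCase("Disallow:/")) return false;
                }
            }
            return true;
        } catch (IOException e) {
            return true;   // no robots.txt (or unreachable): assume allowed
        }
    }
}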
-----
11. Algorithm types
a) Path-ascending crawling
We intend the crawler to download as many resources as possible from a
particular Web site. That way a crawler would ascend to every path in each URL
that it intends to crawl. For example, when given a seed URL of
http://foo.org/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
The advantage of path-ascending crawlers is that they are very effective in
finding isolated resources, or resources for which no inbound link would have
been found in regular crawling. (A small sketch follows below.)
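A small Java sketch of the path-ascending step for a single seed URL; for the
/a/b/page.html example above it produces /a/b/, /a/, and /.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// For http://foo.org/a/b/page.html this prints
// http://foo.org/a/b/, http://foo.org/a/ and http://foo.org/
public class PathAscend {
    static List<String> ancestors(String url) throws MalformedURLException {
        URL u = new URL(url);
        List<String> out = new ArrayList<>();
        String path = u.getPath();                    // e.g. /a/b/page.html
        int slash;
        while ((slash = path.lastIndexOf('/')) >= 0) {
            path = path.substring(0, slash);          // drop the last path segment
            out.add(u.getProtocol() + "://" + u.getHost() + path + "/");
            if (path.isEmpty()) break;                // reached the site root
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        ancestors("http://foo.org/a/b/page.html").forEach(System.out::println);
    }
}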
b) Focused crawling
The importance of a page to a crawler can also be expressed as a function of
the similarity of the page to a given query. With this strategy the crawler is
meant to download pages that are similar to each other, so it is called a
focused crawler or topical crawler.
The main problem in focused crawling is that, in the context of a Web crawler,
we would like to be able to predict the similarity of the text of a given page
to the query before actually downloading the page. A possible predictor is the
anchor text of links; another proposed solution is to use the complete content
of the pages already visited to infer the similarity between the driving query
and the pages that have not been visited yet. The performance of focused
crawling depends mostly on the richness of links in the specific topic being
searched, and focused crawling usually relies on a general Web search engine
to provide starting points. (A sketch of a similarity-ordered frontier follows
below.)
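A minimal, hypothetical sketch of the frontier used in focused crawling:
candidate URLs are ordered by a similarity score between the driving query and
the anchor text, so the most promising links are fetched first. The scoring is
a trivial word-overlap measure, purely to show where a real similarity model
would plug in; the URLs and anchor texts are made up.

import java.util.Arrays;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

public class FocusedFrontier {
    static class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    // Fraction of query words that also occur in the anchor text.
    static double similarity(String query, String anchorText) {
        Set<String> q = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Set<String> a = new HashSet<>(Arrays.asList(anchorText.toLowerCase().split("\\s+")));
        a.retainAll(q);
        return q.isEmpty() ? 0.0 : (double) a.size() / q.size();
    }

    public static void main(String[] args) {
        String query = "open source web crawler";
        PriorityQueue<Candidate> frontier =
                new PriorityQueue<>((x, y) -> Double.compare(y.score, x.score));   // best score first

        frontier.add(new Candidate("http://example.com/crawlers",
                similarity(query, "list of web crawler projects")));
        frontier.add(new Candidate("http://example.com/contact",
                similarity(query, "contact us")));

        while (!frontier.isEmpty()) {
            Candidate next = frontier.poll();
            System.out.printf("%.2f  %s%n", next.score, next.url);   // a real crawler would fetch next.url here
        }
    }
}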
-----
12. Useful tips for URL canonicalization (a sketch implementing several of
these rules follows the list):
* convert the protocol and hostname to lowercase. For example,
HTTP://www.UIOWA.edu is converted to http://www.uiowa.edu.
* remove the 'anchor' or 'reference' part of the URL. Hence,
http://myspiders.biz.uiowa.edu/faq.html#what is reduced to
http://myspiders.biz.uiowa.edu/faq.html.
* perform URL encoding for some commonly used characters such as '~'. This
prevents the crawler from treating http://dollar.biz.uiowa.edu/~pant/ as a
different URL from http://dollar.biz.uiowa.edu/%7Epant/.
* for some URLs, add a trailing '/'. http://dollar.biz.uiowa.edu and
http://dollar.biz.uiowa.edu/ must map to the same canonical form. The decision
to add a trailing '/' will require heuristics in many cases.
* use heuristics to recognize default Web pages. File names such as index.html
or index.htm may be removed from the URL on the assumption that they are the
default files. If that is true, they would be retrieved by simply using the
base URL.
* remove '..' and its parent directory from the URL path. Therefore, the URL
path /%7Epant/BizIntel/Seeds/../ODPSeeds.dat is reduced to
/%7Epant/BizIntel/ODPSeeds.dat.
* leave the port number in the URL unless it is port 80. As an alternative,
leave the port number in the URL and add port 80 when no port number is
specified.
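A hedged sketch of a canonicalizer implementing several of the rules above
(lowercase scheme and host, drop the fragment, strip the default port and
default file names, resolve '..' segments) with java.net.URI. The list of
default file names and the trailing-slash choice are illustrative heuristics.

import java.net.URI;
import java.net.URISyntaxException;

public class Canonicalizer {
    static String canonicalize(String url) throws URISyntaxException {
        URI u = new URI(url).normalize();                       // resolves "/a/../b"-style paths
        String scheme = u.getScheme().toLowerCase();            // HTTP -> http
        String host = u.getHost().toLowerCase();                // www.UIOWA.edu -> www.uiowa.edu
        int port = (u.getPort() == 80 || u.getPort() == -1) ? -1 : u.getPort();  // drop port 80
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        // Heuristic: strip well-known default documents.
        path = path.replaceAll("/(index|default)\\.html?$", "/");
        // The fragment ("#what") is intentionally dropped; the query string is kept.
        URI canon = new URI(scheme, null, host, port, path, u.getQuery(), null);
        return canon.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(canonicalize("HTTP://www.UIOWA.edu"));                       // http://www.uiowa.edu/
        System.out.println(canonicalize("http://myspiders.biz.uiowa.edu/faq.html#what"));
        System.out.println(canonicalize("http://dollar.biz.uiowa.edu:80/index.html"));  // http://dollar.biz.uiowa.edu/
    }
}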
Web crawlers typically identify themselves to a Web server by using the
User-agent field of an HTTP request. Web site administrators typically examine
their Web servers' log and use the user agent field to determine which crawlers
have visited the web server and how often. The user agent field may include a
URL where the Web site administrator may find out more information about the
crawler. Spambots and other malicious Web crawlers are unlikely to place
identifying information in the user agent field, or they may mask their
identity as a browser or other well-known crawler.
It is important for Web crawlers to identify themselves so that Web site
administrators can contact the owner if needed. In some cases, crawlers may be
accidentally trapped in a crawler trap or they may be overloading a Web server
with requests, and the owner needs to stop the crawler. Identification is also
useful for administrators that are interested in knowing when they may expect
their Web pages to be indexed by a particular search engine.
Original comment by m.zakrze...@gmail.com on 26 Feb 2012 at 4:45
http://www.cs.put.poznan.pl/mkadzinski/ezi/dzienne/lab10/Lab10.pdf
Original comment by m.zakrze...@gmail.com on 5 Mar 2012 at 1:51
Original issue reported on code.google.com by m.zakrze...@gmail.com on 25 Feb 2012 at 3:38