Web Crawler
-----
1. Deep indexing and shallow indexing (point 11 is similar)
a) Deep indexing aims to fetch more pages, including pages that sit deeper in
the site structure. During deep indexing the robots are able to follow a large
number of links in order to reach all of the content. This kind of indexing is
performed less often than shallow indexing.
b) Shallow indexing aims to visit the most popular and most frequently updated
pages, or the documents with the most backlinks. This may be the home page, a
news page, or a page with a popular tool. With shallow indexing the search
engine tries to keep current versions of popular documents in its index. The
robot does not follow many links and does not check the less important parts
of the site. This type of indexing happens more often than deep indexing.
(A small configuration sketch contrasting the two modes follows below.)
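A minimal Java sketch of the idea that the two modes typically differ only in
configuration: crawl depth, how many outlinks are followed per page, and how
often the crawl is repeated. All names and numbers are illustrative and not
taken from any particular crawler.

// Hypothetical configuration sketch: deep vs. shallow crawl profiles.
public class CrawlProfile {
    final int maxDepth;          // how deep into the site structure to go
    final int maxLinksPerPage;   // how many outlinks to follow from each page
    final int revisitDays;       // how often the crawl is repeated

    CrawlProfile(int maxDepth, int maxLinksPerPage, int revisitDays) {
        this.maxDepth = maxDepth;
        this.maxLinksPerPage = maxLinksPerPage;
        this.revisitDays = revisitDays;
    }

    // Deep crawl: follow many links, reach deep pages, run rarely.
    static CrawlProfile deep()    { return new CrawlProfile(10, Integer.MAX_VALUE, 30); }

    // Shallow crawl: only the most popular / most linked pages, run often.
    static CrawlProfile shallow() { return new CrawlProfile(2, 20, 1); }
}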
-----
2. The crawl tool – a front end to the lower-level tools:
a)
1. Create a new WebDB (admin db -create)
2. Inject the seed URLs into the WebDB (inject)
3. Generate a list of page URLs to fetch from the WebDB (generate)
4. Fetch the content of the pages at those URLs (fetch)
5. Update the WebDB with links from the fetched pages (updatedb)
6. Repeat steps 3-5 until the requested depth is reached.
7. Update the segments from the WebDB (updatesegs)
8. Index the fetched pages (index)
9. Eliminate duplicates (content and URLs) from the index (dedup)
10. Merge the indexes into one large index for searching (merge)
(A sketch of driving this loop from Java follows below.)
b) Multithreading (another source):
http://mtz.fc.pl/crawler/crawler1.jpg
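A sketch of driving the generate/fetch/updatedb cycle (steps 3-6) from Java by
shelling out to the Nutch command-line tools. The sub-command names mirror the
list above, but the exact bin/nutch arguments and the paths ("db", "segments",
"seeds") are placeholders that differ between Nutch versions, so treat this as
an illustration of the loop, not a working script.

import java.io.IOException;

// Illustrative only: repeats generate -> fetch -> updatedb up to a given
// depth, as in steps 3-6 above. Check all arguments against your Nutch version.
public class CrawlLoop {
    static void run(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        int depth = 3;                                        // step 6: requested crawl depth
        run("bin/nutch", "inject", "db", "seeds");            // step 2 (placeholder args)
        for (int i = 0; i < depth; i++) {
            run("bin/nutch", "generate", "db", "segments");   // step 3
            run("bin/nutch", "fetch", "segments");            // step 4
            run("bin/nutch", "updatedb", "db", "segments");   // step 5
        }
        run("bin/nutch", "index", "segments");                // step 8
        run("bin/nutch", "dedup", "indexes");                 // step 9
        run("bin/nutch", "merge", "index", "indexes");        // step 10
    }
}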
-----
3. A simple example crawler from java.sun.com, with a description:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/WebCrawler.java
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
(Honestly, I tried to compile the code in several ways, but without success.)
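Since the Sun example would not compile for me, here is a minimal standalone
sketch of the article's core idea: fetch one page over HTTP and list the href
links found in it, using only the standard library. The regex-based link
extraction is deliberately simplistic.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: download a page and print the href targets found in it.
public class FetchAndExtract {
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL(args.length > 0 ? args[0] : "http://example.com/");
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        for (String link : extractLinks(html.toString())) {
            System.out.println(link);
        }
    }
}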
-----
4. Heritrix is a crawler that indexes, archives, and analyzes what is
available on the Internet.
https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
By default, Heritrix stores page sources in ARC files (not related to the
compression format of the same name). It can be configured to store content
the way the Wget crawler does, using the URL as the directory name and the
page resources as the file names.
Heritrix comes with several command-line tools:
htmlextractor - displays the links Heritrix would extract for a given URL
hoppath.pl - recreates the hop path (path of links) to the specified URL from a
completed crawl
manifest_bundle.pl - bundles up all resources referenced by a crawl manifest
file into an uncompressed or compressed tar ball
cmdline-jmxclient - enables command-line control of Heritrix
arcreader - extracts contents of ARC files (see above)
-----
5. Nutch (roughly a Heritrix with an integrated search engine)
http://nutch.apache.org/
Features:
Fetching and parsing are done separately by default, which reduces the risk of
an error corrupting the fetch or parse stage of a crawl with Nutch.
Plugins have been overhauled as a direct result of removing the legacy Lucene
dependency for indexing and search.
The set of plugins for processing various document types shipped with Nutch
has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft
Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are all now
parsed by the Tika plugin. The only parser plugins shipped with Nutch now are
Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
MapReduce
Distributed filesystem (via Hadoop)
Link-graph database
NTLM authentication
-----
6. WebSPHINX (latest release v0.5, July 8, 2002)
http://www.cs.cmu.edu/~rcm/websphinx/
It offers two parts:
- Workbench - a graphical user interface that lets you configure and control
the crawler. It is used for: presenting pages as a graph, saving pages to disk
for offline use, and collecting all page text that matches a given query; you
can also write your own code (Java or JavaScript) so that the crawler meets
your requirements.
- Class library - supports writing crawlers through the following features
(see the sketch after this list):
Multithreaded Web page retrieval in a simple application framework
An object model that explicitly represents pages and links
Support for reusable page content classifiers
Tolerant HTML parsing
Support for the robot exclusion standard
Pattern matching, including regular expressions, Unix shell wildcards, and HTML
tag expressions. Regular expressions are provided by the Apache jakarta-regexp
regular expression library.
Common HTML transformations, such as concatenating pages, saving pages to
disk, and renaming links
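The WebSPHINX documentation describes custom crawlers as subclasses of
websphinx.Crawler that override visit() and shouldVisit(). The outline below
follows that pattern, but I have not verified the exact method names against
the v0.5 jar, so treat every signature here as an assumption.

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

// Outline only: the subclass-and-override pattern is taken from the WebSPHINX
// docs, but visit, shouldVisit, setRoot and run are quoted from memory and
// not verified against the actual release.
public class TitlePrinter extends Crawler {
    public boolean shouldVisit(Link link) {
        return true;   // a real crawler would restrict this, e.g. to one host
    }

    public void visit(Page page) {
        System.out.println(page.getURL() + "  " + page.getTitle());
    }

    public static void main(String[] args) throws Exception {
        TitlePrinter crawler = new TitlePrinter();
        crawler.setRoot(new Link("http://example.com/"));   // seed URL (placeholder)
        crawler.run();
    }
}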
-----
7. Ex-Crawler (released on 2010-06-10)
Features:
http://ex-crawler.sourceforge.net/joomla/index.php/en/home
A crawler plus search engine: it has its own server, stores data in MySQL,
PostgreSQL, or MSSQL databases, supports (custom) plugins, and comes with a
graphical interface and templates.
-----
8. Crawler4j (Feb 2012)
http://code.google.com/p/crawler4j/
Supports multithreading; an extensive crawler that is nevertheless simple to
use. It can collect image links, URLs, statistics, and so on. (A basic usage
sketch follows below.)
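A sketch based on the basic example shown on the crawler4j project page:
subclass WebCrawler, override shouldVisit() and visit(), and start a
CrawlController with a number of threads. Class and method names are quoted
from memory of the 2012-era API, so check them against the project site.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Sketch of the crawler4j "basic crawler" pattern; names unverified.
public class MyCrawler extends WebCrawler {
    public boolean shouldVisit(WebURL url) {
        // Stay on one (placeholder) site.
        return url.getURL().toLowerCase().startsWith("http://example.com/");
    }

    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");                 // intermediate data folder
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://example.com/");
        controller.start(MyCrawler.class, 5);                       // 5 crawler threads
    }
}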
-----
9. More open source crawlers:
http://java-source.net/open-source/crawlers
-----
10. Example crawler pseudocode:
Ask the user to specify the starting URL and the file type the crawler should
look for.
Add the URL to the empty list of URLs to search.
While the list of URLs to search is not empty
{
    Take the first URL from the list of URLs.
    Mark this URL as already searched.
    If the URL protocol is not HTTP then
        skip it and go back to While.
    If a robots.txt file exists on the site and it includes a "Disallow"
    statement for this URL then
        skip it and go back to While.
    Open the URL.
    If the opened URL is not an HTML file then
        skip it and go back to While.
    Step through the HTML file.
    While the HTML text contains another link
    {
        If a robots.txt file exists on the linked site and it includes a
        "Disallow" statement then
            skip this link and continue with the next one.
        If the linked URL is an HTML file then
            If the URL is not marked as already searched then
                add it to the list of URLs to search.
        Else if the file is of the type the user requested then
            add it to the list of files found.
    }
}
Another example of pseudocode (a Java rendering follows after it):
Get the user's input: the starting URL and the desired
file type. Add the URL to the currently empty list of
URLs to search. While the list of URLs to search is
not empty,
{
Get the first URL in the list.
Move the URL to the list of URLs already searched.
Check the URL to make sure its protocol is HTTP
(if not, break out of the loop, back to "While").
See whether there's a robots.txt file at this site
that includes a "Disallow" statement.
(If so, break out of the loop, back to "While".)
Try to "open" the URL (that is, retrieve
that document From the Web).
If it's not an HTML file, break out of the loop,
back to "While."
Step through the HTML file. While the HTML text
contains another link,
{
Validate the link's URL and make sure robots are
allowed (just as in the outer loop).
If it's an HTML file,
If the URL isn't present in either the to-search
list or the already-searched list, add it to
the to-search list.
Else if it's the type of the file the user
requested,
Add it to the list of files found.
}
}
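To make the pseudocode concrete, here is a compact Java sketch of the same
outer/inner loop using only the standard library. The robots.txt handling is
deliberately naive (it refuses a site only on a blanket "Disallow: /") and the
link extraction is regex-based, so treat it as an illustration of the control
flow rather than a usable crawler.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String start = args.length > 0 ? args[0] : "http://example.com/";
        String wantedType = args.length > 1 ? args[1] : ".pdf";   // file type the user asked for

        Deque<String> toSearch = new ArrayDeque<>();              // list of URLs to search
        Set<String> searched = new HashSet<>();                   // URLs already searched
        List<String> found = new ArrayList<>();                   // files of the requested type
        toSearch.add(start);

        while (!toSearch.isEmpty()) {
            String url = toSearch.removeFirst();                  // take the first URL in the list
            if (!searched.add(url)) continue;                     // move it to the searched set
            if (!url.startsWith("http://") && !url.startsWith("https://")) continue;
            if (!robotsAllow(url)) continue;                      // "Disallow" -> back to While

            String html = download(url);                          // try to open the URL
            if (html == null) continue;                           // not HTML, or fetch failed

            Matcher m = HREF.matcher(html);                       // step through the HTML file
            while (m.find()) {
                String link = m.group(1);
                if (!link.startsWith("http")) continue;           // skip relative links in this sketch
                if (link.endsWith(wantedType)) {
                    found.add(link);                              // the type the user requested
                } else if (!searched.contains(link) && !toSearch.contains(link)) {
                    toSearch.add(link);                           // new page to visit later
                }
            }
        }
        found.forEach(System.out::println);
    }

    // Returns the page body if the server reports an HTML content type, else null.
    private static String download(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            String type = conn.getContentType();
            if (type == null || !type.contains("text/html")) return null;
            try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
                return s.useDelimiter("\\A").hasNext() ? s.next() : "";
            }
        } catch (IOException e) {
            return null;
        }
    }

    // Very rough robots.txt check: refuse the site only on a blanket "Disallow: /".
    private static boolean robotsAllow(String url) {
        try {
            URL u = new URL(url);
            URL robotsUrl = new URL(u.getProtocol(), u.getHost(), "/robots.txt");
            try (Scanner s = new Scanner(robotsUrl.openStream(), StandardCharsets.UTF_8.name())) {
                while (s.hasNextLine()) {
                    if (s.nextLine().replace(" ", "").equalsIgnoreCase("Disallow:/")) return false;
                }
            }
            return true;
        } catch (IOException e) {
            return true;   // no robots.txt (or unreachable): assume allowed
        }
    }
}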
-----
11. Algorithm types
a) Path-ascending crawling
We intend the crawler to download as many resources as possible from a
particular Web site. That way a crawler would ascend to every path in each URL
that it intends to crawl. For example, when given a seed URL of
http://foo.org/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
The advantage of path-ascending crawlers is that they are very effective in
finding isolated resources, or resources for which no inbound link would have
been found in regular crawling. (A small sketch follows below.)
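A small Java sketch of the path-ascending step for a single seed URL; for the
/a/b/page.html example above it produces /a/b/, /a/, and /.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// For http://foo.org/a/b/page.html this prints
// http://foo.org/a/b/, http://foo.org/a/ and http://foo.org/
public class PathAscend {
    static List<String> ancestors(String url) throws MalformedURLException {
        URL u = new URL(url);
        List<String> out = new ArrayList<>();
        String path = u.getPath();                    // e.g. /a/b/page.html
        int slash;
        while ((slash = path.lastIndexOf('/')) >= 0) {
            path = path.substring(0, slash);          // drop the last path segment
            out.add(u.getProtocol() + "://" + u.getHost() + path + "/");
            if (path.isEmpty()) break;                // reached the site root
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        ancestors("http://foo.org/a/b/page.html").forEach(System.out::println);
    }
}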
b) Focused crawling
The importance of a page to a crawler can also be expressed as a function of
the similarity of the page to a given query. With this strategy the crawler is
meant to download pages that are similar to each other, so it is called a
focused crawler or topical crawler.
The main problem in focused crawling is that, in the context of a Web crawler,
we would like to be able to predict the similarity of the text of a given page
to the query before actually downloading the page. A possible predictor is the
anchor text of links; another proposed solution is to use the complete content
of the pages already visited to infer the similarity between the driving query
and the pages that have not been visited yet. The performance of focused
crawling depends mostly on the richness of links in the specific topic being
searched, and focused crawling usually relies on a general Web search engine
to provide starting points. (A sketch of a similarity-ordered frontier follows
below.)
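A minimal, hypothetical sketch of the frontier used in focused crawling:
candidate URLs are ordered by a similarity score between the driving query and
the anchor text, so the most promising links are fetched first. The scoring is
a trivial word-overlap measure, purely to show where a real similarity model
would plug in; the URLs and anchor texts are made up.

import java.util.Arrays;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

public class FocusedFrontier {
    static class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    // Fraction of query words that also occur in the anchor text.
    static double similarity(String query, String anchorText) {
        Set<String> q = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Set<String> a = new HashSet<>(Arrays.asList(anchorText.toLowerCase().split("\\s+")));
        a.retainAll(q);
        return q.isEmpty() ? 0.0 : (double) a.size() / q.size();
    }

    public static void main(String[] args) {
        String query = "open source web crawler";
        PriorityQueue<Candidate> frontier =
                new PriorityQueue<>((x, y) -> Double.compare(y.score, x.score));   // best score first

        frontier.add(new Candidate("http://example.com/crawlers",
                similarity(query, "list of web crawler projects")));
        frontier.add(new Candidate("http://example.com/contact",
                similarity(query, "contact us")));

        while (!frontier.isEmpty()) {
            Candidate next = frontier.poll();
            System.out.printf("%.2f  %s%n", next.score, next.url);   // a real crawler would fetch next.url here
        }
    }
}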
-----
12. Useful tips for URL canonicalization (a sketch implementing several of
these rules follows the list):
* convert the protocol and hostname to lowercase. For example,
HTTP://www.UIOWA.edu is converted to http://www.uiowa.edu.
* remove the 'anchor' or 'reference' part of the URL. Hence,
http://myspiders.biz.uiowa.edu/faq.html#what is reduced to
http://myspiders.biz.uiowa.edu/faq.html.
* perform URL encoding for some commonly used characters such as '~'. This
prevents the crawler from treating http://dollar.biz.uiowa.edu/~pant/ as a
different URL from http://dollar.biz.uiowa.edu/%7Epant/.
* for some URLs, add a trailing '/'. http://dollar.biz.uiowa.edu and
http://dollar.biz.uiowa.edu/ must map to the same canonical form. The decision
to add a trailing '/' will require heuristics in many cases.
* use heuristics to recognize default Web pages. File names such as index.html
or index.htm may be removed from the URL on the assumption that they are the
default files. If that is true, they would be retrieved by simply using the
base URL.
* remove '..' and its parent directory from the URL path. Therefore, the URL
path /%7Epant/BizIntel/Seeds/../ODPSeeds.dat is reduced to
/%7Epant/BizIntel/ODPSeeds.dat.
* leave the port number in the URL unless it is port 80. As an alternative,
leave the port number in the URL and add port 80 when no port number is
specified.
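A hedged sketch of a canonicalizer implementing several of the rules above
(lowercase scheme and host, drop the fragment, strip the default port and
default file names, resolve '..' segments) with java.net.URI. The list of
default file names and the trailing-slash choice are illustrative heuristics.

import java.net.URI;
import java.net.URISyntaxException;

public class Canonicalizer {
    static String canonicalize(String url) throws URISyntaxException {
        URI u = new URI(url).normalize();                       // resolves "/a/../b"-style paths
        String scheme = u.getScheme().toLowerCase();            // HTTP -> http
        String host = u.getHost().toLowerCase();                // www.UIOWA.edu -> www.uiowa.edu
        int port = (u.getPort() == 80 || u.getPort() == -1) ? -1 : u.getPort();  // drop port 80
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        // Heuristic: strip well-known default documents.
        path = path.replaceAll("/(index|default)\\.html?$", "/");
        // The fragment ("#what") is intentionally dropped; the query string is kept.
        URI canon = new URI(scheme, null, host, port, path, u.getQuery(), null);
        return canon.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(canonicalize("HTTP://www.UIOWA.edu"));                       // http://www.uiowa.edu/
        System.out.println(canonicalize("http://myspiders.biz.uiowa.edu/faq.html#what"));
        System.out.println(canonicalize("http://dollar.biz.uiowa.edu:80/index.html"));  // http://dollar.biz.uiowa.edu/
    }
}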
Web crawlers typically identify themselves to a Web server by using the
User-agent field of an HTTP request. Web site administrators typically examine
their Web servers' log and use the user agent field to determine which crawlers
have visited the web server and how often. The user agent field may include a
URL where the Web site administrator may find out more information about the
crawler. Spambots and other malicious Web crawlers are unlikely to place
identifying information in the user agent field, or they may mask their
identity as a browser or other well-known crawler.
It is important for Web crawlers to identify themselves so that Web site
administrators can contact the owner if needed. In some cases, crawlers may be
accidentally trapped in a crawler trap or they may be overloading a Web server
with requests, and the owner needs to stop the crawler. Identification is also
useful for administrators that are interested in knowing when they may expect
their Web pages to be indexed by a particular search engine.
Original comment by m.zakrze...@gmail.com on 26 Feb 2012 at 4:45
http://www.cs.put.poznan.pl/mkadzinski/ezi/dzienne/lab10/Lab10.pdf
Original comment by m.zakrze...@gmail.com on 5 Mar 2012 at 1:51
Original issue reported on code.google.com by m.zakrze...@gmail.com on 25 Feb 2012 at 3:38