Letractively / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Combine filters and enhance filter implementation #7

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Currently we have 2 types of filters, namely the URL filter and the Server filter.
Both are not really required as separate mechanisms, since they essentially do
the same job of filtering out URL paths by matching specific parts.

Also, the implementation uses a regexp grammar and there is no clear
documentation on how to write filters. This should be replaced with a
pyparsing grammar for a more effective and error-free filter implementation.
The documentation should also be updated.
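
As a rough illustration of the pyparsing idea, here is a minimal sketch of a
grammar that splits a combined filter string such as
"-/images/*+/images/public/*" into signed terms. The grammar and names are
only an assumption about how such a parser might look, not the actual
implementation:

from pyparsing import CharsNotIn, Group, OneOrMore, Optional, oneOf

# Hypothetical grammar: a filter string is one or more signed path terms.
# An omitted sign defaults to "-" (exclude), per the convention described
# later in this thread.
sign = oneOf("+ -")
term = Group(Optional(sign, default="-") + CharsNotIn("+-"))
filter_grammar = OneOrMore(term)

print(filter_grammar.parseString("-/images/*+/images/public/*").asList())
# -> [['-', '/images/*'], ['+', '/images/public/*']]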

Original issue reported on code.google.com by abpil...@gmail.com on 25 Jun 2008 at 12:25

GoogleCodeExporter commented 8 years ago
Started work on this. Expected time to finish: 3-4 man-days. Not sure how much
time I will get on this this week.

Original comment by abpil...@gmail.com on 13 Jul 2008 at 8:35

GoogleCodeExporter commented 8 years ago
The filters will be changed to the following single filtering mechanism.
There will be a single <filter> element which will enclose the specific filters.

There will be two distinct types of filters, namely "urlfilter"
and "textfilter". The former will work on URLs and the latter
on the text content of pages.

The "urlfilter" can consist of three types of filters.

I. URL filters

1. A regular expression filter, which allows one to specify a regular
expression to filter out URLs. These behave just like ordinary regular
expressions.

2. A URL "path" filter, which allows one to specify parts of a URL as a filter
(a sketch of this matching appears after the list). This works as follows.

A path is any part of a URL: it could be a fragment of a URL, a complete URL,
or its beginning. It allows wildcards by using "*".

Example:

   -/images/private/

      Exclude any URL matching /images/private anywhere in the URL.

   -/images/*+/images/public/*

      Exclude any URL with /images/, but allow any URL which has
      /images/public in it. This would block the following URLs:

      http://www.foo.com/images/image1.jpg
      http://www.foo.com/images/image2.jpg
      http://www.foo.com/images/image3.jpg

      But it will allow the following URLs:

      http://www.foo.com/images/public/pub1.jpg
      http://www.foo.com/images/public/pub2.jpg

3. A file extension filter, which allows one to specify file extensions of URLs
as the basis for filtering.

This filter is quite simple: it just takes a list of file extensions to
be filtered.

Example:

   "-jpg,-png" or
   "-jpg -png"

This means: filter out URLs with the extensions .jpg and .png, which will
filter out most JPEG and PNG images. The "." is not required in the filter,
but a comma or space is required between subsequent extensions.

These filters are titled "regexp", "path" and "extension" respectively.
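
As an illustration of the path-filter semantics above, here is a minimal
Python sketch, with made-up function names, that treats "*" as a wildcard and
lets a matching "+" term override a matching "-" term:

import re

def _compile(part):
    # translate the "*" wildcard into ".*" and match anywhere in the URL
    return re.compile(re.escape(part).replace(r"\*", ".*"), re.IGNORECASE)

def split_terms(filter_string):
    # "-/images/*+/images/public/*" -> [("-", "/images/*"), ("+", "/images/public/*")]
    # assumes "+" and "-" appear only as term prefixes, not inside the paths themselves
    return [(sign or "-", _compile(part))
            for sign, part in re.findall(r"([+-]?)([^+-]+)", filter_string)]

def path_filter_excludes(url, filter_string):
    """Return True if the URL should be filtered out.

    A later "+" (include) term that matches overrides an earlier "-" (exclude)
    match. The "exclusive inclusion" case discussed later in the thread is not
    handled here.
    """
    excluded = False
    for sign, rx in split_terms(filter_string):
        if rx.search(url):
            excluded = (sign == "-")
    return excluded

# path_filter_excludes("http://www.foo.com/images/image1.jpg",
#                      "-/images/*+/images/public/*")      -> True
# path_filter_excludes("http://www.foo.com/images/public/pub1.jpg",
#                      "-/images/*+/images/public/*")      -> False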

All filters will follow the existing syntax of excluding with a "-" prefix
and including with a "+" prefix. Further, filters can be combined in the same
string by using the + and - characters.

If a - or + prefix is not found, the filter is assumed to specify an
exclusion, i.e. URLs matching the filter are filtered out. However,
regex filters will *not* support the + or - prefix, since it would be
difficult to parse such filters: + or - can be part of the regular
expression itself. Regex filters are hence always considered exclusion
filters (if you want a regex filter to act as an inclusion, invert the regex).

All filter values will be specified as the "value" attribute of the
elements. Filters also act cumulatively: a URL is checked against each of
the filters and action is taken accordingly. There is no way to specify a
relation between filters, i.e. something like (filterA and filterB)
or filterC. It is always an OR relation, i.e.

if match(filterA) or match(filterB) or match(filterC):
    # Take action
    ...

However, you can disable a filter by setting its "enable" attribute to zero.

Here are samples of filters. 

<urlfilter>
  <regexp value="(\s*\/banner\/)" enable="1" />
  <path value="-/images/*+/images/public/*" enable="1" />
  <extension value="-jpg,-png" enable="1" />
</urlfilter>

The above means that a URL is tried against each of the above filters.
If *any* of them match, the appropriate action is taken. Only if none match
does the URL pass through unfiltered.
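
Here is a minimal sketch of that cumulative OR check, reading the sample
<urlfilter> block with Python's xml.etree. It only illustrates the proposed
semantics (the <path> handling is left to the earlier sketch); it is not
HarvestMan's actual configuration code:

import re
import xml.etree.ElementTree as ET

URLFILTER_XML = r"""
<urlfilter>
  <regexp value="(\s*\/banner\/)" enable="1" />
  <path value="-/images/*+/images/public/*" enable="1" />
  <extension value="-jpg,-png" enable="1" />
</urlfilter>
"""

def url_is_filtered(url, xml_text=URLFILTER_XML):
    """Return True if any enabled child filter matches (OR semantics)."""
    for child in ET.fromstring(xml_text):
        if child.get("enable", "1") != "1":
            continue                      # disabled filters are skipped
        value = child.get("value", "")
        if child.tag == "regexp":
            # regex filters are always exclusion filters
            if re.search(value, url, re.IGNORECASE | re.UNICODE):
                return True
        elif child.tag == "extension":
            exts = {e.lstrip("-").lstrip(".")
                    for e in re.split(r"[,\s]+", value) if e}
            if url.lower().rsplit(".", 1)[-1] in exts:
                return True
        elif child.tag == "path":
            pass                          # see the path-filter sketch earlier
    return False

# url_is_filtered("http://www.foo.com/banner/ad.gif")  -> True
# url_is_filtered("http://www.foo.com/photo.png")      -> True
# url_is_filtered("http://www.foo.com/index.html")     -> False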

By default all filters are "crawl" filters, which means that URLs are
filtered out before they enter the crawl URL queue, immediately after
being parsed from a parent URL's web page. However, sometimes one wants
"download" filters, i.e. you would want to fetch and parse these URLs and
find their child URLs, and apply the filter only when downloading (saving) them.

This is possible by specifying the "crawl" attribute in the <urlfilter>
element. By default this is "0", but if it is set to "1", URLs are not
checked against the filters at crawl time, but only at the time of saving
them to disk.

However, this is possible only for the entire set of <urlfilter> children, not
for each individual element.

Example:

<urlfilter crawl="1" >
  <path value="-/images/*+/images/public/*" enable="1" />
</urlfilter>

The above filter means that checking is done only at file-saving
time and not during crawling.

NOTE: The crawl attribute makes sense only for web-page
(HTML) URLs, so don't use it if your filters
only match non-webpage URLs like images, documents (PDF) etc.
For non-webpage URLs, filters work only as download filters
anyway.
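
To make the crawl/download distinction concrete, here is a hypothetical
sketch of where the two kinds of checks would sit. The Filter class and
function names are invented for illustration only; they are not HarvestMan's
actual API (and, per the final comment in this thread, the crawl attribute
was ultimately not implemented):

class Filter:
    """Illustrative stand-in for one configured <urlfilter> group."""
    def __init__(self, matcher, download_only=False):
        self.matcher = matcher              # callable taking a URL, returning True on match
        self.download_only = download_only  # True when the group has crawl="1"

def should_enqueue(url, filters):
    # crawl filters (crawl="0", the default) run before the URL enters the crawl queue
    return not any(f.matcher(url) for f in filters if not f.download_only)

def should_save(url, filters):
    # download filters (crawl="1") run only when the fetched page is about to be saved
    return not any(f.matcher(url) for f in filters if f.download_only)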

NOTE: By default regex filters are UNICODE filters. Regex filters
take an additional attribute called "flags" which allows one to pass
additional flags from Python's re module to them. For example,
to specify a filter that matches according to the current locale,

  <regexp value="(\s*\/banner\/)" enable="1" flags="re.LOCALE" />

Also, all filters are case-insensitive by default. To enable
case-sensitivity, all filters take a "case" attribute, which is "0"
by default. To make a filter case-sensitive, set it to 1.

Example:

<urlfilter>
  <regexp value="(\s*\/banner\/)" enable="1" case="1" />
  <path value="-/images/*+/images/public/*" enable="1" case="1" />
  <extension value="-jpg,-png" enable="1" case="1" />
</urlfilter>

NOTE: There is no top-level "case" attribute, so case-sensitivity has to
be set individually for each filter if required.
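
A small sketch of how the "flags" and "case" attributes might translate into
re module flags. This is a guess at the intended interpretation (including
the assumption that several flags could be joined with "|"), not the actual
code:

import re

def build_re_flags(flags_attr="", case="0"):
    """Translate a filter's "flags" and "case" attributes into re flags."""
    flags = 0
    if case != "1":
        flags |= re.IGNORECASE              # filters are case-insensitive unless case="1"
    for name in flags_attr.split("|"):
        name = name.strip()
        if name.startswith("re."):
            flags |= getattr(re, name[3:])  # e.g. "re.MULTILINE" -> re.MULTILINE
    return flags

# re.compile(r"(Python\s+Perl)", build_re_flags(flags_attr="re.MULTILINE"))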

II. Text filters

Text-filters work on the URL's content, title, keywords and description.
These filters accept only regular expressions and are always
exclude filters, i.e. there is no way to specify an inclusion or exclusion
in the filter text by using a + or -. Any such logic has to be part
of the regular expression itself.

Text-filters are of two types.

1. Content-filter - This works only on the body of the web-page, excluding
the HTML tags. It is a regular expression filter. The element name is
"content".

Example:

   <content value="(Python\s+Perl)" flags="re.MULTILINE" />

2. Meta-filter - This works on the content of the tags "title", "keywords" and
"description" (as of now; more tags could be added later). It takes a "tags"
attribute which by default is set to "all", which means that the filter will
be applied to each of these tags, looking for a match. To restrict the filter
to specific tags, change the value of this attribute. It accepts 'OR'ing and
'AND'ing of tags by using "|" and "&" respectively. The element name is "meta".

Examples:

  <meta value="web-bot" />
  <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords|description"/>
  <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords&description"/>

The 2nd filter will apply if the regex matches the content of either the
"keywords" or "description" tags, but the last will apply only if it matches
both of them.

NOTE: For "keywords" the regexp is applied to every item in the keyword list
separately, not to the entire string. For example if the "keywords" is,

<meta name="keywords" content="crawler, spider, bot, web-bot, robot" />

Then the keywords regexp is applied to each item of the list
["crawler","spider","bot","web-bot","robot"] separately. If any match the filter
is assumed to have matched "keywords" tag.
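
Putting the "tags" handling and the per-item keyword matching together, here
is a rough Python sketch of how a <meta> filter could be evaluated. The
function name and the page_meta dict are illustrative assumptions, not the
actual implementation:

import re

META_TAGS = ("title", "keywords", "description")

def meta_filter_matches(regexp, page_meta, tags="all", flags=re.IGNORECASE):
    """page_meta is a dict like {"title": ..., "keywords": ..., "description": ...}."""

    def tag_matches(tag):
        text = page_meta.get(tag, "")
        if tag == "keywords":
            # the regexp is applied to each keyword separately, per the note above
            return any(re.search(regexp, item.strip(), flags)
                       for item in text.split(","))
        return bool(re.search(regexp, text, flags))

    if tags == "all":
        return any(tag_matches(t) for t in META_TAGS)
    if "&" in tags:                                        # 'AND'ing of tags
        return all(tag_matches(t) for t in tags.split("&"))
    return any(tag_matches(t) for t in tags.split("|"))    # 'OR'ing (or a single tag)

# meta_filter_matches(r"(web-bot|crawler|robot|web-crawler)",
#                     {"keywords": "crawler, spider, bot, web-bot, robot"},
#                     tags="keywords|description")         -> True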

NOTE: If you want separate regexps for these tags, do as follows.

 <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords"/>
 <meta value="internet|crawler|web-bot|web-crawler" tags="description"/>
 <meta value="harvestman|web-crawler" tags="title"/>

Here is a complete example of a text-filter.

<textfilter>
 <content value="(Python\s+Perl)" flags="re.MULTILINE" />
 <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords"/>
 <meta value="internet|crawler|web-bot|web-crawler" tags="description"/>
 <meta value="harvestman|web-crawler" tags="title"/>
</textfilter>

NOTE: Text-filters also accept the "case" attribute per filter.

That is all.

Please give feedback!

Original comment by abpil...@gmail.com on 21 Jul 2008 at 5:08

GoogleCodeExporter commented 8 years ago
For information on the existing filters, see
http://harvestmanontheweb.com/faq.html#toc39 . Some of the
new design has been borrowed from there, for example the path filter,
the extension filter, and the use of + and -.

Original comment by abpil...@gmail.com on 21 Jul 2008 at 5:11

GoogleCodeExporter commented 8 years ago
Maybe the logical conditions between the branches should be stated, to
increase the flexibility:

Example:

<textfilter>
   <OR>
     <content value="(Python\s+Perl)" flags="re.MULTILINE" />
     <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords"/>
     <AND>
       <meta value="internet|crawler|web-bot|web-crawler" tags="description"/>
       <meta value="harvestman|web-crawler" tags="title"/>
     </AND>
   </OR>
</textfilter>

Original comment by andrei.p...@gmail.com on 21 Jul 2008 at 9:51

GoogleCodeExporter commented 8 years ago
Hi andrei,

   It is a full OR condition; I am not supporting anything else. So,
<textfilter>
     <content value="(Python\s+Perl)" flags="re.MULTILINE" />
     <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords"/>
      <meta value="internet|crawler|web-bot|web-crawler" tags="description"/>
     <meta value="harvestman|web-crawler" tags="title"/>
</textfilter>

means,

content OR meta1 OR meta2 OR meta3.

Now, since the content filter works on the page content, the first OR does
not matter.

I also forgot to mention that the text filter will also support crawl/download
mode, so if you only want to skip writing and not crawling, set the "crawl" flag
of the filter to 1. 

i.e

<textfilter crawl="1">
     <content value="(Python\s+Perl)" flags="re.MULTILINE" />
     <meta value="(web-bot|crawler|robot|web-crawler)" tags="keywords"/>
      <meta value="internet|crawler|web-bot|web-crawler" tags="description"/>
     <meta value="harvestman|web-crawler" tags="title"/>
</textfilter>

The main difference between the urlfilter and the content filter is that URL
filters, when employed in the default mode (crawl==0), mean the page itself is
not fetched if the filter matches (if it does not match, or if it is an include
filter, the page is fetched and content filters are applied to it). Content
filters, however, are always applied to the downloaded content, which means
the content always has to be fetched. But by default (crawl==0) the page is
skipped if the filter matches the content - this means the page is not crawled
for child links and it is not saved either. However, if crawl==1, then only the
first happens: the page is crawled for child links, but not saved.
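
As a compact summary of that behavior, here is a small sketch of the
resulting decision table (purely illustrative, not crawler code):

def page_actions(url_filter_matches, text_filter_matches, textfilter_crawl_flag):
    """Return (fetch, crawl_children, save) for a web page URL."""
    if url_filter_matches:                   # urlfilter in default mode: never fetched
        return (False, False, False)
    if not text_filter_matches:              # fetched, and no text filter matched
        return (True, True, True)
    if textfilter_crawl_flag:                # <textfilter crawl="1">: crawl children, don't save
        return (True, True, False)
    return (True, False, False)              # default: fetched only, neither crawled nor saved

# page_actions(False, True, False) -> (True, False, False)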

In fact, the complete filtering mechanism proposed here could be implemented
in a custom crawler by overriding specific events and plugging in your regexes
there (see samples/searchingcrawler for an example), but the filter mechanism
provides a generic way of doing it without needing to write additional code.

Finally, a word about include patterns using the "+" prefix. This has to be
used with care and is most useful for specializing an exclude pattern. For
example,

 -images/*+images/public/*

allows one to override the default behavior of the exclude pattern
(-images/*) with a specific include pattern (+images/public/*), so that
everything which matches the first pattern is filtered out except what matches
the second pattern.

However, if no exclude pattern precedes an include pattern, it results in a
specific inclusion, i.e. with the effect that everything else is skipped. So,

  +/images/public/ will result in *all* URLs except those which match this
pattern being filtered out. Similarly,

  <extension value="-jp*+jpeg" />

will cause all URLs whose extension starts with "jp" to be filtered out except
those ending with ".jpeg". However,

  <extension value="+jpeg" />

has a different meaning - it assigns "exclusive inclusivity" to the "jpeg"
extension, which means *everything else* (including html) will be filtered out
and not saved.
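
Here is a sketch of how those extension-filter rules could be evaluated,
including the "exclusive inclusivity" case. It only illustrates the semantics
described above; the function name is made up:

import re

def extension_filter_excludes(url, filter_string):
    """Return True if the URL should be filtered out by an extension filter."""
    ext = url.lower().rsplit(".", 1)[-1]
    terms = re.findall(r"([+-]?)([^+\-,\s]+)", filter_string)
    includes = [p for sign, p in terms if sign == "+"]
    excludes = [p for sign, p in terms if sign != "+"]

    def matches(pattern):
        # "*" acts as a wildcard within the extension, e.g. "jp*"
        return re.fullmatch(re.escape(pattern).replace(r"\*", ".*"), ext) is not None

    if any(matches(p) for p in includes):
        return False                          # an include pattern overrides any exclude
    if any(matches(p) for p in excludes):
        return True
    # only include patterns given ("exclusive inclusivity"): everything else is excluded
    return bool(includes) and not excludes

# extension_filter_excludes("pic.jpg",   "-jp*+jpeg") -> True
# extension_filter_excludes("pic.jpeg",  "-jp*+jpeg") -> False
# extension_filter_excludes("page.html", "+jpeg")     -> True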

 I am documenting issues heavily so we can use the stuff here to write Wiki
documentation later...

Original comment by abpil...@gmail.com on 21 Jul 2008 at 10:08

GoogleCodeExporter commented 8 years ago
I agree, this should be added to the wiki, along with the examples (small and
meaningful).

Original comment by andrei.p...@gmail.com on 21 Jul 2008 at 1:09

GoogleCodeExporter commented 8 years ago
Ok, this is the task I am working on currently; it is next in line in the
feature-list work for the crawler.

I think I need to write it here so that I don't drift off to other stuff :)

Original comment by abpil...@gmail.com on 12 Oct 2008 at 10:38

GoogleCodeExporter commented 8 years ago
Completed the urlfilter implementation with unit tests and integrated it with
HarvestMan. The text filter is pending; I plan to complete it tomorrow.
Expecting to finish off this issue within 2 days.

Original comment by abpil...@gmail.com on 16 Nov 2008 at 7:54

GoogleCodeExporter commented 8 years ago
Completed. Implemented text filter today. So closing this issue.

Original comment by abpil...@gmail.com on 12 Jan 2009 at 8:19

GoogleCodeExporter commented 8 years ago
There are 2 changes in the implementation from the spec above.

1. Did not implement the "crawl" attribute as above, so all filters are crawl
filters. I don't think there is much use for download-only filters. It is also
a bit tricky to do in the current code.
2. Meta filters accept the "tags" attribute only in OR mode. Also, it is
specified using commas to separate the tags, not using "&" or "|" as above.
For example, tags="keywords,description" will return True if the filter
matches either keywords or description in the page.
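
For reference, a simplified sketch of the as-implemented comma-separated,
OR-only "tags" handling described in point 2 (an illustration, not the actual
HarvestMan code):

import re

def meta_tags_match(regexp, page_meta, tags="all"):
    """OR-only matching with comma-separated tag names, e.g. tags="keywords,description"."""
    names = ("title", "keywords", "description") if tags == "all" \
            else [t.strip() for t in tags.split(",")]
    return any(re.search(regexp, page_meta.get(t, ""), re.IGNORECASE) for t in names)

# meta_tags_match(r"(web-bot|crawler)", {"description": "a web crawler"},
#                 tags="keywords,description")   -> True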

Original comment by abpil...@gmail.com on 12 Jan 2009 at 8:21