Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question: crawling in similar domain #612

Open FcrbPeter opened 5 years ago

FcrbPeter commented 5 years ago

Hi Pascal,

I am working on a website that spans several domains, such as...

// Domains listed in the start URL section
www.rthk.hk
app3.rthk.hk
app4.rthk.hk
programme.rthk.hk
news.rthk.hk
podcast.rthk.hk
// Domains that need to be crawled but are not listed above
app1.rthk.hk
app2.rthk.hk
... more with "rthk.hk"

In the config.xml, I did something like...

// stayOnDomain = false, because there are other similar domains
// stayOnPort & stayOnProtocol = false, because both http and https are used
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
<url>http://app3.rthk.hk/search/google/start.php</url>
<url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url>
<url>https://www.rthk.hk/</url>
<url>https://news.rthk.hk/</url>
<url>http://podcast.rthk.hk/</url>
<url>http://app4.rthk.hk/special/rthkmemory/</url>
<url>http://app4.rthk.hk/elearning/healthpedia/</url>
</startURLs>

<referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
        .*rthk\.org\.hk/.*
</filter>

... other exclude filters
</referenceFilters>

I found this solution in past issues. However, it does not seem to work in my case.
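As a quick sanity check of the first include expression, here is a minimal Java sketch. It assumes the filter accepts a URL only when the whole URL matches the expression (how RegexReferenceFilter matches internally is an assumption here); the class name and the example.com URL are made up for illustration.

```java
import java.util.regex.Pattern;

class IncludePatternCheck {
    // Same expression as the first include filter in the config above.
    static final Pattern RTHK_HK = Pattern.compile(".*rthk\\.hk/.*");

    // Assumption: a URL is included only when the whole URL matches.
    static boolean included(String url) {
        return RTHK_HK.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.rthk.hk/",
            "http://app1.rthk.hk/rthk/news",
            "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html",
            "http://example.com/?next=rthk.hk/" // hypothetical URL
        };
        for (String url : urls) {
            System.out.println(included(url) + "  " + url);
        }
    }
}
```

On its own, the expression does not match the ningbo.gov.cn URL. It is not tied to the host part, though, so any URL that merely contains "rthk.hk/" somewhere (like the hypothetical example.com one) would also match.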

The log below shows an unwanted URL being fetched:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/1d916bc2a14c46e2999138ed408fecb9.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/picture/0/04dd334f9961456586f017a5c44ce7dc.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/../../../../module/visitcount/visit.jsp?type=3&i_webid=2&i_columnid=316&i_articleid=944973 (RegexReferenceFilter[onMatch=EXCLUDE,caseSensitive=false,regex=.*\/([^\/]*)\/\1\/\1\/.*])
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/images/10/gmz_dqwz_pic.jpg (ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=jpeg,jpg,svg,gif,png,ico,caseSensitive=false])
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html
INFO  [CrawlerEventManager]           REJECTED_FILTER: http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html (No "include" document filters matched.)
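The rejection messages in this log read as if a URL is dropped when any "exclude" filter matches, or when "include" filters exist and none of them matches. The Java sketch below only models that reading; it is an assumption drawn from the log wording, not Norconex's actual code, and the class, the Rule type, and the rthk.hk test URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FilterChainSketch {
    // Hypothetical stand-in for one configured filter: expression plus onMatch mode.
    static final class Rule {
        final Pattern pattern;
        final boolean include;

        Rule(String regex, boolean include) {
            // The log reports caseSensitive=false, so match case-insensitively.
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.include = include;
        }

        boolean matches(String url) {
            return pattern.matcher(url).matches();
        }
    }

    // Assumed rule: any exclude match rejects; if includes exist, one must match.
    static boolean accepted(String url, List<Rule> rules) {
        boolean hasInclude = false;
        boolean includeMatched = false;
        for (Rule rule : rules) {
            boolean matched = rule.matches(url);
            if (rule.include) {
                hasInclude = true;
                includeMatched |= matched;
            } else if (matched) {
                return false;
            }
        }
        return !hasInclude || includeMatched;
    }

    // Two includes from the config above, plus an exclude equivalent to the one in the log.
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(".*rthk\\.hk/.*", true));
        rules.add(new Rule(".*rthk\\.org\\.hk/.*", true));
        rules.add(new Rule(".*/([^/]*)/\\1/\\1/.*", false));
        return rules;
    }

    public static void main(String[] args) {
        List<Rule> rules = sampleRules();
        System.out.println(accepted("https://news.rthk.hk/rthk/en/", rules));
        System.out.println(accepted(
            "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html", rules));
    }
}
```

Under this assumed rule, only one include has to match, so a catch-all include such as `^http://.*` sitting next to the domain includes would admit any HTTP URL.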

Is there anything wrong with the config?

Thanks!

essiembre commented 5 years ago

The only thing I can think of is you added/modified the filtering rules after you ran the Collector a few times and it got that URL from the "crawlstore" cache. Do you have the same behavior if you delete your crawlstore directory and try again (starting from scratch)?

If the problem is always there, please share your full config to reproduce.

You may be interested to know there is a new flag added in the snapshot version that allows you to also include subdomains when you use stayOnDomain="true". E.g.:

<startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
...
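To illustrate what "staying on a domain including its subdomains" amounts to, here is a rough Java sketch. It is not the Collector's implementation; the helper name and the suffix-based rule are assumptions made only for this example.

```java
import java.net.URI;
import java.util.Locale;

class SubdomainCheck {
    // Hypothetical helper: true when the URL's host is the domain itself
    // or ends with "." followed by the domain.
    static boolean onDomainOrSubdomain(String url, String domain) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        String h = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://app1.rthk.hk/",
            "https://www.rthk.hk/",
            "http://www.rthk.org.hk/",
            "http://gtob.ningbo.gov.cn/art/2018/9/5/art_316_944973.html"
        };
        for (String url : urls) {
            System.out.println(onDomainOrSubdomain(url, "rthk.hk") + "  " + url);
        }
    }
}
```

With a rule like this, hosts under rthk.org.hk do not count as subdomains of rthk.hk, so they would still need their own start URL or filter.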

FcrbPeter commented 5 years ago

Thanks for replying.

I tried includeSubdomains="true" and it works fine.

Let me share the config with you. You may notice "crawler.plugin.ContainsReferenceFilter" and "crawler.plugin.UrlReferenceFilter". They are similar to "com.norconex.collector.core.filter.impl.RegexReferenceFilter", but the check uses String.contains and String.startsWith respectively.
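For readers without the plugin source, the two checks boil down to something like the plain Java below. This is only a sketch of the matching conditions; the class and method names are hypothetical and it is not the actual plugin code.

```java
class PlainStringChecks {
    // Condition described for ContainsReferenceFilter: the fragment appears anywhere in the URL.
    static boolean containsMatch(String url, String fragment) {
        return url.contains(fragment);
    }

    // Condition described for UrlReferenceFilter: the URL begins with the given prefix.
    static boolean prefixMatch(String url, String prefix) {
        return url.startsWith(prefix);
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("http://app1.rthk.hk/cms/page", "/cms/"));
        System.out.println(prefixMatch("http://rthk.hk/rthk/news/today", "http://rthk.hk/rthk/news/"));
        System.out.println(prefixMatch("https://rthk.hk/rthk/news/today", "http://rthk.hk/rthk/news/"));
    }
}
```

Both checks are literal: regex escapes such as a backslash before a dot are compared character by character, and a prefix written with http:// does not cover the https:// variant of the same URL.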

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="MCGCS Web crawler">

  <!-- Decide where to store generated files. -->
  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">
      <!-- <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false"> -->
      <startURLs stayOnDomain="false" includeSubdomains="false" stayOnPort="false" stayOnProtocol="false">
        <!-- <url>http://app3.rthk.hk/search/google/start.php</url> -->
        <!-- <url>http://programme.rthk.hk/archivelist_gsa.php?channel=dtt31</url> -->
        <url>https://www.rthk.hk/</url>
        <!-- <url>https://news.rthk.hk/</url> -->
        <!-- <url>http://podcast.rthk.hk/</url> -->
        <!-- <url>http://app4.rthk.hk/special/rthkmemory/</url> -->
        <!-- <url>http://app4.rthk.hk/elearning/healthpedia/</url> -->
      </startURLs>

      <documentFilters>
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"> -->
    <!-- ^http\:\/\/app3\.rthk\.hk\/search\/google\/start\.php -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    http://app3.rthk.hk/search/google/start.php
</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    rthk.hk/
</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    rthk.org.hk/
</filter>
      </documentFilters>

      <referenceFilters>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
    html,pdf,doc,docx,xls,xlsx,ppt,pptx,xml,xml,rtf
</filter>
<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="exclude">
    jpeg,jpg,png,gif,ico,mp3,mp4,avi,mkv,flv
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    ^http://.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    ^https://.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.org\.hk/.*
</filter>

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*\/([^\/]*)\/\1\/\1\/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
    .*\/([^\/]*)\/([^\/]*)\/\1\/\2\/.*
</filter>

<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/press/presslistxml.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/press/preview/</filter>
<filter class="crawler.plugin.UrlReferenceFilter" onMatch="exclude">http://programme.rthk.hk/rthk/tv/programme.php?name=tv/cuhneverendingtrail&amp;d=2014-01-19&amp;p=6026&amp;e=239765&amp;m=episode</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://rthk9.rthk.hk/rthk/ch</filter> -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://rthk9.rthk.hk/rthk/en</filter> -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://news.rthk.hk/rthk/</filter> -->
<filter class="crawler.plugin.UrlReferenceFilter" onMatch="exclude">http://news.rthk.hk/rthk/ch/component/k2/1324001-20170407.htm?archive_date=2017-04-07</filter>
<!-- # -->
<!-- # Internal System -->
<!-- # -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*pop\.rthk\.(org\.)*hk.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/cms/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/newaps/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk-wt01</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*sdc.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">sdc.rthk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkcms2.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkcms3.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dev1.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dev2.rthk.hk</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">https://</filter> -->
<!-- # -->
<!-- # Repeat Domain -->
<!-- # -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app2.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app2.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">civilonline.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">civilonline.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">gbcode.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">gbcode.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">m.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">m.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">myrthkplus.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">myrthkplus.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">programme.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">publicaffairs.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">publicaffairs.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk8.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk8.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkcms2.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkcms3.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">search.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">search.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">www.teenpower.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">www.teenpower.rthk.org.hk</filter>
<!-- # -->
<!-- # News -->
<!-- # -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://news.rthk.hk</filter> -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/rthk/ch/component/k2/.*\.htm$</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/rthk/en/component/k2/.*\.htm$</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">news.rthk.hk/rthk/ch/video-gallery.htm</filter>
<!-- # News over 1 year -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/rthk/(ch|en)/component/k2/.*\-(2015|2016|2017|201801|201802|201803).*\.htm.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/rthk/ch/news\-archive\.htm\?archive\_year\=(2016|2017).*</filter>
<!-- # News (old) -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/.*share\=(facebook|twitter).*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/.*\&amp;\&amp;.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">newsprd.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">newsuat.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app1.rthk.hk/rthk/news</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/elocal.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/einternational.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/efinance.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/esport.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/egreaterchina.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/clocal.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/greaterchina.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/cinternational.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/cfinance.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/news/csport.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/engnews.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/engbulletin.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/hourly_news.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/newswrap.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/pthnews.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news/summary.*</filter>
<filter class="crawler.plugin.UrlReferenceFilter" onMatch="exclude">http://rthk.hk/rthk/news/</filter>
<filter class="crawler.plugin.UrlReferenceFilter" onMatch="exclude">http://www.rthk.org.hk/rthk/news/</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk7\.rthk\.(org\.)*hk/php/efinance/qphk\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app1\.rthk\.(org\.)*hk/rthk/news/videonews/video\_list\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/rthk/news/englishnews/[0-9]{8}/news\_[0-9]{8}\.htm.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/rthk/news/englishnews/news\.htm.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app1.rthk.hk/rthk/news/photogallery/photo_slideshow_2010.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk27.rthk.org.hk/php/news_sendthis</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk.hk/efinance/enews_pop.htm</filter>
<!-- # News over 1 year -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2009</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2010</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2011</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2012</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2013</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/news/expressnews/2014</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/rthk/news/expressnews/index_news\.htm\?expressnews\&amp;(2009|2010|2011|2012|2013|2014).*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/rthk/news/expressnews/news\.htm\?expressnews\&amp;(2009|2010|2011|2012|2013|2014).*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/rthk/news/englishnews/(2009|2010|2011|2012|2013|2014).*</filter>
<!-- # -->
<!-- # Mobile -->
<!-- # -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk.hk/mobile/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app3.rthk.hk/mobile/</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app1\.rthk\.(org\.)*hk/text/.*</filter>
<!-- # -->
<!-- # Special -->
<!-- # -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app1\.rthk\.(org\.)*hk/special/psq2009/blog/index\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app1\.rthk\.(org\.)*hk/special/unusualjourney/photo/main\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*teenpower1\.rthk\.(org\.)*hk/teenspecial/tptv/searchresult\.php.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/login_pop.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/memberReg_pop.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/popupPlayer.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/popupPlayer_left.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/popupPlayer_right.php</filter>
<!-- # disabled temporarily, error url # -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk</filter> -->
<!-- # -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app4\.rthk\.hk/apps/mine/index\.php.*share.*(facebook|twitter).*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/apps/login.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/programme/search.php</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*10books2002/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*healthy\_living/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*test/powis/lautinchi/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*mmgallery2002/d2/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*mmgallery2002/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*student10books/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*99books/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*leetm/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*leetm/test/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*pai\_hsien\_yung/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*lautinchi/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*travel/discussion/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*leetm2/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/bcr/action\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/bcr/book\.php.*rid\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/bcr/book\.php.*forum\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/bcr/bookflow\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special\_off/parenting2005/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*elearning/ecotour/forum-test/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*elearning/ecotour/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/dessert\_old/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/westkowloon/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/consumer/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/parentingschool/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/chineseopera/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/10books2005/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/dessert/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/dessert/new/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*publicaffairs/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*teenpower\.rthk\.(org\.)*hk/photo\_gallery/photo.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk27\.rthk\.(org\.)*hk/php/leetm2/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*trial/dessert/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*elderly/elderly\_oldversion/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*elderly/forum/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/hkfilex/common/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/tea/discussion/html/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/tea/discussion/bak/messages.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*civilonline.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*special/bookmarks/add\_comment\.php.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app1.rthk.hk/special/bookmarks/add_rating.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/publicaffairs/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/podcast/other_programme\\.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">tptvPlayer.swf</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/tptv/searchresult.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/teenspecial/tptv/searchresult.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/download.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/painter/</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/draw.php</filter> -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/elearning/hkillustrators/draw\.php\?sid\=.*t\=2.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/elearning/hkillustrators/draw\.php\?sid\=[0-9]*$</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/report.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/reply.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/nomessage.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/elearning/hkillustrators/gallery.php</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*podcast\.rthk\.hk/podcast/item\.php\?.*\&amp;lang\=zh\-CN.*</filter>
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*podcast\.rthk\.hk/podcast/item\_all\.php?.*\&amp;lang\=zh\-CN.*</filter> -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/tptv/1.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/tptv/index.php?</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/teenspecial/ogcio11/email_form/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app3.rthk.hk/special/teenwalker/reply.php</filter>
<!-- # podcast -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*podcast/.*lang\=en\-US.*</filter>
<!-- # remove temporarily from 12 Jan 2012(deadloop) -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/special/teentime/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">cibs.rthk.hk/botdetect/</filter>
<!-- # -->
<!-- # rthkmemory (remove old version) -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app3.rthk.hk/special/rthkmemory/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkmemory_uat</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app4\.rthk\.hk/special/rthkmemory/.*lang\=.*</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app4.rthk.hk/special/rthkmemory/</filter> -->
<!-- # -->
<!-- # Radio/TV -->
<!-- # -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dab31.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dab33.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dab35.rthk.hk</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*www\.rthk\.(org\.)*hk/rthk/program\_archive\.cgi.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*programme\.rthk\.(org\.)*hk.*(default\_banner|ad\_banner)\.swf.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/presenters/(index|programme)\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.org\.hk/rthk/tv/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.org\.hk/rthk/radio./.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.org\.hk/rthk/pth/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://www\.rthk\.hk/rthk/tv/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://www\.rthk\.hk/rthk/radio\./.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://www\.rthk\.hk/rthk/pth/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://rthk\.hk/rthk/tv/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://rthk\.hk/rthk/radio\./.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">^http\://rthk\.org\.hk/rthk/pth/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/index\.php.*type\=prog.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/index\.php.*tag\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php.*name\=radio.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php.*name\=pth.*</filter>
<!-- # suspend for %E6%9B%B4%E5%A4%9A button -->
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php.*m\=archive.*</filter> -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php.*m\=photo.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/index\.php.*type\=prog.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/index\.php.*tag\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/index\.php.*m\=album.*\&amp;page\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/programme\.php.*m\=photo.*</filter>
<!-- # suspend for %E6%9B%B4%E5%A4%9A button -->
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/programme\.php.*m\=archive.*</filter> -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/programme\.php\?d\=.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/programme\.php\?name\=[^/^&amp;]*\&amp;.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/programme\.php\?name\=/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php\?name\=[^/^&amp;]*\&amp;.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/programme\.php\?name\=/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/rthk/schedule/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/channel/presenters/profiles\.php.*photoid\=\-.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/channel/radio/player\_popup\.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/rthk/tv/player_popup\\.php</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*/channel/radio/programme\.php\?name\=.*\&amp;e\=[0-9]+\&amp;m\=episode.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*programme\.rthk\.hk/channel/radio/programme\.php.*d\=(2011\-|2012\-|2013\-|2013\-02|2013\-03|2013\-04|2013\-05|2013\-06|2013\-07).*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">dab33/r3_programmes</filter>
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk/tv/index\.php.*m\=archive.*</filter> -->
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*channel/radio/index\.php.*m\=archive.*</filter> -->
<!-- # -->
<!-- # revamp2016 -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkwebuat.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthkwebdev.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">preview.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk.hk/archive/</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.hk/.*\?.*lang\=</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.hk/.*\?.*share\=</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">programme.rthk.hk</filter> -->
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*http\://programme\.rthk\.hk/[^/]*/[^b].*</filter> -->
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*http\://programme\.rthk\.hk/[^/]*$</filter> -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/channel/radio/</filter> -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/rthk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/main</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/assets</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/player_popup.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/player_txt.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/programme.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/index.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/profiles.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/index_archive.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/traffic_news.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://programme.rthk.hk/channel_index.php</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">http://www.rthk.hk/radio/radio1/programme/traffic_news</filter>
<!-- # -->
<!-- # Others -->
<!-- # -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">detectflash=false</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/podcast/invite_chi.php</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/about/(chi|eng)/.*</filter>
<!-- <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/press/.+</filter> -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">/press/popup_email.php</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*membership/message.*\.php.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*php/discussion\_pro.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*rthk\.(org\.)*hk/progsch/.*</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*\.mp4\?ref\=web_sq$</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*\.mp4\?ref\=web_hq$</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*\.mp4.*</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">stmw.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">stmw1.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">stmw2.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">stmw3.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">archive.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">search.rthk.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">.hk/sharing/sharing.php</filter>
<!-- # -->
<!-- # -->
<!-- # remove duplicate podcast domain -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">podcast.rthk.org.hk</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">podcasts.rthk.hk</filter>
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">podcast.rthk.hk</filter> -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">special/rthkmemory/</filter> -->
<!-- # -->
<!-- # remove problem page temporary(spam) -->
<!-- <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/programme/comicaction/</filter> -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower.rthk.hk/programme/</filter>
<!-- # -->
<!-- # remove old content after revamp 2016 -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">rthk9.rthk.hk/rthk/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/teenspecial/ogcio11/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/teenspecial/antidrugs2010/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">teenpower1.rthk.hk/teenspecial/designthinking11/</filter>
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">app3.rthk.hk/special/ezone/</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*news\.rthk\.hk/rthk/.*/component/k2/.*\-2016.*\.htm</filter>
<!-- # -->
<!-- # remove infinity loops -->
<filter class="crawler.plugin.ContainsReferenceFilter" onMatch="exclude">www.rthk.hk/www.rthk.hk</filter>
<!-- # -->
<!-- # remove press temporarily -->
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">.*app3\.rthk\.hk/press/.*html.*</filter>
      </referenceFilters>

      <userAgent>gsa-crawler</userAgent>
      <workDir>./output</workDir>

      <orphansStrategy>DELETE</orphansStrategy>

      <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" charset="UTF-8" />
      </linkExtractors>
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <!-- <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols> -->
      </httpClientFactory>
      <!-- <sitemapResolverFactory ignore="true" /> -->
      <!-- <robotsTxt ignore="true" /> -->
      <!-- <robotsMeta ignore="true" /> -->

      <maxDepth>-1</maxDepth>
      <numThreads>4</numThreads>
      <delay default="100" scope="thread" />

      <importer>
        <preParseHandlers>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripBetweenTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripBetween>
              <start><![CDATA[<!--googleoff: index-->]]></start>
              <end><![CDATA[<!--googleon: index-->]]></end>
            </stripBetween>
          </transformer>
          <transformer class="com.norconex.importer.handler.transformer.impl.StripAfterTransformer" inclusive="true">
            <restrictTo field="document.contentType">text/html</restrictTo>
            <stripAfterRegex><![CDATA[<!--googleoff: index-->]]></stripAfterRegex>
          </transformer>
          <tagger class="com.norconex.committer.googlecloudsearch.BinaryContentTagger"/>
        </preParseHandlers>

        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>binaryContent,document.reference,document.contentType,collection,score,title,description,channelName,programmeName,episodeDate,episodeName,image</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <!-- <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./output/crawledFiles</directory>
      </committer> -->

      <!-- <committer class="com.norconex.committer.core.impl.NilCommitter" /> -->

      <committer class="com.norconex.committer.core.impl.JSONFileCommitter">
        <directory>./output/crawledFiles</directory>
        <pretty>true</pretty>
        <docsPerFile>200</docsPerFile>
        <compress>false</compress>
        <splitAddDelete>false</splitAddDelete>
      </committer>

      <!-- <committer class="com.norconex.committer.googlecloudsearch.GoogleCloudSearchCommitter">
        <configFilePath>./config/sdk-configuration.properties</configFilePath>
        <uploadFormat>raw</uploadFormat>
      </committer> -->

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 5 years ago

Given that it works for you now, can we close this? Feel free to submit a pull request if you feel your filters are ready for general use.

FcrbPeter commented 5 years ago

I ran into a similar problem again.

Let's say I start crawling from the URL http://programme.rthk.hk/episodelist_gsa.php?channel=radio1&prog=1094. There are two situations:

  1. <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
  2. <startURLs stayOnDomain="false" includeSubdomains="false" stayOnPort="false" stayOnProtocol="false">

Both situations use the same referenceFilters:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.hk/.*
</filter>
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.org\.hk/.*
</filter>

In situation 1: The collector doesn't extract any URLs from the start URL, even though the start page contains a lot of URLs on the domain www.rthk.hk.

In situation 2: The collector does seem to extract the URLs from the start URL, but it then follows URLs that are outside of rthk.hk and rthk.org.hk.

Also, I always remove the "crawlstore" cache (which I keep in an output folder) before starting a crawl.

Attached are the config files, crawl logs, and crawled document results. Please take a look: crawler_problem.tar.gz

essiembre commented 5 years ago

1. cannot-extract-links Works as expected. Your start URL is on programme.rthk.hk and you want to stay on that domain plus its subdomains. A link to www.rthk.hk gets rejected because it is not a subdomain of programme.rthk.hk. If you want all subdomains of rthk.hk, you have to make that domain your start URL. Otherwise, use filters like you did.

2. processed-outside-domain Also works as expected. According to your logs (and my tests), I could not find any committed documents that did not match your filter rules.

For example, https://www.facebook.com/RTHK.HK/ is matched by your .*rthk\.hk/.* filter, which is case-insensitive by default. You will have to refine your filters to be more restrictive if you want to exclude those.
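The match described above can be reproduced outside the crawler with plain `java.util.regex`. This is only a sketch: it assumes the filter tests the whole URL (as `Matcher.matches()` does) and, as stated above, ignores case by default. The class name `FilterCaseDemo` and the helper `accepts` are made up for illustration and are not part of Norconex.

```java
import java.util.regex.Pattern;

public class FilterCaseDemo {
    // Same expression as the include filter quoted in this thread.
    static final String REGEX = ".*rthk\\.hk/.*";

    // Hypothetical helper: full-string match, optionally case-sensitive.
    // Assumption: this mirrors how the reference filter evaluates a URL.
    static boolean accepts(String url, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(REGEX, flags).matcher(url).matches();
    }

    public static void main(String[] args) {
        String facebook = "https://www.facebook.com/RTHK.HK/";
        // Case-insensitive: "RTHK.HK/" in the path satisfies "rthk\.hk/".
        System.out.println(accepts(facebook, false));              // true
        // Case-sensitive: the same URL no longer matches.
        System.out.println(accepts(facebook, true));               // false
        // A genuine rthk.hk URL matches either way.
        System.out.println(accepts("https://www.rthk.hk/", true)); // true
    }
}
```

Because the expression starts with `.*`, the `rthk.hk/` part may appear anywhere in the URL, including the path of an unrelated host.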

FcrbPeter commented 5 years ago

Thanks for replying.

1. cannot-extract-links Thanks for the clarification. In this case, I don't want to crawl rthk.hk itself, because the targets are the subdomains of rthk.hk rather than the bare domain. I wonder whether it would be worth having a domain option that the stay-on-domain + subdomains check compares against. That would make more sense than adding filters to achieve the same goal.

2. processed-outside-domain Yes, no unwanted documents were committed, thanks to the document filters. However, the crawler still processed unwanted URLs such as https://zh-hk.facebook.com/login/ (lines 96582 and 96629 in the log). The log shows that the URL was fetched and that its child URLs were extracted. The problem is that, with maxDepth set to -1, the crawler does not stop and keeps crawling across the web, which is not what I expect given the reference filter below:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
    .*rthk\.hk/.*
</filter>

For the case you mentioned, yes, I have changed the filter to ^https?://[^/]*rthk\.hk/.* which is more restrictive. However, the crawler is still crawling outside of rthk.hk with this filter.
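For reference, the expression quoted above can be checked in isolation with `java.util.regex`. The sketch below assumes a full-string, case-insensitive match; it only tests the expression itself and says nothing about where in the crawl the filter is applied. The `TIGHTER` pattern, the class name `HostAnchoredFilterDemo`, and the helper `accepts` are hypothetical additions for illustration.

```java
import java.util.regex.Pattern;

public class HostAnchoredFilterDemo {
    // Expression quoted in the comment above: the match is tied to the host part.
    static final Pattern QUOTED = Pattern.compile(
            "^https?://[^/]*rthk\\.hk/.*", Pattern.CASE_INSENSITIVE);

    // Hypothetical tighter variant: the host must be rthk.hk or end in ".rthk.hk".
    static final Pattern TIGHTER = Pattern.compile(
            "^https?://([^/]+\\.)?rthk\\.hk/.*", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: full-string match of a URL against a pattern.
    static boolean accepts(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Off-domain URLs from this thread are rejected by the quoted expression.
        System.out.println(accepts(QUOTED, "https://www.facebook.com/RTHK.HK/")); // false
        System.out.println(accepts(QUOTED, "https://zh-hk.facebook.com/login/")); // false
        // Subdomains of rthk.hk are accepted.
        System.out.println(accepts(QUOTED, "http://app1.rthk.hk/page"));          // true
        // "[^/]*" also lets through a look-alike host (made-up example)...
        System.out.println(accepts(QUOTED, "http://notrthk.hk/"));                // true
        // ...which the tighter variant rejects.
        System.out.println(accepts(TIGHTER, "http://notrthk.hk/"));               // false
        System.out.println(accepts(TIGHTER, "https://www.rthk.hk/"));             // true
    }
}
```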

essiembre commented 5 years ago

The referenceFilters will filter out unwanted URLs before they are downloaded. Make sure you make it restrictive enough there as well.

As for having subA.main.com also accept subB.main.com when includeSubdomains is true, I am marking this as a feature request to make that configurable. The "stayOnXXX" strategy will likely be revisited/improved in the next significant release, as it is also similar to #614.
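To make the difference concrete, here is a rough sketch of the two checks discussed in this thread. It is not the Norconex implementation: the class `DomainScopeDemo` and the helper `isSameOrSubdomain` are hypothetical, and the comparison is a plain host-suffix test. The current behaviour corresponds to comparing a link's host against the start URL's host, while the requested behaviour corresponds to comparing it against a configured parent domain.

```java
import java.util.Locale;

public class DomainScopeDemo {
    // Hypothetical helper: true when linkHost equals baseHost
    // or is a subdomain of it (ends with "." + baseHost).
    static boolean isSameOrSubdomain(String baseHost, String linkHost) {
        String base = baseHost.toLowerCase(Locale.ROOT);
        String link = linkHost.toLowerCase(Locale.ROOT);
        return link.equals(base) || link.endsWith("." + base);
    }

    public static void main(String[] args) {
        // Behaviour described in the thread: the base is the start URL's host,
        // so a sibling subdomain is rejected.
        System.out.println(isSameOrSubdomain("programme.rthk.hk", "www.rthk.hk")); // false

        // Requested behaviour (assumption): the base is a configured parent domain,
        // so sibling subdomains are accepted and other sites are still rejected.
        System.out.println(isSameOrSubdomain("rthk.hk", "www.rthk.hk"));           // true
        System.out.println(isSameOrSubdomain("rthk.hk", "programme.rthk.hk"));     // true
        System.out.println(isSameOrSubdomain("rthk.hk", "zh-hk.facebook.com"));    // false
    }
}
```

The leading dot in the suffix test keeps look-alike hosts such as a made-up `notrthk.hk` from being treated as subdomains.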