Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

domain restriction #284

Closed doaa-khaled closed 7 years ago

doaa-khaled commented 8 years ago

There is a strange behavior: the file at http://datasheets.avx.com/TCJ.pdf was not fetched, even though it is referenced from the page http://www.avx.com/awards/finalist-for-ubm-techs-ee-times-and-edn-ace-award/. Is that because my XML configuration restricts the crawl to the same domain?

essiembre commented 8 years ago

Yes, if you restrict the crawl to the same domain, everything not on that domain will be rejected. If you want that file, you will need to lift the domain restriction and rely on either reference filters or import filters.

Be careful if choosing reference filters, as they reject URLs before they are downloaded, so links are not extracted from rejected documents. This is in many cases fine and simply more efficient, but there are cases where you want documents downloaded so their links can be extracted, yet otherwise rejected. In such cases, use import filters (I recommend as a pre-parse handler).

Make sense?

doaa-khaled commented 8 years ago

By import filter, do you mean filtering for the file extensions I want? And I want to ask: can I reject specific websites during the crawling process, such as social media websites or ads?

essiembre commented 8 years ago

You can filter on any field/value you want with the import filters. For instance, to filter on the URL (called document.reference at that point), you can use this:

<importer>
    <preParseHandlers>

      <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
          onMatch="include" field="document.reference" >
        Your regular expression matching URL pattern here
      </filter>

      ...

    </preParseHandlers>
</importer>

Make sure you set the "stayOn..." attributes to false on your start URLs. The above will not prevent any pages/documents from being downloaded, but it will reject them before they reach your committer.
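
For example, here is a minimal sketch of start URLs with the stay-on restrictions turned off (attribute names per the HTTP Collector 2.x schema; adjust the URL to your own site):

<startURLs stayOnDomain="false" stayOnProtocol="false" stayOnPort="false">
  <url>http://www.avx.com/</url>
</startURLs>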

But it all depends on what you want to do with your crawl; a little more explanation would help.

For instance, if you want to crawl only "www.avx.com" pages, except for PDFs, which can come from anywhere, you can also set "stayOn..." to false and use reference filters like this:

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
      (http://www\.avx\.com/.*|.*\.pdf)
  </filter>
</referenceFilters>

That scenario prevents downloading any files that do not match the pattern, so you save on bandwidth and the crawl is overall faster. Just make sure you do not reject pages that contain links to other pages you are interested in, as you will not get them (use the import filter approach if that is a concern). If you know the list of domains you want to allow the crawler to visit, you can also list those domains in the regex instead, as sketched below.
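
For instance, a sketch allowing two known hosts plus PDFs from anywhere (datasheets.avx.com is taken from your original question; substitute your own list):

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
      (http://www\.avx\.com/.*|http://datasheets\.avx\.com/.*|.*\.pdf)
  </filter>
</referenceFilters>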

You have several options to achieve what you want; it all depends on what that is.

Any clearer?

doaa-khaled commented 8 years ago

Dear Essiembre, thank you for your reply. I want to fetch any file of certain types, like PDF, when there is a reference to it in my supported URL, even if that file lives on a subdomain of that URL or on a different URL altogether. But I am afraid that setting "stayOnDomain" to false will lead the crawler through other, unwanted websites, which would consume a lot of time and bandwidth. So I want to limit it to files I have a reference to in my URL only. I hope that is clearer.

essiembre commented 8 years ago

If I understand you right, you want to crawl your domain only, plus direct links from your domain, but not further?

If you are using a fairly recent 2.6.0 snapshot, the URL referrer information is always kept, which allows you to do something like this:

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include" >
      http://www.avx.com.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
          field="collector.referrer-reference" onMatch="include" >
      http://www.avx.com.*
  </filter>
</documentFilters>

This configuration tells the crawler that at least one of the "include" filters must match for a document to be processed. So we have one filter that accepts only your domain, and another that accepts URLs that have your domain as the referrer.
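
To illustrate (the first and third URLs below are hypothetical; the second is the PDF from your original question):

http://www.avx.com/some-page/      -> kept: matches the first (reference) filter
http://datasheets.avx.com/TCJ.pdf  -> kept when linked from a www.avx.com page: matches the second (referrer) filter
http://other-site.com/page.html    -> rejected when not linked from www.avx.com: matches neither filter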

I believe this will give you what you want. Please confirm.

essiembre commented 8 years ago

I just want to add that if you are after specific file types, my original suggestion will also work and be much more efficient (saving you many downloads):

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
      (http://www\.avx\.com/.*|.*\.pdf)
  </filter>
</referenceFilters>

As long as you do not include HTML as an extension in the second part of the regular expression (since HTML pages get their links extracted and followed), it should work just fine. Have you tried it? Did it not produce what you were after?

doaa-khaled commented 8 years ago

I tried to add this

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
    (http://www\.avx\.com/.*|.*\.pdf|xls|xlsx|doc|docx|ppt|pptx|zip)
</filter>

and the crawler rejected the first link, since it did not match that pattern, and stopped crawling!

doaa-khaled commented 8 years ago

And I have a question regarding that configuration:

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include" >
      http://www.avx.com.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
          field="collector.referrer-reference" onMatch="include" >
      http://www.avx.com.*
  </filter>
</documentFilters>

Won't the first filter prevent redirects from being followed?

essiembre commented 8 years ago

About your filter that stops the crawling: that is the normal outcome with the regular expression you have defined.

.*\.pdf means any URL ending with "pdf", but a bare entry like xls means the URL has to match exactly "xls". In other words, you are missing .*\. before every extension. If you do not want to repeat it, you can use a more elaborate regex, like this (untested):

(http://www\.avx\.com/.*|.*\.(pdf|xls|xlsx|doc|docx|ppt|pptx|zip))
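
Dropped into the reference filters block from earlier, that becomes (same class and attributes as before; only the regex changes):

<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
      (http://www\.avx\.com/.*|.*\.(pdf|xls|xlsx|doc|docx|ppt|pptx|zip))
  </filter>
</referenceFilters>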

About redirects, it all depends on where the URL redirects to. If the redirect target URL matches your filters, it will be processed; otherwise, it will be rejected.

doaa-khaled commented 8 years ago

I tried that as well and it didn't work. Regarding the second suggestion, when I set "stayOnDomain" to false and put in the restriction of

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include" >
      http://www.avx.com.*
  </filter>
  <filter class="com.norconex.collector.core.filter.impl.RegexMetadataFilter"
          field="collector.referrer-reference" onMatch="include" >
      http://www.avx.com.*
  </filter>
</documentFilters>

the crawler has gone far into other websites. Instead of finishing in 3 days as before, it has now been running for around 21 days and has not stopped updating incrementally! Is there a solution to make it stop at the level I mentioned above?

essiembre commented 8 years ago

It works for me with the RegexReferenceFilter trick of accepting the domain or specific file extensions. Why do you say this does not work?

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include" >
      (http://www\.avx\.com/.*|.*\.(pdf|xls|xlsx|doc|docx|ppt|pptx|zip))
</filter>

What happens?

Since I cannot reproduce the issue you are having, I am afraid you will have to attach your full config.
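
For context, the snippets in this thread all live inside a structure roughly like this (a stripped-down 2.x skeleton; a real config has more, such as a committer):

<httpcollector id="My Collector">
  <crawlers>
    <crawler id="My Crawler">
      <startURLs stayOnDomain="false">
        <url>http://www.avx.com/</url>
      </startURLs>
      <referenceFilters>
        <!-- include/exclude filters here -->
      </referenceFilters>
    </crawler>
  </crawlers>
</httpcollector>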

essiembre commented 7 years ago

Closing due to lack of feedback and inability to reproduce.

raskolnikov7 commented 5 years ago

Hi, I am looking to crawl and commit pages from a domain based on URL pattern. Say the top level is https://www.abc.com, and I have patternA and patternB under it: https://www.abc.com/personal and https://www.abc.com/business should be committed to different stores. Now I have this for startURLs in the config:

https://www.abc.com/
https://www.abc.com.sitemap.xml

I have to keep it like the above because a URL such as https://www.abc.com/personal redirects to it. The filter is as below:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include" caseSensitive="false">
    .*/personal/.*
</filter>

This is getting rejected by the filter.

essiembre commented 5 years ago

This ticket is closed. Please open a new one for new questions. See #636.