Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

crawl on specific types of files #270

Closed doaa-khaled closed 8 years ago

doaa-khaled commented 8 years ago

Hi all. When I include only specific file types in my filters, the crawler doesn't work properly. It seems that because I didn't include HTML pages, it is unable to reach these files. Is there a solution for that?

essiembre commented 8 years ago

The solution would be to filter documents in the Importer module. That way, the links to be followed are already extracted before filtering takes place. I am not sure what you are doing exactly, though. Please attach your config.

doaa-khaled commented 8 years ago

I am trying to do that at the importer stage, and this is my configuration:

<importer>
  <postParseHandlers>
    <filter class="com.norconex.importer.handler.filter.impl.EmptyMetadataFilter"
        onMatch="include" fields="pdf,xls,xlsx,doc,docx,ppt,pptx" />
  </postParseHandlers>
</importer>
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>$workdir\crawledFiles\22</directory>
</committer>

But when I checked the folder where I save the imported files, I found files in other formats. Can you tell me what is wrong with my settings?

essiembre commented 8 years ago

The EmptyMetadataFilter is meant to accept/reject documents that do not have a specific metadata field (or that have it, but it is empty).

So what you are asking in your config is to only include documents that have a field named "pdf", or "xls", etc., with values in them. The net effect is likely to reject pretty much all your documents.

If you want to crawl a site but, in the end, only keep files of specific formats, you are right to do this at the importer level, but you need to use a different filter. I suggest you try the RegexMetadataFilter, like this (not tested):

<importer>
  <preParseHandlers>

    <!-- to filter on URL extension: -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.reference">
      .*\.(pdf|xls|xlsx|doc|docx|ppt|pptx)$
    </filter>

    <!-- to filter on content type (probably best if your URLs do not always have an extension): -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.contentType">
      (application/pdf|anotherOne|yetAnotherOne|etc)
    </filter>

  </preParseHandlers>
</importer>

While it would work both ways, I recommend you configure these filters as "pre" parse handlers. The document reference and content type are metadata fields available before parsing occurs. Since you do not want to keep the other documents, there is no point in parsing them, so we filter them out before parsing occurs to save some processing.
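
To make the content-type variant concrete for the formats mentioned in this thread, the placeholders could be filled in along the lines of the sketch below. It is untested, like the example above. The MIME types listed are the standard ones for PDF and Microsoft Office files, but it is an assumption that the crawler reports exactly these values in document.contentType, so it is worth checking the metadata of a few crawled files first.

```xml
<importer>
  <preParseHandlers>
    <!-- Assumption: document.contentType holds the bare MIME type, with no charset suffix. -->
    <!-- The list covers pdf, doc, xls, ppt and their OOXML counterparts (docx, xlsx, pptx). -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.contentType">
      (application/pdf|application/msword|application/vnd\.ms-excel|application/vnd\.ms-powerpoint|application/vnd\.openxmlformats-officedocument\.(wordprocessingml\.document|spreadsheetml\.sheet|presentationml\.presentation))
    </filter>
  </preParseHandlers>
</importer>
```

Filtering on content type rather than on the URL extension should also catch download links that carry no extension at all, at the cost of depending on what the server or the type detector reports.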

doaa-khaled commented 8 years ago

Yes, it works now! Thanks a lot.