Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Configuration Question #110

Closed hanabadler closed 9 years ago

hanabadler commented 9 years ago

Hi, I would like to know how to configure the collector to collect only the images referenced in each URL. I am writing a custom committer that needs to send to a REST API every URL in a website whose HTML contains images, together with the image src values found on that page. The outcome I want is something like this:

url1-img1,img2,img3
url2-img11,img22,img33,img44

I tried to play with the following tags:

  <referenceFilters>
    <filter 
        class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
        onMatch="include">
        jpg,gif,png,ico,css,js
        </filter>
    <filter 
        class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
        onMatch="include" >
    .*
    </filter>
  </referenceFilters>

But this didn't work for me. In addition, I started implementing the committer (similar to the Solr committer) by extending AbstractMappedCommitter and implementing the commitBatch method. I looked at the ICommitOperation object but didn't see the actual content of the HTML page.

Can you help me with this? Thanks, Hana

essiembre commented 9 years ago

There are multiple ways to achieve this, with varying levels of complexity. It really depends on how you would like the data stored in the end. I'll give you a few samples and pointers.

By using a ReplaceTagger within your <importer> config section, you can extract all URLs pointing to images and store them in a new field:

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    <replace fromField="collector.referenced-urls" toField="document.images" regex="true">
      <fromValue>(.*\.gif$|.*\.png$|.*\.jpg$)</fromValue>
      <toValue>IMAGE=$1</toValue>
    </replace>
  </tagger>
</preParseHandlers>

The result will be that your HTML pages are indexed with a multi-value field called "document.images", which will contain all image URLs found on that page, each prefixed with "IMAGE=". Prefixing the new value is important since the replace won't take place if the toValue is the same as the fromValue. If you want to clean it up so you only have relevant fields, you can use KeepOnlyTagger:

<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fieldsRegex>^document\..*</fieldsRegex>
  </tagger>
</postParseHandlers>
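
To get a feel for which referenced URLs would end up in "document.images", here is a minimal plain-Java sketch of the same idea. It does not use any Norconex class: ImageFieldSketch and toImageField are hypothetical names, and the logic only assumes the behaviour described above (a value is kept when the regex replace actually changes it). The pattern escapes the dots so they match a literal ".".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: mimics the effect of the
// ReplaceTagger sample on a list of referenced URLs.
class ImageFieldSketch {

    // Same alternatives as the fromValue above, with literal dots.
    static final Pattern IMAGE =
            Pattern.compile("(.*\\.gif$|.*\\.png$|.*\\.jpg$)");

    // Returns the values that would land in the new field.
    static List<String> toImageField(List<String> referencedUrls) {
        List<String> images = new ArrayList<>();
        for (String url : referencedUrls) {
            String replaced = IMAGE.matcher(url).replaceFirst("IMAGE=$1");
            // Assumption: a value is only stored when the replace changed it.
            if (!replaced.equals(url)) {
                images.add(replaced);
            }
        }
        return images;
    }
}
```

Passing it a list with one .gif URL and one .html URL should return a single "IMAGE=..." entry for the .gif URL; the .html URL is skipped because the pattern does not match it.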

Even if it is the image URLs you are interested in, it is the HTML files you want to crawl to obtain those:

<filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" onMatch="include">
    html
</filter>

If it is the images you want to save instead, that's something else.

If you need this more for reporting purposes, you can also look at implementing an ICrawlerEventListener and decide what to log in your own file. The crawler event you would be interested in is defined by this constant: HttpCrawlerEvent.URLS_EXTRACTED. The "subject" argument in the crawler event object will be the list of URLs. You can then keep only the image URLs from that list and store them in your report.
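
The filtering part of such a listener could look roughly like the sketch below. It is plain Java and deliberately leaves out the Norconex wiring: ImageReportSketch, isImage and reportLine are hypothetical names, and the assumption is that your listener hands over the page URL plus the extracted URLs as strings. The output follows the "url-img1,img2" layout from the question.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper, not part of Norconex: turns the URLs extracted
// from one page into a single report line.
class ImageReportSketch {

    // Case-insensitive check on the extensions used in this thread.
    static boolean isImage(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(".gif")
                || lower.endsWith(".png")
                || lower.endsWith(".jpg");
    }

    // Builds "pageUrl-img1,img2,..." keeping only image URLs.
    static String reportLine(String pageUrl, Collection<String> extractedUrls) {
        StringJoiner images = new StringJoiner(",");
        for (String url : extractedUrls) {
            if (isImage(url)) {
                images.add(url);
            }
        }
        return pageUrl + "-" + images;
    }
}
```

Inside your event listener you would call reportLine once per URLS_EXTRACTED event and append the result to your own file. URLs with query strings (for example "a.png?v=2") are not treated as images here, so adjust isImage if your site uses them.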

For your committer implementation, the ICommitOperation will either be an instance of IAddOperation or IDeleteOperation. When dealing with adds, you will find in IAddOperation a method called getContentStream to get the content.
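
Once you have that stream, reading it into a string is ordinary Java I/O. The sketch below is a minimal example with no Norconex dependency: ContentStreamSketch and readAll are hypothetical names, and UTF-8 is an assumption you should replace with whatever encoding your content uses. Depending on your importer configuration, what you read at this point may be the extracted text rather than the raw HTML.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Norconex: drains an InputStream
// (such as the one returned by getContentStream) into a String.
class ContentStreamSketch {

    static String readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int read;
            // read() returns -1 once the stream is exhausted.
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumption: the content is UTF-8 encoded.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In commitBatch you would check whether each operation is an IAddOperation, pass its content stream to a helper like this, and then build the payload for your REST API.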

essiembre commented 9 years ago

Closing since no feedback was received on the last answer. Please create a new issue if you have additional questions.