Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Tagger to analyze meta #337

Closed aleha84 closed 7 years ago

aleha84 commented 7 years ago

I need a tagger that lets me analyze a specified metadata field and write some data to another metadata field.

More specifically, I need to create an extra field indicating whether the downloaded data is a file or a page, based on its content type (text/html or application/*).

For now I'm using this CountMatchesTagger:

<tagger class="com.norconex.importer.handler.tagger.impl.CountMatchesTagger">
    <countMatches caseSensitive="false" fromField="content-Type" toField="isFile" regex="true">
        application/.*
    </countMatches>
</tagger>

But I think it is ugly and not flexible.

essiembre commented 7 years ago

You can consider using the ReplaceTagger instead, like this:

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    <replace fromField="content-Type" toField="fileType" regex="true">
        <fromValue>application/.*</fromValue>
        <toValue>app</toValue>
    </replace>
    <replace fromField="content-Type" toField="fileType" regex="true">
        <fromValue>text/html</fromValue>
        <toValue>html</toValue>
    </replace>
</tagger>

If your logic gets too complex, you can also consider using the ScriptTagger for more control.
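
As a rough sketch of the ScriptTagger route (not taken from this thread): it assumes the default JavaScript engine, that the script receives a `metadata` object exposing `getString` and `setString`, and that the content type is stored under `document.contentType`. All three are assumptions to verify against the ScriptTagger documentation for your Importer version.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <script><![CDATA[
        // Assumed field name; adjust if your content type is stored elsewhere.
        var contentType = metadata.getString('document.contentType');
        // Treat any application/* content type as a file, everything else as a page.
        var isFile = contentType != null && contentType.indexOf('application/') == 0;
        metadata.setString('isFile', isFile ? 'true' : 'false');
    ]]></script>
</tagger>
```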

essiembre commented 7 years ago

You may also want to consider using this field, which is created after parsing takes place: "document.contentFamily".

aleha84 commented 7 years ago

I'm also using KeepOnlyTagger. How do I specify that "document.contentFamily" should be kept? It seems that neither "document.contentFamily" nor "document_contentFamily" works.

essiembre commented 7 years ago

"document.contentFamily" should do it if added to your KeepOnlyTagger. If not, please share your config. You can also use the DebugTagger to see all fields that are set on your document at any point during the import process.

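A minimal sketch of that check, assuming an Importer 2.x release where KeepOnlyTagger takes a `<fields>` element and DebugTagger accepts `logFields` and `logLevel` attributes (other versions may use a different syntax). The `title` field is only a placeholder for whatever else you keep.

```xml
<!-- Keep only the listed fields; "title" is a placeholder. -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>title, document.contentFamily</fields>
</tagger>
<!-- Log the field afterwards to confirm it is still set on the document. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
        logFields="document.contentFamily" logLevel="INFO" />
```
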
aleha84 commented 7 years ago

"document.contentFamily" works fine in KeepOnlyTagger; it was a Kibana field caching issue (I needed to refresh). I need a definite indicator of whether a downloaded URL was a file or an HTML page, and a boolean would be preferable. I will check the statistics for the document.contentFamily field after a full reindex to decide how to use it. Thanks.

essiembre commented 7 years ago

If you are using the latest snapshot, here is an example of how you can obtain the boolean you want:

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    <replace fromField="content-Type" toField="isFile" regex="true">
        <fromValue>application/.*</fromValue>
        <toValue>true</toValue>
    </replace>
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="noop">
    <constant name="isFile">false</constant>
</tagger>

aleha84 commented 7 years ago

I just tried ReplaceTagger as you recommended, and it did not work at all. I checked with different attributes. The field "isFile" is null in the debug output. I think the regex doesn't match.

essiembre commented 7 years ago

Make sure you have the latest snapshot for the onConflict="noop" to work. Other than that, share your file to reproduce. I kept what you had in my example, but maybe the fromField should be different? Maybe try with document.contentType instead.
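
Applied to the earlier example, that suggestion would look roughly like the sketch below. Only the fromField changes; it still assumes a snapshot that supports onConflict="noop", and whether document.contentType is the right source field for a given setup is untested here.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    <!-- Read the detected content type instead of the "content-Type" field. -->
    <replace fromField="document.contentType" toField="isFile" regex="true">
        <fromValue>application/.*</fromValue>
        <toValue>true</toValue>
    </replace>
</tagger>
<!-- Default to false when the ReplaceTagger above did not set isFile. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger" onConflict="noop">
    <constant name="isFile">false</constant>
</tagger>
```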