Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

fields into Elasticsearch #24

Closed: jacksonp2008 closed 7 years ago

jacksonp2008 commented 7 years ago

I am testing your products. So far the crawler works, and it was much easier to follow than Nutch or StormCrawler.

I am sending to Elasticsearch 5.6 and the data is there. However, I am a bit confused about the fields, as I am not seeing what I had expected. I would like to see ALL available data fields sent to ES, and then I can pare back as needed.

In your example:

    <importer>
      <postParseHandlers>
        <!-- If your target repository does not support arbitrary fields,
             make sure you only keep the fields you need. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
          <fields>title,keywords,description,document.reference</fields>
        </tagger>
      </postParseHandlers>
    </importer>

If I don't set it, even less data shows up in ES. How can I send all the data fields the crawler finds? (Or, how do I know which data fields are available for this setting?)

I am also confused about sourceContentField and targetContentField, i.e., what are the source content fields to choose from? I really want to end up with a field in ES that has all the text from the page or document.

Thanks!

essiembre commented 7 years ago

About the importer: You can comment out the whole KeepOnlyTagger block to get all the fields it finds. Another way to find out which fields are discovered before documents are committed is to use the DebugTagger.
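As a rough sketch of both suggestions combined, the importer section could look something like the following. The DebugTagger class name and its `logLevel` attribute are written as recalled from the Importer 2.x documentation, and the assumption that all fields are logged when no `logFields` attribute is given should be checked against the version you run.

```xml
<importer>
  <postParseHandlers>
    <!-- Logs document fields at this point in the handler chain.
         Assumption: with no logFields attribute, all fields are logged. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logLevel="INFO" />
    <!-- KeepOnlyTagger commented out so every discovered field is kept:
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,keywords,description,document.reference</fields>
    </tagger>
    -->
  </postParseHandlers>
</importer>
```

Once the log output shows which fields exist, the KeepOnlyTagger can be restored with just the field names you want to keep.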

About the Elasticsearch Committer: The sourceContentField lets you override which field of the document being committed holds the document "content". By default, the committer does not read the content from a field; it uses the actual document content. For typical crawls, this is usually what you want, and you can ignore this setting.

The targetContentField is related. It is the Elasticsearch field in which the document content will be stored. By default, it is "content".
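For illustration only, a committer section that relies on the default source (the document body) and spells out the target field might look like this. The `nodes`, `indexName`, and `typeName` elements and their values are assumptions based on the REST-based 4.x committer and are placeholders here; older committer versions use different connection settings, so check the documentation for your version.

```xml
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <!-- Placeholder connection values (assumed); adjust to your cluster. -->
  <nodes>http://localhost:9200</nodes>
  <indexName>crawl</indexName>
  <typeName>doc</typeName>
  <!-- sourceContentField is omitted, so the document body itself is used. -->
  <!-- Elasticsearch field that receives the body; "content" is the default. -->
  <targetContentField>content</targetContentField>
</committer>
```

With a setup like this, the full extracted text of each page or document should end up in the "content" field of the index.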

Does that answer your question?

jacksonp2008 commented 7 years ago

Yes. In fact, I am so pleased now that I understand how it works that I am going to use this for all my systems if I can work out the SSO. Thank you, Pascal, for taking the time to respond!

essiembre commented 7 years ago

No problem! I am glad you like it. Hopefully, SSO will not be a big deal to add.