Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Unable to import tagValues in AWS Cloudsearch #59

Closed umeshkalia closed 7 years ago

umeshkalia commented 7 years ago

I am trying to crawl web pages and store them in AWS CloudSearch, but I am having trouble storing tag values in CloudSearch. The details of the problem are below:

I can see both "title" and "h3" in the debug log. The value of "title" is imported into AWS CloudSearch successfully, but the contents of "h3" are not added.

I delete the committer-queue and crawler folders every time before I run http-collector.

Below is the importer configuration I am using:

<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="h3" toField="h3" overwrite="true" defaultValue="Nil"/>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="h3" logLevel="INFO"/>
  </preParseHandlers>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,description,content,popularity</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="title" logLevel="INFO"/>
  </postParseHandlers>
</importer>

The debug log is: [image]

Please suggest a solution as soon as possible.

essiembre commented 7 years ago

At first glance, I see nothing wrong. What does AWS CloudSearch tell you when you try to add this document? Is there any indication in the CloudSearch logs?

Have you tried adding the h3 field to the index fields of your CloudSearch domain before indexing?

umeshkalia commented 7 years ago

Thanks for your reply.

I have already resolved it by adding "h3" to the postParseHandlers section (as shown below).

image
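
The screenshot of the fix is not reproduced in this thread. A plausible reading, given the configuration posted earlier, is that the KeepOnlyTagger in postParseHandlers was dropping "h3" because the field was not in its list, so the value never reached the committer. The sketch below assumes the fix was simply to append h3 to that list; the exact change the author made is not confirmed by the text.

```xml
<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <!-- Assumption: h3 appended so the field set by the DOMTagger
         in preParseHandlers is kept instead of being discarded. -->
    <fields>title,description,content,popularity,h3</fields>
  </tagger>
  <!-- Log both fields after filtering to confirm h3 is still present. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logFields="title,h3" logLevel="INFO"/>
</postParseHandlers>
```

Logging h3 from the post-parse DebugTagger, rather than only from the pre-parse one as in the original configuration, would show whether the field still exists once KeepOnlyTagger has run.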