Extract content of only <h2> tags of the page #447

avi7777 commented 6 years ago

I want to extract the text inside all the <h2> tags of the page I am crawling. I created a field named "pagecontent" of type Collection(Edm.String) and used the setting below to fetch the text of the <h2> tags:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="h2" toField="pagecontent" extract="text" defaultValue="No_Content" />
</tagger>

This is my post-parse handler setting:

<postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
        <fields>document.reference,title,description,content,pagecontent</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="document.reference" toField="reference"/>
    </tagger>
</postParseHandlers>

But when the collector command executes, I get the error below:

Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'PrimitiveValue' node was found when reading from the JSON reader. A 'StartArray' node was expected.\r\n"}}

The values stored in the META file of the queue directory are as follows (the <h2> tag occurs twice in this scenario):

document.reference="......"
title="......"
description="....."
pagecontent="......"
pagecontent="......"

I have also tried changing the type of the pagecontent field from Collection(Edm.String) to Edm.String. In that case I get the error below:

Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'StartArray' node was found when reading from the JSON reader. A 'PrimitiveValue' node was expected.\r\n"}}

Please help me find a fix for this, or suggest an alternative, such as storing only the content of the <h1> and <h2> tags in the content file, which stores the entire body content by default.

ronjakoi commented 6 years ago

You should probably have your taggers reversed, like this:

<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
    <rename fromField="document.reference" toField="reference"/>          
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>reference,title,description,content,pagecontent</fields>
</tagger>

That is: first rename document.reference to reference, then mention reference in KeepOnlyTagger.

The documentation recommends having the KeepOnlyTagger last.

ronjakoi commented 6 years ago

Also, you can use ForceSingleValueTagger on the pagecontent field to avoid having multiple values.
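
A minimal sketch of what that could look like, assuming the Importer 2.x configuration syntax for ForceSingleValueTagger. The ", " separator is an arbitrary choice for illustration, and the tagger is assumed to run after the DOMTagger has populated pagecontent. Since the field then holds a single string, the Azure index field would presumably need to be Edm.String rather than Collection(Edm.String).

```xml
<!-- Sketch only: collapse the multi-valued "pagecontent" field into one value. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
    <!-- "mergeWith:<separator>" joins all values; "keepFirst" or "keepLast"
         would instead retain a single <h2> text and drop the others. -->
    <singleValue field="pagecontent" action="mergeWith:, " />
</tagger>
```

Merging keeps the text of every <h2> on the page, whereas keepFirst or keepLast discards all but one heading.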

avi7777 commented 6 years ago

Placing the KeepOnlyTagger at the end did not give me a value for reference when I checked the generated documents, so I kept the order as in my original configuration above. As you mentioned, ForceSingleValueTagger worked well for me. Thank you for the response. :)

Can you please respond to this ticket as well? https://github.com/Norconex/collector-http/issues/442

essiembre commented 6 years ago

@avi7777 does that mean you are OK with @ronjakoi's response and we can close this issue?

avi7777 commented 6 years ago

Yes. This issue can be closed.

essiembre commented 6 years ago

Thanks for confirming.
