Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Extract content of only <h2> tags of the page #447

Closed: avi7777 closed this 6 years ago

avi7777 commented 6 years ago

I want to extract the text inside all the <h2> tags of the page I am crawling. I have created a field named "pagecontent" of type Collection(Edm.String) and used the setting below to fetch the text of the <h2> tags:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="h2" toField="pagecontent" extract="text" defaultValue="No_Content" />
</tagger>

This is my post-parse handler setting:

<postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
        <fields>document.reference,title,description,content,pagecontent</fields>
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
        <rename fromField="document.reference" toField="reference"/>
    </tagger>
</postParseHandlers>

But when the collector command executes, I get the error below:

Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'PrimitiveValue' node was found when reading from the JSON reader. A 'StartArray' node was expected.\r\n"}}

These are the values stored in the META file of the queue directory (the <h2> tag occurs twice in this scenario):

document.reference="......"
title="......"
description="....."
pagecontent="......"
pagecontent="......"

I have also tried changing the type of the pagecontent field from Collection(Edm.String) to Edm.String. In that case I get the error below:

Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'StartArray' node was found when reading from the JSON reader. A 'PrimitiveValue' node was expected.\r\n"}}

Please help me find a fix for this, or suggest an alternative, such as storing only the content of the <h1> and <h2> tags in the content file, which stores the entire body content by default.

ronjakoi commented 6 years ago

You should probably have your taggers reversed, like this:

<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
    <rename fromField="document.reference" toField="reference"/>          
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>reference,title,description,content,pagecontent</fields>
</tagger>

That is: first rename document.reference to reference, then mention reference in KeepOnlyTagger.

The documentation recommends having the KeepOnlyTagger last.

ronjakoi commented 6 years ago

Also, you can use ForceSingleValueTagger on the pagecontent field to avoid having multiple values.
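
A minimal sketch of what that could look like, assuming the Importer 2.x configuration syntax and that the tagger is placed after the DOMTagger that populates pagecontent. The choice of keepFirst is only an illustration, not something confirmed in this thread; check the ForceSingleValueTagger documentation for your version before relying on it:

```xml
<!-- Sketch only: reduce "pagecontent" to a single value so the committer
     sends a primitive value rather than an array. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
    <!-- "keepFirst" is an assumed choice: it keeps the first <h2> text
         and discards the others. -->
    <singleValue field="pagecontent" action="keepFirst"/>
</tagger>
```

With a single value, the Azure index field would presumably need to be Edm.String rather than Collection(Edm.String). If the text of every <h2> should be preserved, the tagger's documentation also describes keepLast and mergeWith:<separator> actions, the latter joining all values into one string.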

avi7777 commented 6 years ago

Placing the KeepOnlyTagger at the end did not give me a value for reference when I checked the generated documents, so I kept my original order. As you mentioned, ForceSingleValueTagger worked well for me. Thank you for the response. :)

Can you please respond to this ticket as well? https://github.com/Norconex/collector-http/issues/442

essiembre commented 6 years ago

@avi7777 does that mean you are OK with @ronjakoi's response and we can close this issue?

avi7777 commented 6 years ago

Yes. This issue can be closed.

essiembre commented 6 years ago

Thanks for confirming.