Closed avi7777 closed 6 years ago
You should probably have your taggers reversed, like this:
<tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
<rename fromField="document.reference" toField="reference"/>
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>reference,title,description,content,pagecontent</fields>
</tagger>
That is: first rename document.reference to reference, then mention reference in KeepOnlyTagger.
The documentation recommends having the KeepOnlyTagger last.
Also, you can use ForceSingleValueTagger on the pagecontent field to avoid having multiple values.
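For illustration, a ForceSingleValueTagger entry for this case might look like the sketch below. It assumes the Importer 2.x configuration syntax and that merging all values into one comma-separated string is acceptable; the separator is an arbitrary choice, and `keepFirst` or `keepLast` are the alternative actions. Put it after whatever handler populates pagecontent.

```xml
<!-- Sketch only: collapses a multi-valued "pagecontent" field into one value. -->
<!-- The ", " separator is an assumption; use keepFirst or keepLast to drop extras instead. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
  <singleValue field="pagecontent" action="mergeWith:, "/>
</tagger>
```

With a single value being sent, this should fit an index field of type Edm.String rather than Collection(Edm.String).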
Placing the KeepOnlyTagger at the end did not give me a value for reference when I checked the generated documents, so I kept the order as above. As you mentioned, ForceSingleValueTagger worked well for me. Thank you for the response. :)
Could you please respond to this ticket as well? https://github.com/Norconex/collector-http/issues/442
@avi7777 does that mean you are OK with @ronjakoi's response and we can close this issue?
Yes. This issue can be closed.
Thanks for confirming.
I want to extract the text inside all the <h2> tags in the page I am crawling. I have created a field named "pagecontent" of type Collection(Edm.String) and used the setting below to fetch the text of the <h2> tags:
This is my post-handler setting:
But when the collector command executes, I get the error below:
Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'PrimitiveValue' node was found when reading from the JSON reader. A 'StartArray' node was expected.\r\n"}}
The value stored in the META file of the queue directory is (the <h2> tag occurs twice in the scenario below):
I have also tried changing the type of the pagecontent field from Collection(Edm.String) to Edm.String. In that case I get the error below:
Azure Response: {"error":{"code":"","message":"The request is invalid. Details: parameters : An unexpected 'StartArray' node was found when reading from the JSON reader. A 'PrimitiveValue' node was expected.\r\n"}}
Please help me find a fix for this, or suggest an alternative, such as storing only the content of the <h1> and <h2> tags in the content file, which stores the entire body content by default.