Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Indexing value of a new metatag #791

Closed sudeshna-majumder closed 2 years ago

sudeshna-majumder commented 2 years ago

Hello,

I have added a few new metatags to my page. For example:
<meta name="content-region" content="Global" />

I want to extract the value of the 'content-region' field and index it as 'region'. I am committing to Google Cloud Search. With the pre-parse handler below, I expected the value 'Global' to be stored in the 'region' field, but DebugTagger shows <null> every time.

<tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
  <copy fromField="content-region" toField="region" overwrite="true" />
</tagger>

Am I missing anything basic here?

essiembre commented 2 years ago

I suspect it is related to having it defined as part of the pre-parse handlers. Fields extracted from a file's content are created when the document is parsed, so the meta field does not exist yet before parsing. Try moving your logic to a post-parse handler.
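As a rough illustration of that suggestion, the fragment below places the same CopyTagger under the importer's post-parse handlers. It is only a sketch and assumes the Importer 2.x configuration syntax used in the question. The DebugTagger entry and its logFields and logLevel attributes are included as an assumed way to check the result, and the surrounding crawler configuration is omitted.

```xml
<importer>
  <!-- Post-parse handlers run after the document is parsed, so fields
       taken from the page's meta tags (such as content-region) exist here. -->
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
      <copy fromField="content-region" toField="region" overwrite="true" />
    </tagger>
    <!-- Assumed verification step: log both fields to confirm the copy. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="content-region,region" logLevel="INFO" />
  </postParseHandlers>
</importer>
```

If 'region' still logs as empty in this position, the source field name itself is worth checking against the DebugTagger output.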

If that does not work for you, please share the version you are using and your full config in order for us to reproduce it.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sudeshna-majumder commented 2 years ago

Thanks Pascal. After moving the logic to a post-parse handler, I am able to extract values from the new metatags. But they are not being committed by my google-cloud-search committer. In the support document below, I don't see any Norconex configuration option for committing additional metatags to the Cloud Search committer: https://developers.google.com/cloud-search/docs/guides/norconex-http-connector#configure-gcs Do you know of any way to commit them to Google Cloud Search?
