Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ReplaceTransformer matching issue? #686

Closed by svanschalkwyk 4 years ago

svanschalkwyk commented 4 years ago

Input line is:

<meta name="p_similar_products">/ip/Organic-Carrots-2-lb-bag/44391103?athcpid=4

I expected "/ip/" to be replaced. The configuration is:

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <restrictTo caseSensitive="false" field="p_similar_products">
        .*
    </restrictTo>
    <replace>
        <fromValue>^\/ip\/.*</fromValue>
        <toValue>https://grocery.walmart.com/ip/</toValue>
    </replace>
</transformer>

essiembre commented 4 years ago

Transformers operate on the content (body) of your document, not on its metadata fields. The "restrictTo" element only limits which documents your configuration applies to; it does not select the field to be modified. You seem to want to do the replacement on a field, in which case try the ReplaceTagger and make sure to specify the "fromField".
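
A minimal sketch of what that could look like, assuming the Importer 2.x configuration syntax for ReplaceTagger (the "regex" attribute and the exact element layout are assumptions here and differ in later versions, so check the documentation for the version you run). The pattern is also narrowed to the "/ip/" prefix, on the assumption that the rest of the path should be kept: a pattern ending in ".*" matches the whole value, so the product path would be replaced along with the prefix.

```xml
<!-- Sketch only: syntax assumed from Importer 2.x; verify against your version. -->
<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
    <!-- fromField names the metadata field whose values are rewritten -->
    <replace fromField="p_similar_products" regex="true">
        <!-- match only the leading "/ip/" so the remainder of the path is preserved -->
        <fromValue>^/ip/</fromValue>
        <toValue>https://grocery.walmart.com/ip/</toValue>
    </replace>
</tagger>
```

With the sample value above, the intended result would be https://grocery.walmart.com/ip/Organic-Carrots-2-lb-bag/44391103?athcpid=4. The tagger goes in the same handler section of the importer configuration where the transformer was declared.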