Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Update a new Solr created field when a particular crawled field contains a certain string #156

Closed by mitchelljj 9 years ago

mitchelljj commented 9 years ago

I have a field called "DC-ED.audience" that contains multiple comma-separated strings, for example: "DC-ED.audience":["Institutions of Higher Education", "Administrators; Counselors"]

Suppose I create new fields in Solr before the initial crawl, such as an "Institutions of Higher Education" field. During the crawl, I would like to key on the "DC-ED.audience" field and, whenever it contains a string like "Institutions of Higher Education", set the new "Institutions of Higher Education" field to "TRUE".

essiembre commented 9 years ago

There are probably a few ways you could do this, but I suggest you look at the ReplaceTagger, which is part of the Importer module. You could use it in a way similar to this:

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <replace fromField="YourFieldHavingTheValue" toField="Institutions_of_Higher_Education"
          regex="true">
      <fromValue>.*DC-ED\.audience.*Institutions of Higher Education.*</fromValue>
      <toValue>true</toValue>
  </replace>
  <replace fromField="Institutions_of_Higher_Education" regex="true">
      <fromValue>^(?!true$).*$</fromValue>
      <toValue></toValue>
  </replace>
</tagger>

If your DC-ED.audience field is not a metadata field already extracted but is part of the body, you can do something similar with TextPatternTagger.

mitchelljj commented 9 years ago

Thanks for the information! As I understand it, the second section within ReplaceTagger replaces any value that is not "true" with no value. If I leave that section out, what will the "Institutions_of_Higher_Education" field contain, or will the field not be displayed in those records?

essiembre commented 9 years ago

I have not tested your particular case, but I believe it will copy the content over as is if it could not perform a replace. So you are correct, the second one is to make sure it is blank when not "true". In fact, it may be best to store "false" instead of blank in your case.
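
The two-step logic discussed above can be checked outside the crawler with plain java.util.regex. The following is only a rough sketch, not the ReplaceTagger itself: the class name AudienceFlag and the method flag are hypothetical, and it assumes (as believed above, but untested) that a value is copied over unchanged when the first replace does not match. It stores "false" rather than a blank for non-matching values.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the two <replace> steps; not part of Norconex.
public class AudienceFlag {
    // Step 1: a value mentioning the audience becomes "true".
    static final Pattern HAS_AUDIENCE =
            Pattern.compile(".*Institutions of Higher Education.*");
    // Step 2: any value that is not exactly "true" becomes "false".
    static final Pattern NOT_TRUE = Pattern.compile("^(?!true$).*$");

    static String flag(String value) {
        // Assumption: when step 1 does not match, the value is copied as is.
        String step1 = HAS_AUDIENCE.matcher(value).matches() ? "true" : value;
        return NOT_TRUE.matcher(step1).matches() ? "false" : step1;
    }

    public static void main(String[] args) {
        System.out.println(flag("Institutions of Higher Education")); // true
        System.out.println(flag("Administrators; Counselors"));       // false
    }
}
```

The negative lookahead (?!true$) rejects only the exact string "true", so values left over from the copy step, including empty ones, all end up as "false".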