Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Strip Characters in title #109

Closed bkisselbach closed 4 years ago

bkisselbach commented 4 years ago

We need to trim down the title of a webpage, but I can't get it to work. The title contains pipes ( | ), and we want to keep only the words to the left of the first pipe. I've tried the TextBetweenTagger like this:

bkisselbach commented 4 years ago
<tagger class="com.norconex.importer.handler.tagger.impl.TextBetweenTagger">
    <restrictTo caseSensitive="false" field="title">.*</restrictTo>
    <textBetween name="title">
        <start>^</start>
        <end>\|</end>
    </textBetween>
</tagger>
essiembre commented 4 years ago

I can think of a few possible causes.

The restrictTo is meant to apply a handler only to certain documents, not to certain fields. As written, it restricts the textBetween logic to all documents whose title matches .* (so all documents).

It should otherwise work. I suggest you put a DebugTagger just before yours to print out the title, to confirm it is what you expect at that stage. It is possible, for instance, that you have more than one title value; this would show it. E.g.:

<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logFields="title" logLevel="INFO" />

You can add the same DebugTagger right after yours to see if ANY transformation occurred on that field.

You can also try the ReplaceTagger instead, like this (untested):

  <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
      <replace regex="true" fromField="title">
          <fromValue>^(.*?)\|.*</fromValue>
          <toValue>$1</toValue>
      </replace>
  </tagger>
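
To check what that regular expression does before wiring it into the importer, here is a minimal sketch using plain java.util.regex. It does not call any Norconex class, and it assumes the tagger applies the pattern with ordinary Java regex replacement semantics. The class name, method name, and sample titles are made up for illustration.

```java
import java.util.regex.Pattern;

public class TitleTrimSketch {

    // Same expression as the fromValue above: lazily capture everything
    // before the first pipe, then consume the rest of the value.
    private static final Pattern LEFT_OF_PIPE = Pattern.compile("^(.*?)\\|.*");

    // Hypothetical helper: returns the text left of the first pipe,
    // or the title unchanged when it contains no pipe.
    public static String keepLeftOfFirstPipe(String title) {
        return LEFT_OF_PIPE.matcher(title).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Brackets make any leading or trailing whitespace visible.
        System.out.println("[" + keepLeftOfFirstPipe("Products | Acme | Home") + "]");
        System.out.println("[" + keepLeftOfFirstPipe("No pipe here") + "]");
    }
}
```

With this pattern, any whitespace sitting just before the first pipe stays in the result, so the first sample keeps its trailing space.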
bkisselbach commented 4 years ago

Perfect. I added a little to remove the whitespace: ^(.*?)\s\|.*

Thanks!
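
For anyone landing here later, a small sketch of the whitespace-trimming variant in plain java.util.regex, assuming the pattern is ^(.*?)\s\|.* (one whitespace character required just before the pipe). The class name, method name, and sample titles are hypothetical, and no Norconex class is involved.

```java
import java.util.regex.Pattern;

public class TitleTrimWhitespaceSketch {

    // Assumed pattern: lazily capture the text before the first
    // "whitespace + pipe" pair, dropping that whitespace character.
    private static final Pattern LEFT_OF_SPACED_PIPE =
            Pattern.compile("^(.*?)\\s\\|.*");

    // Hypothetical helper: returns the captured text, or the title
    // unchanged when no "whitespace + pipe" pair is found.
    public static String keepLeftOfFirstPipe(String title) {
        return LEFT_OF_SPACED_PIPE.matcher(title).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Brackets make any leftover whitespace visible.
        System.out.println("[" + keepLeftOfFirstPipe("Products | Acme | Home") + "]");
        System.out.println("[" + keepLeftOfFirstPipe("Products| Acme") + "]");
    }
}
```

Because \s must match exactly one character, a title with no whitespace before its pipe is left untouched. Using \s* instead would also cover that case, as well as runs of several spaces.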