Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

TitleGeneratorTagger not detecting headers as expected #53

Closed by OkkeKlein 7 years ago

OkkeKlein commented 7 years ago

When the heading contains a period inside a domain name, or consists of two sentences (two periods, or one period and a question mark) followed by newlines, it is not used as the title.

essiembre commented 7 years ago

This is done to increase the odds that a single first line is indeed a title. What do you get if you set detectHeading to false? If you are interested in the first line only, you can also use the TextPatternTagger. Or, if you are dealing with HTML documents, you can use the DOMTagger to extract the H1 or another element.

This tagger is meant to be very simple and will never handle 100% of use cases as expected, but we can probably add a bit more intelligence. Can you please attach a sample file along with the expected results?

OkkeKlein commented 7 years ago

Did you fill in the form on website.com?
Thank you. Will that be all?

are not recognized as headings due to the period.

In my use case, all headings end with a question mark, but I suspect the behavior would be the same if the question mark were a period.

essiembre commented 7 years ago

Is it one example or two? When I run the TitleGeneratorTagger against your two lines, I get "Did you fill in the form on website.com?", as expected. If I run it against your second line only, I get "Thank you.", also as expected. Are you getting something else?

When detectHeading is true, the TitleGeneratorTagger first checks whether the text starts with a line containing just one sentence. If so, it picks that line as the title. Otherwise, it tries to generate a title as best it can, and if that fails, it just grabs the first sentence. You can find more info on the expected behavior in the javadoc.
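To make that decision flow concrete, here is a minimal sketch in plain Java. It is not the TitleGeneratorTagger source: the class name, the method names, and the sentence-boundary rule (a `.`, `!` or `?` followed by whitespace or end of text) are all assumptions made for illustration, and the intermediate "generate a title" step is skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Norconex implementation.
class HeadingSketch {

    // Assumed sentence boundary: ., ! or ? followed by whitespace or end of text.
    // A period inside "website.com" is followed by a letter, so it does not count.
    private static final Pattern SENTENCE_END = Pattern.compile("[.!?]+(?=\\s|$)");

    // Counts sentences in a single line under the assumed boundary rule.
    static int countSentences(String line) {
        Matcher m = SENTENCE_END.matcher(line);
        int count = 0;
        int lastEnd = 0;
        while (m.find()) {
            count++;
            lastEnd = m.end();
        }
        // Trailing text without a terminator counts as one more sentence.
        if (!line.substring(lastEnd).trim().isEmpty()) {
            count++;
        }
        return count;
    }

    // Returns the text up to and including the first sentence terminator.
    static String firstSentence(String text) {
        Matcher m = SENTENCE_END.matcher(text);
        return (m.find() ? text.substring(0, m.end()) : text).trim();
    }

    // Heading check first; otherwise fall back to the first sentence.
    static String pickTitle(String content) {
        String text = content.replaceFirst("^\\s+", "");
        String[] parts = text.split("\\R", 2); // first line, then the rest
        boolean followedByNewline = parts.length == 2;
        String firstLine = parts[0].trim();
        if (followedByNewline && !firstLine.isEmpty()
                && countSentences(firstLine) == 1) {
            return firstLine; // stand-alone, single-sentence line
        }
        return firstSentence(text);
    }

    public static void main(String[] args) {
        System.out.println(pickTitle(
                "Did you fill in the form on website.com?\nThank you. Will that be all?"));
        System.out.println(pickTitle("Thank you. Will that be all?\n"));
    }
}
```

Under these assumptions, the first call picks the whole first line, while the second falls back to "Thank you." because its first line holds two sentences. A different boundary rule would change how the domain-name case is treated.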

OkkeKlein commented 7 years ago

I think the javadoc states that the heading is determined by one or more newlines. That is the behavior I need, but I am not seeing it for these two examples.

essiembre commented 7 years ago

That's not how the behavior is described/working. Here is the relevant portion from the javadoc:

If isDetectHeading() returns true, this handler will check if the content starts with a stand-alone, single-sentence line (which could be the actual title). That is, a line of text with only one sentence in it, followed by one or more new line characters. The idea being that if there is more than one sentence, in most cases it should be considered a paragraph instead.

If what you are after is the first line instead (or first few lines), then the TextPatternTagger is your friend. Something like this (not tested):

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <pattern field="title" group="1">
    \s*?(.*?)[\n\r].*
  </pattern>
</tagger>
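Since the pattern above is untested, a quick way to see what its capture group 1 returns is to run the same expression through java.util.regex. This is only a sketch: the class and method names are hypothetical, and it assumes a plain `find()` with no regex flags, which may differ from how TextPatternTagger applies the pattern internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the expression; not Norconex code.
class FirstLinePatternSketch {

    // Same expression as in the <pattern> element, escaped for a Java string.
    static final Pattern FIRST_LINE = Pattern.compile("\\s*?(.*?)[\\n\\r].*");

    // Returns capture group 1 (the first line), or null if there is no line break.
    static String firstLine(String content) {
        Matcher m = FIRST_LINE.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content =
                "Did you fill in the form on website.com?\nThank you. Will that be all?";
        System.out.println(firstLine(content));
    }
}
```

Because the expression only looks for the first line break and never inspects punctuation, periods and question marks within the first line have no effect on what is captured.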

Does that work for you?

OkkeKlein commented 7 years ago

This still leaves the case of a domain name used in a sentence, and I guess we look at headings differently, but the TextPatternTagger gave me a way to resolve this. Thank you!