Closed: OkkeKlein closed this issue 7 years ago
This is done to increase the odds that a single first line is indeed a title. What do you get if you set detectHeading to false? If you are interested in the first line only, you can also use the TextPatternTagger. Or, if you are dealing with HTML docs, you can use the DOMTagger to extract the H1 or another element.
This tagger is meant to be very simple and will never handle every use case exactly as expected, but we can probably add a bit more intelligence. Can you please attach a sample file along with the expected results?
Did you fill in the form on website.com? Thank you. Will that be all?
are not recognized as headings due to the period.
My use case has all headings ending with a question mark, but I suspect the behavior would be the same if the question mark were a period.
Is it one example or two? Because when I run the TitleGeneratorTagger against your two lines, I get this, as expected: "Did you fill in the form on website.com?". If I run it against your second line only, I get "Thank you.", as expected. Are you getting something else?
When detectHeading is true, the TitleGeneratorTagger first checks whether the text starts with a line containing just one sentence. If so, it picks that line as the title. Otherwise, it tries to generate a title as best it can, and if that fails, it just grabs the first sentence. You can find more info on the expected behavior in the javadoc.
I think it states that the heading is determined by one or more newlines. That is the behavior I need, but I am not seeing it for these 2 examples.
That's not how the behavior is described, nor how it works. Here is the relevant portion from the javadoc:
If isDetectHeading() returns true, this handler will check if the content starts with a stand-alone, single-sentence line (which could be the actual title). That is, a line of text with only one sentence in it, followed by one or more new line characters. The idea being that if there is more than one sentence, in most cases it should be considered a paragraph instead.
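To make the quoted rule concrete, here is a minimal Java sketch of a "stand-alone, single-sentence first line" check. It is an illustration only, not the TitleGeneratorTagger source: the class name HeadingSketch, both method names, and in particular the way sentences are counted (a `.`, `!` or `?` counts as a sentence end only when followed by whitespace or the end of the line) are assumptions made for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the javadoc rule; NOT the Norconex implementation.
final class HeadingSketch {

    // First non-blank line, which must be followed by at least one line break.
    private static final Pattern FIRST_LINE =
            Pattern.compile("\\A\\s*([^\\r\\n]+)[\\r\\n]+");

    // Assumed sentence end: ., ! or ? followed by whitespace or the end of the line.
    private static final Pattern SENTENCE_END =
            Pattern.compile("[.!?]+(?=\\s|$)");

    // Counts sentences in a single line using the assumed rule above.
    static int countSentences(String line) {
        Matcher m = SENTENCE_END.matcher(line);
        int count = 0;
        int lastEnd = 0;
        while (m.find()) {
            count++;
            lastEnd = m.end();
        }
        // Trailing text without a terminator counts as one more sentence.
        if (!line.substring(lastEnd).trim().isEmpty()) {
            count++;
        }
        return count;
    }

    // Returns the first line if it holds exactly one sentence, else empty.
    static Optional<String> detectHeading(String text) {
        Matcher m = FIRST_LINE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        String line = m.group(1).trim();
        return countSentences(line) == 1 ? Optional.of(line) : Optional.empty();
    }
}
```

Under these assumptions, `HeadingSketch.detectHeading("Thank you. Will that be all?\n\nBody.")` is empty because the first line holds two sentences, while a first line such as "Did you fill in the form on website.com?" is accepted because the period inside the domain name is not followed by whitespace. Whether the tagger counts sentences this way is not stated in this thread.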
If what you are after is the first line instead (or first few lines), then the TextPatternTagger is your friend. Something like this (not tested):
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <pattern field="title" group="1">
    \s*?(.*?)[\n\r].*
  </pattern>
</tagger>
Does that work for you?
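For trying the "first line as title" idea outside the importer, here is a small sketch using plain java.util.regex. The class name FirstLineSketch and the method firstLine are made up for this example, and the pattern is a variant of the one in the XML above (it consumes leading whitespace greedily and anchors at the start of the text). How TextPatternTagger itself applies its pattern is not shown in this thread, so treat this as an approximation rather than a reproduction.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with first-line extraction.
final class FirstLineSketch {

    // Skip leading whitespace, then capture everything up to the first line break.
    private static final Pattern FIRST_LINE =
            Pattern.compile("\\A\\s*([^\\r\\n]+)");

    // Returns the first non-blank line, trimmed, or an empty string if there is none.
    static String firstLine(String text) {
        Matcher m = FIRST_LINE.matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }
}
```

With this variant, `FirstLineSketch.firstLine("Thank you. Will that be all?\n\nBody text.")` returns the whole first line regardless of how many sentences or periods it contains, which is the behavior asked for earlier in the thread.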
This still leaves the case of a domain name used in a sentence, and I guess we look at headings differently, but the TextPatternTagger gave me the option to resolve this. Thank you!
When the header contains a period in a domain name, or has 2 sentences (2 periods, or 1 period and a question mark), followed by newlines, it is not used as the title.