Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it lets you perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

TextPatternTagger issue #65

Closed aleha84 closed 7 years ago

aleha84 commented 7 years ago

As suggested in the DOMTagger description, it is better (for performance reasons) to use TextPatternTagger.

Most of the pages have a breadcrumb that I need to store in a metadata field. The page markup looks like this:

<div class="blocks-list__item hidden-xs">
                                                <div class="row">
                                                    <div class="col-sm-28"><ol class="breadcrumb">
<li><a href="/ru/"><span class="breadcrumb__home"></span></a></li> 
<li><a href="/s4">Рынки</a></li>
<li><a href="/ru/derivatives/">Срочный рынок</a></li>
<li><a href="/ru/members.aspx?tid=35">Участники</a></li>
<li><a href="/ru/derivatives/open-positions.aspx">Информация об открытых позициях</a></li>
</ol></div>

                                                </div>
                                            </div>

In importer.preParseHandlers I added two sections:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <restrictTo caseSensitive="false" field="content-Type">
        text/.*
    </restrictTo>
    <dom selector="ol.breadcrumb" toField="nav-breadcrumb-dom" extract="outerHtml" />
</tagger>
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
    <pattern field="nav-breadcrumb" caseSensitive="false">
        <![CDATA[<ol class="breadcrumb">(.*)</ol>]]>
    </pattern>
</tagger>

DOMTagger works fine. TextPatternTagger finds nothing.

Is my config wrong?

essiembre commented 7 years ago

I just tried your config snippet with your sample content and it worked fine for me. Which version are you using? You can try the latest snapshot in case something was fixed. Also, are you using any transformers before invoking the TextPatternTagger? Maybe the HTML is slightly modified beforehand?
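One way to rule out the pattern itself is to run it against the sample markup in plain Java, outside the importer. The sketch below is only an illustration under stated assumptions: the class `BreadcrumbRegexCheck` and the helper `firstGroup` are hypothetical names, the markup is a shortened copy of the sample above, and nothing here calls Norconex code. It shows that in `java.util.regex` the `.` in `(.*)` does not cross line breaks unless `Pattern.DOTALL` is set. Whether TextPatternTagger sets that flag internally is not stated in this thread, so treat this as something to check rather than a diagnosis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check, independent of Norconex Importer: it only shows how
// java.util.regex treats "." across line breaks.
final class BreadcrumbRegexCheck {

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, int flags, String content) {
        Matcher m = Pattern.compile(regex, flags).matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened copy of the breadcrumb markup from the question.
        // The line breaks after <ol ...> and before </ol> are what matter here.
        String html = "<div class=\"col-sm-28\"><ol class=\"breadcrumb\">\n"
                + "<li><a href=\"/ru/\"><span class=\"breadcrumb__home\"></span></a></li>\n"
                + "</ol></div>\n";
        String regex = "<ol class=\"breadcrumb\">(.*)</ol>";

        // Without DOTALL, "." stops at the line break after the opening tag,
        // so the closing </ol> on a later line is never reached.
        System.out.println(firstGroup(regex, Pattern.CASE_INSENSITIVE, html));

        // With DOTALL, "." also matches line terminators and the group
        // spans the lines between the opening and closing tags.
        System.out.println(firstGroup(regex,
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL, html));
    }
}
```

If the first call returns null while the second returns the list items, the line breaks in the fetched page are the likely difference. In standard Java regex syntax the same flag can be embedded in the expression as a `(?s)` prefix; that it would be honored inside a `<pattern>` element is an assumption to verify against the TextPatternTagger documentation.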

Unless you are already facing performance issues, I suggest you use whatever approach you are more comfortable with. Unless your CPUs are maxed out, you can increase the number of threads and reduce the default delay. This should have a bigger performance impact than switching from DOMTagger to TextPatternTagger.

aleha84 commented 7 years ago

I am using the latest stable version of the Importer, with no transformers before the tagger. No performance slowdown was detected, so I will keep using DOMTagger. Thanks.

essiembre commented 7 years ago

OK, I'll close this then, but if similar issues come up again with TextPatternTagger, do not hesitate to re-open.