Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

DOMTagger and <head><script> #57

Closed liar666 closed 7 years ago

liar666 commented 7 years ago

If I crawl the page https://web-ast.dsi.cnrs.fr/l3c/owa/personnel.infos_admin?p_numero_sel=1361736 with the following tagger configuration:

          <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="head script:eq(2)" toField="SCRIPT-HTML"
                 overwrite="true"
                 extract="outerHtml" /> <!-- diff -->
            <dom selector="head script:eq(2)" toField="SCRIPT-TEXT"
                 overwrite="true"
                 extract="text" /> <!-- diff -->
          </tagger>

I get the correct HTML code in SCRIPT-HTML, but SCRIPT-TEXT is empty. Is that expected?

essiembre commented 7 years ago

The content of a <script> element is considered data, not text, so changing extract="text" to extract="data" will do it.
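
As a sketch of that change, assuming everything else in the configuration posted above stays as it is, only the second `<dom>` entry needs a different `extract` value. The comments are added here for illustration:

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Unchanged: stores the full markup of the selected <script> element. -->
  <dom selector="head script:eq(2)" toField="SCRIPT-HTML"
       overwrite="true"
       extract="outerHtml" />
  <!-- Changed: "data" instead of "text", since script content is treated as data. -->
  <dom selector="head script:eq(2)" toField="SCRIPT-TEXT"
       overwrite="true"
       extract="data" />
</tagger>
```

The field name SCRIPT-TEXT is kept from the original snippet; it can be renamed if "text" is misleading for script content.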

liar666 commented 7 years ago

OK, thanks. I never noticed how much the list of "extract" options has grown :) https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/DOMTagger.html