Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Bug in com.norconex.importer.handler.tagger.impl.DOMTagger? #25

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

Hi,

I'm using norconex-collector-http-2.5.1 with lib/norconex-importer-2.6.0-SNAPSHOT.jar

In the crawler configuration file below, you'll find that DESCRIPTION and DESCRIPTION-TEST3 have exactly the same definition/selector. However, in the resulting "meta" file, I get different results for these tags:

DESCRIPTION=Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. 
The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. 
Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. 
As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  ^|~Find the most relevant companies for you^|~Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  
DESCRIPTION-TEST3=Plating and Surface Coating^|~Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd. Polymertal maintains state of the art R&D abilities that are dedicated to constantly develop and improve the products. The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating – resulting in the identification and development of a unique plating process. As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.  
DESCRIPTION-TEST2=Plating and Surface Coating
DESCRIPTION-TEST1=<div class\="company__short-description" style\="word-break\: keep-all;">\n  Plating and Surface Coating \n</div>^|~<div class\="company__description">\n <p></p>\n <p>Polymertal Ltd was established on the basis of an enterprise within Digispeech Ltd.&nbsp;Polymertal maintains state of the art R&amp;D abilities that are dedicated to constantly develop and improve the&nbsp;products.</p>\n <p></p> \n <p></p>\n <p>The vision was to develop a high quality solution to the ongoing problem of electro-magnetic and radio wave interferences between different electronic and communication systems. Through much research and development the understanding was reached that the solution lies in surface plating &\#x2013; resulting in the identification and development of a unique plating process.</p>\n <p></p> \n <p></p>\n <p>As a result, Polymertal has developed unique technologies for polymers surface treatment that allow two major abilities\: metal plating on plastic and composites materials, and surfaces treatment of 3D printer products.</p>\n <p></p> \n <p></p>\n <p>&nbsp;</p>\n <p></p>\n</div>

Why are the values repeated 8 times for DESCRIPTION?

<?xml version="1.0" encoding="UTF-8"?>
<!-- Testing crawler for Test -->

<httpcollector id="Test">

  <!-- Decide where to store generated files. -->
  <progressDir>./test/progress</progressDir>
  <logsDir>./test/logs</logsDir>

  <crawlers>
    <crawler id="Test">

      <robotsTxt ignore="true" />

      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://finder.startupnationcentral.org/c/polymertal</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./test</workDir>

      <!-- TODO: Use several threads: set to 5??? -->
      <numThreads>1</numThreads>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>  <!-- TODO: Set to 2??? -->

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Crawl only companies pages -->
      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include">
          https?://finder\.startupnationcentral\.org/c/[a-z0-9_+-]+
        </filter>
      </referenceFilters>

      <!-- Document extraction/manipulation -->
      <importer>

        <preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
                  <!--sourceCharset="UTF-8"-->
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST1"
                 overwrite="true"
                 extract="outerHtml" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST2"
                 overwrite="true"
                 extract="ownText" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION-TEST3"
                 overwrite="true"
                 extract="text" />
            <dom selector="div[class~=company__main]>div[class~=company__short-description],   div[class~=company__main]>div[class=company__description]" toField="DESCRIPTION"
                 overwrite="true"
                 extract="text" />
          </tagger>
        </preParseHandlers>
      </importer>

<!-- Basic committer, for the record -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./test/crawledFiles</directory>
      </committer>
<!-- -->

    </crawler>
  </crawlers>

</httpcollector>

essiembre commented 8 years ago

This is a case where the same field name is defined twice with different letter case. You define "DESCRIPTION" in your config, but "description" already exists in the document. Values are then stored under both names, and since metadata keys are case-insensitive by default on retrieval, you get the combined values of both.

Simply make your field "description" in lower case in your config, or even better, give it a more distinctive name to avoid collisions. As an alternative, you can also use the CharacterCaseTagger to change the case of field names.
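
To make the merging effect concrete, here is a minimal sketch of a metadata store that keeps keys exactly as written but ignores case when reading. The class `CaseInsensitiveMetaSketch` and its methods are hypothetical and written only for illustration; this is not the Importer's actual metadata implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Importer's actual metadata class.
class CaseInsensitiveMetaSketch {

    // Keys are stored exactly as written, so "DESCRIPTION" and
    // "description" can coexist as two separate entries.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String key, String value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Retrieval ignores case, so the values of every matching key
    // are combined into a single list.
    List<String> get(String key) {
        List<String> combined = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(key)) {
                combined.addAll(entry.getValue());
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        CaseInsensitiveMetaSketch meta = new CaseInsensitiveMetaSketch();
        meta.add("description", "value already present in the document");
        meta.add("DESCRIPTION", "value added by the tagger config");
        // Both entries match, so two values come back.
        System.out.println(meta.get("DESCRIPTION").size());
    }
}
```

In this sketch, asking for "DESCRIPTION" returns both values, which mirrors the extra values seen in the "meta" file above.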

liar666 commented 8 years ago

OK, thanks. Since I'm a Unix developer, I tend to assume that things are case sensitive. That's why I originally used ALL_CAPS for my tags; I thought they would not interfere with the already existing tags. I've solved the problem by following your suggestion to use more unique names to avoid collisions. However, it is quite difficult to guarantee the uniqueness of "my" names when I don't know what kind of meta tags the original author of the website might use...

essiembre commented 8 years ago

Good point about it sometimes being tough to avoid collisions. One trick I personally like is to come up with a prefix for all my own tags that's relevant to the project I am working on, or the client name, like "acme-description", "acme-whatever", etc. If you find it too problematic, we may consider a feature request that would allow optional prefixing of all metadata extracted from document parsing. This could greatly help eliminate collisions too.
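
As a rough sketch of the prefix trick, the snippet below builds a prefixed field name and checks it against a list of existing field names while ignoring case. The class `FieldPrefixSketch`, its methods, and the sample field list are all hypothetical; nothing here is part of the Norconex API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Norconex API.
class FieldPrefixSketch {

    // Builds a project-specific field name such as "acme-description".
    static String prefixed(String prefix, String field) {
        return (prefix + "-" + field).toLowerCase(Locale.ROOT);
    }

    // True if the candidate matches an existing field name, ignoring case.
    static boolean collides(String candidate, List<String> existing) {
        for (String name : existing) {
            if (name.equalsIgnoreCase(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed sample of field names a page might already carry.
        List<String> pageFields = Arrays.asList("description", "keywords", "title");
        System.out.println(collides("DESCRIPTION", pageFields));
        System.out.println(collides(prefixed("acme", "DESCRIPTION"), pageFields));
    }
}
```

The unprefixed name collides with the page's own "description" while the prefixed one does not, although a prefix only lowers the odds of a clash and cannot rule one out.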

liar666 commented 8 years ago

AFAIC, I'm quite paranoid about users doing bad things with my code, so if it were me, I would have segregated user-generated stuff from mine :) Since the user can copy/rename application-level tags into his/her own namespace, that should not cause any problem. Anyway, thanks for the effort you put into explaining; it makes the software very easy to use :)