Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Special characters in binary files #238

Closed: V3RITAS closed this issue 8 years ago

V3RITAS commented 8 years ago

We are using the Norconex HTTP Collector to crawl HTML & binary files and send certain meta-fields and the content text to a Solr server.

Overall, the extraction of text from binary files (PDF, PowerPoint, ...) works quite well. But for some files we get strange-looking characters like these:

html_entity_01 html_entity_02

When I look at the original binary file, these seem to be characters such as bullet points.

Is there a way to ignore or filter such special characters during parsing?

essiembre commented 8 years ago

Those are likely characters that do not translate well into real UTF-8 characters when extracted. Can you share an example document to help reproduce the issue?

V3RITAS commented 8 years ago

Sure, I'm using this document for testing: www.hager.de/files/download/0/2713_1/0/BFT-210_ANLEITUNG_BEDIENUNG_02_2013.PDF

It's a manual for a door opener in German. ;-)

The document contains a lot of images and icons, but most of them seem to be ignored by the parser.

essiembre commented 8 years ago

To get rid of characters that are not rendered properly, you can use the ReplaceTransformer.

For this to work, you have to correctly identify (or copy) the characters to replace. With your file, I could see that the square bullets in the PDF came through as empty boxes after extraction, when the extracted text was opened in an editor that supports UTF-8. You can replace them with any character you like.

To replace a character, you can paste it into your XML config as is, or you can use its Unicode value. For instance, using http://unicodelookup.com/ I found that the Unicode value of the square box is F06E (hex). That code point lies in the Unicode Private Use Area, which is why most fonts show it as an empty box.
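If looking characters up by hand gets tedious, a small script can list the unusual code points found in a piece of extracted text. The following is only a sketch and not part of the Norconex Collector: the function name `find_suspicious_chars` and the choice of Unicode categories (private use, unassigned, control) are assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def find_suspicious_chars(text):
    """Count characters that commonly show up as empty boxes after extraction.

    Assumption: "suspicious" means Unicode categories Co (private use),
    Cn (unassigned) and Cc (control), ignoring tab, newline and carriage return.
    """
    suspicious_categories = {"Co", "Cn", "Cc"}
    ordinary_whitespace = {"\t", "\n", "\r"}
    counts = Counter()
    for ch in text:
        if ch in ordinary_whitespace:
            continue
        if unicodedata.category(ch) in suspicious_categories:
            # Report the code point in the usual U+XXXX notation.
            counts["U+%04X" % ord(ch)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample standing in for text extracted from a PDF.
    sample = "\uf06e Reset\n\uf06e Programmierung"
    for code_point, count in find_suspicious_chars(sample).most_common():
        print(code_point, count)
```

Run against extracted text containing the square box described above, this should report the code point as U+F06E together with how often it occurs.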

I successfully tested replacing all occurrences of that character (F06E) with hyphens (any replacement text would do) by adding the following to the importer section of the configuration:

<importer>
  <postParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue>\uF06E</fromValue>
        <toValue>-</toValue>
      </replace>
    </transformer>
  </postParseHandlers>
</importer>

You can add as many replacements as you like.
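If the list of replacements grows, the `<replace>` entries could be generated rather than typed by hand. This is a hedged sketch only: `build_replace_entries` is a hypothetical helper, and it simply mirrors the element names and `\uXXXX` notation from the config above. The second code point in the example (F0B7) is a placeholder, not something found in the test document.

```python
from xml.sax.saxutils import escape


def build_replace_entries(code_points, to_value="-"):
    """Render one <replace> element per code point, using \\uXXXX notation."""
    lines = []
    for cp in code_points:
        if not 0 <= cp <= 0xFFFF:
            # The four-digit notation used in the config only covers the BMP.
            raise ValueError("only BMP code points fit the \\uXXXX notation")
        lines.append("<replace>")
        lines.append("  <fromValue>\\u%04X</fromValue>" % cp)
        lines.append("  <toValue>%s</toValue>" % escape(to_value))
        lines.append("</replace>")
    return "\n".join(lines)


if __name__ == "__main__":
    # 0xF06E comes from this thread; 0xF0B7 is a hypothetical second entry.
    print(build_replace_entries([0xF06E, 0xF0B7]))
```

The printed elements would then be pasted inside the `<transformer>` element of the ReplaceTransformer.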

Please confirm whether that works for you.

V3RITAS commented 8 years ago

Hi Pascal,

Thank you for your solution! I think this might work for us, although we will have to update the replacement list from time to time.

Thanks for your help!

essiembre commented 8 years ago

You are welcome!