V3RITAS closed this issue 8 years ago
Those are likely characters that do not translate well into real UTF-8 characters when extracted. Can you share an example document to help reproduce?
Sure, I'm using this document for testing: www.hager.de/files/download/0/2713_1/0/BFT-210_ANLEITUNG_BEDIENUNG_02_2013.PDF
It's a manual for a door opener in German. ;-)
The document contains a lot of images and icons, but most of them seem to be ignored by the parser.
To get rid of characters that are not rendered properly, you can use the ReplaceTransformer.
For this to work, you have to correctly identify and copy the UTF-8 characters to replace. In your file, the square bullets in the PDF came across as empty boxes after extraction, when viewed in an editor that supports UTF-8. You can replace them with any character you like.
To replace a character, you can paste it into your XML config as is, or you can use its Unicode value instead. For instance, using http://unicodelookup.com/ I found that the Unicode value of the square box is F06E (hex).
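If looking characters up by hand gets tedious, a small standalone helper can list the candidates for you. The following is only a sketch in plain Java, independent of the Norconex API: the class and method names (`PrivateUseScanner`, `findPrivateUse`) are made up for illustration, and it assumes the problem characters fall in the Unicode Private Use category, as F06E does.

```java
import java.util.Set;
import java.util.TreeSet;

public class PrivateUseScanner {

    // Returns the distinct Private Use code points found in the text,
    // formatted as "U+XXXX" (hypothetical helper, not part of Norconex).
    public static Set<String> findPrivateUse(String text) {
        Set<String> found = new TreeSet<>();
        text.codePoints()
            // Keep only code points whose general category is Private Use.
            .filter(cp -> Character.getType(cp) == Character.PRIVATE_USE)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Sample text standing in for content extracted from a PDF.
        String extracted = "\uF06E First item\n\uF06E Second item";
        System.out.println(findPrivateUse(extracted));
    }
}
```

Each value it reports can then be added to the replacement list in the config.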
I successfully tested replacing all occurrences of that value with hyphens (could be anything you like), by adding the following to the importer section:
<importer>
  <postParseHandlers>
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue>\uF06E</fromValue>
        <toValue>-</toValue>
      </replace>
    </transformer>
  </postParseHandlers>
</importer>
You can add as many replacements as you like.
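If the list of individual replacements grows, one alternative is to match the whole Unicode Private Use category at once. The sketch below shows the idea in plain Java only. It does not use ReplaceTransformer, and whether the transformer accepts such a pattern is not confirmed here. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrivateUseCleaner {

    // \p{Co} matches any character in the Unicode "private use" category.
    private static final Pattern PRIVATE_USE = Pattern.compile("\\p{Co}");

    // Replaces every Private Use character with the given literal string
    // (hypothetical helper for illustration only).
    public static String replacePrivateUse(String text, String replacement) {
        return PRIVATE_USE.matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replacePrivateUse("\uF06E First item", "-"));
    }
}
```

This trades precision for convenience: every Private Use character gets the same replacement, so it only fits when you do not need different substitutes for different symbols.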
Please confirm whether that works for you.
Hi Pascal,
Thank you for your solution! I think this might work for us, although we will have to update the replacement list from time to time.
Thanks for your help!
You are welcome!
We are using the Norconex HTTP Collector to crawl HTML & binary files and send certain meta-fields and the content text to a Solr server.
Overall, the extraction of text from binary files (PDF, PowerPoint, ...) works pretty well. But for some files we get strange-looking characters like this:
When I look in the original binary file, these seem to be characters like bullet points.
Is there a way to ignore or filter such special characters during parsing?