Links appearing in pdf documents

ghost commented 5 years ago

Hi,

When parsing pdf documents which contain hyperlinks, the links end up in the extracted content.

I'm using http_collector 2.8.1 and a simple PDF document (created from Word) containing the word "test", which is a hyperlink to www.bbc.co.uk.

If I parse the document, the link also appears alongside the text once extracted.

If I parse the document using the tika-app jar which is shipped with the collector I get the same behaviour as seen through the http_collector :

# java -jar tika-app-1.16.jar -t test.pdf
test
http://www.bbc.co.uk/

If I then run it through the pdfbox-app version which I believe the http_collector is using then I don't see the link, which is the behaviour I require :

# java -jar pdfbox-app-2.0.7.jar ExtractText test.pdf
# cat test.txt
test

Is there any way I can get the http_collector/importer to parse the document and not generate the hyperlinks in the content ?

I originally thought this was an encoding issue with the crawler until I realised I could run the document via the tika/pdfbox jars to reproduce the behaviour.

Many thanks.

essiembre commented 5 years ago

Not sure if this can easily be turned off, but could stripping them after parsing be a simple enough solution?

I recommend you use ReplaceTransformer as a post-parse handler. Assuming your URLs will be followed by white space, you should be able to strip them with something like this (not tested):

<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
    <replace>
        <fromValue>https?://\S+(\s|$)</fromValue>
        <toValue></toValue>
    </replace>
</transformer>
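To see what that pattern would remove before wiring it into the importer, here is a minimal standalone sketch using plain `java.util.regex`. It is not how ReplaceTransformer is implemented; the class name `LinkStripper` and the method `stripLinks` are hypothetical, and the sample input simply mirrors the tika-app output shown earlier in this thread.

```java
import java.util.regex.Pattern;

public class LinkStripper {
    // Same expression as the <fromValue> above: an http(s) URL
    // followed by one whitespace character or the end of the input.
    private static final Pattern URL = Pattern.compile("https?://\\S+(\\s|$)");

    // Hypothetical helper: removes every match of the pattern from the text.
    public static String stripLinks(String text) {
        return URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample: the word "test" followed by the extracted link.
        String extracted = "test\nhttp://www.bbc.co.uk/\n";
        System.out.print(stripLinks(extracted));
    }
}
```

One thing to watch: `\S+` is greedy up to the next whitespace, so punctuation glued to the end of a URL (a closing parenthesis, for example) is removed along with it.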

ghost commented 5 years ago

Pascal,

Thanks for the update, and your excellent software.

I did already implement this as a workaround, and it's good to see your solution matches as confirmation of the correct approach.

I guess another option could be to call out to the pdfbox jar externally in some way, but this sort of defeats the point of using the embedded functionality and the capability of tika to manage the conversion.

If you have any further insight into how tika might be configured to drive pdfbox in the way required, that would be useful. However, as I have a workaround, you can close the ticket, as this seems like it will be a satisfactory solution.

Many thanks.

essiembre commented 5 years ago

I will update if I come across a solution. If you ever find out from the PDFBox community, let me know and I will try to integrate.

Norconex / importer

Links appearing in pdf documents #93