box / viewer.js

A viewer for documents converted with the Box View API
Apache License 2.0

Extracting internal page links from PDF #156

Open patbegg opened 9 years ago

patbegg commented 9 years ago

Hi, At present, when converting a PDF that contains links, the links are extracted and displayed in the 'pagelinks' layer in the viewer. To create an internal page link within the viewer, you set the href to '#page-6', for example, to jump to page 6 of the document. However, if links are created in the PDF (as invisible rectangles, which works for email and web links) and the web link is set to '#page-6', the links are not extracted. I have also tried adding the links as 'go to a page in the document' links, but these don't get extracted either.

Is it possible to create links in the PDF that link to internal pages, that will then be extracted during the conversion process?

EDIT: I can confirm that if you set the internal link value to 'http://#page-5' it will extract the links. But obviously the links have 'http://' prepended to them when what we want is just '#page-5' to jump to an internal page in the viewer.
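As a stopgap, the prefix could be stripped on the client before the link is used. This is only a rough sketch: `stripWorkaroundPrefix` is a hypothetical helper, not part of viewer.js or the View API, and it assumes the href survives conversion exactly as 'http://#page-5', which is not confirmed.

```javascript
// Hypothetical helper, not part of viewer.js or the Box View API.
// Assumes the converted href arrives exactly as 'http://#page-N'
// (optionally with a trailing slash); that form is an assumption.
function stripWorkaroundPrefix(href) {
  var match = /^https?:\/\/(#page-\d+)\/?$/i.exec(href);
  // Return the bare fragment for workaround links, and leave
  // every other href (real web links, mailto:, etc.) untouched.
  return match ? match[1] : href;
}

console.log(stripWorkaroundPrefix('http://#page-5'));       // '#page-5'
console.log(stripWorkaroundPrefix('http://example.com/'));  // unchanged
```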

Thanks, Pat

lakenen commented 9 years ago

I'm not familiar with the proper way to author PDF internal links, but the conversion should see them if they are authored properly. The #page-{n} that you see in the viewer is actually being created by viewer.js, and is not necessarily the original href of the internal link.

Here's an example of how to do this with Microsoft Word (click the download button to see the original .docx file). https://view-api.box.com/1/sessions/a6732dda0e8244b5bfb2965418a7cdd0/view

patbegg commented 9 years ago

I think this must be to do with how your converter engine works. In the original Word doc you sent, the link has a href of '#Page1'. If I add a link in a PDF with the href '#Page1', it doesn't get extracted during the conversion from PDF to SVG/HTML/CSS. However, if I add the same link in a Word (.docx) file, it DOES extract the link. If I prepend 'http://' to the link in the PDF, it WILL extract the link, but then the links don't function correctly in the converted document.

Are you able to change the conversion engine so it extracts links that are internal links i.e. with a href of #Page3, for example, instead of discarding them as invalid links?

lakenen commented 9 years ago

In the Word doc I sent, Page1 was a bookmark I explicitly created in the document. I am not sure how to create those bookmarks in PDFs. I'll loop in the conversion team and get back to you.

patbegg commented 9 years ago

Is there any news on this? It's very common for our clients to add links to the PDF, or to add them in InDesign and then convert to PDF. At present, if I add a link to a PDF with the URL '#page=4' before conversion, I get this in the info.json after conversion: file://localhost/tmp/viewapi/workspace/convert-cb36ae56894f476aa43dfd437cd64b1b/#page-3

Can this be made uniform in some way rather than us using a regex to find the links?
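For reference, the regex approach we would otherwise need looks roughly like this. It is a sketch only: `toInternalHref` is a hypothetical name, and it works on a single link string because the surrounding structure of info.json is not shown here.

```javascript
// Hypothetical helper, not part of viewer.js.
// Matches converter output of the form file://.../#page-N and
// captures only the trailing '#page-N' fragment.
var INTERNAL_LINK = /^file:\/\/.*(#page-\d+)$/;

function toInternalHref(uri) {
  var match = INTERNAL_LINK.exec(uri);
  // null signals "not an internal page link" so callers can
  // keep the original URI for ordinary web links.
  return match ? match[1] : null;
}

var sample = 'file://localhost/tmp/viewapi/workspace/' +
  'convert-cb36ae56894f476aa43dfd437cd64b1b/#page-3';
console.log(toInternalHref(sample));                 // '#page-3'
console.log(toInternalHref('http://example.com/'));  // null
```

This depends on the temporary workspace path keeping the same shape, which is why a uniform output from the converter would be preferable.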

lakenen commented 9 years ago

@patbegg do you have an example document that exhibits this behavior? You can send it along to api@box.com or link a downloadable view api session URL here if you don't mind it being publicly accessible.