empira / PDFsharp-1.5

A .NET library for processing PDF
MIT License

Internal links within the PDF are not preserved #84

Closed icnocop closed 5 years ago

icnocop commented 5 years ago

When I use PDFsharp to open a PDF file and then save it to a different file, the internal links break: clicking a link no longer performs any action.

Expected Behavior

I expected the actions on the internal links within the PDF to be preserved so that clicking on an internal link goes to the same page as before.

Actual Behavior

The internal links are broken; clicking a link no longer performs any action.

Steps to Reproduce the Behavior

  1. Download and extract PDFSharpLinksTest.zip to a directory of your choice.
  2. Open PDFSharpLinksTest.sln in Visual Studio 2017.
  3. Build the solution in the Debug | Any CPU configuration.
  4. Run the TestMethod1 unit test:

        [TestMethod]
        public void TestMethod1()
        {
            string inputFilePath = "input.pdf";
            string outputFilePath = "output.pdf";
    
            using (var pdfDocument = PdfReader.Open(inputFilePath))
            {
                pdfDocument.Save(outputFilePath);
            }
        }
  5. Open bin\Debug\output.pdf and notice that the links on the first page are broken; clicking a link no longer performs any action.

I can open the input.pdf file included in the attached zip using Adobe Acrobat Reader DC v2019.010.20098 and the links on the first page work without issues.

I'm running on Windows 10 64-bit Version 1809 (OS Build 17763.292) and using Visual Studio 2017 Enterprise Version 15.9.8.

PDFsharp v1.51.5185-beta.

Any ideas?

Thank you.

icnocop commented 5 years ago

I also tried the following versions of PDFsharp, and they all exhibit the same issue: 1.3.0, 1.32.2602, 1.32.3057, 1.50.5147.

StLange commented 5 years ago

PDF has a syntax element called a name object: a slash followed by characters, e.g. /SomeName123.

Your sample file was created with Qt and contains names like

/file#3a#2f#2f#2fC#3a#2fWorkingTFS#2fNovaTeam#2fdev#2fManagement#20Server#2fSOURCE#2fHelp#2f_pdf#2f_raw#2f_pdf#2farticles#2fIntroduction.html

or

/#10#14#3a#25#bbK#96#a4#ad#db#fd#3d#daio#e6#adY#8f#9c

PDFsharp unnecessarily decodes each #xx sequence into its hex value while parsing and then fails to reproduce the name correctly in the output file.
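To illustrate why decoding without re-encoding damages such names, here is a minimal, self-contained sketch. It assumes only the PDF name syntax rule that # followed by two hex digits stands for one character code. The class PdfNameEscapes and its methods Decode and Encode are hypothetical names made up for this example; they are not part of PDFsharp and do not show its actual implementation.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not PDFsharp API.
public static class PdfNameEscapes
{
    // Delimiters and '#' cannot appear unescaped inside a name token.
    private const string MustEscape = "()<>[]{}/%#";

    // Decodes the body of a name token (the text after the leading slash).
    public static string Decode(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            bool isEscape = raw[i] == '#' && i + 2 < raw.Length
                && Uri.IsHexDigit(raw[i + 1]) && Uri.IsHexDigit(raw[i + 2]);
            if (isEscape)
            {
                // "#2f" becomes the single character '/'.
                sb.Append((char)Convert.ToInt32(raw.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                sb.Append(raw[i]);
            }
        }
        return sb.ToString();
    }

    // Re-escapes a decoded name so that it is a single valid token again.
    public static string Encode(string decoded)
    {
        var sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            if (c < '!' || c > '~' || MustEscape.IndexOf(c) >= 0)
                sb.Append('#').Append(((int)c).ToString("x2"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, Decode("file#3a#2f#2fC") yields file://C. Written after a slash without re-escaping, that text no longer forms one name token, because each / starts a new name. Encode("file://C") yields file:#2f#2fC, which differs byte for byte from what Qt wrote but decodes to the same name. Keeping the token literal, as in the quick fix, avoids the problem entirely.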

To fix the bug, we now parse name objects literally. As a quick fix, replace the ScanName function in Lexer.cs with the following code; the links in the output file then work.

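    public Symbol ScanName()
    {
        Debug.Assert(_currChar == Chars.Slash);

        _token = new StringBuilder();
        while (true)
        {
            // Append the current character to the token and advance.
            char ch = AppendAndScanNextChar();
            // A name ends at white space, a delimiter, or the end of the input.
            if (IsWhiteSpace(ch) || IsDelimiter(ch) || ch == Chars.EOF)
                return _symbol = Symbol.Name;

    // 'true_' is not a defined symbol, so the hex decoding below is compiled out
    // and #xx sequences stay in the name exactly as written.
    #if true_
            if (ch == '#')
            {
                ScanNextChar(true);
                char[] hex = new char[2];
                hex[0] = _currChar;
                hex[1] = _nextChar;
                ScanNextChar(true);
                // TODO Check syntax
                ch = (char)(ushort)int.Parse(new string(hex), NumberStyles.AllowHexSpecifier);
                _currChar = ch;
            }
    #endif
        }
    }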

We will fix this bug in the next release.

icnocop commented 4 years ago

Thank you.

I just wanted to follow up to see if "the next release" with this fix is available, and if so, where I can download it?

Thank you.

jacobpoldenzuhlke commented 2 years ago

This is definitely not fixed. @StLange is this issue going to be resolved?

@icnocop did you manage to find a workaround in the end?