UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

TryGetForm does not support field partial names with a "." #848

Closed. vanillaFriday closed this issue 5 months ago

vanillaFriday commented 5 months ago

PDF-XChange Editor offers a very handy option, "Create Multiple Copies...", for any form field. The copies are named with the syntax [field partial name].[vertical counter].[horizontal counter].

When loading such a document, GetAcroForm does not copy the fields with one or two "." in the name to the acroDictionary. A test PDF with those fields is attached.

Changing the "." to "_" fixes the problem.

Please add support for fields with a single or multiple "." in the partial name.

test_doc.pdf

BobLd commented 5 months ago

@vanillaFriday I'm working on it. Can you tell me which property/field should have [field partial name].[vertical counter].[horizontal counter]?

Is there any way I can see the values in Acrobat Reader?

vanillaFriday commented 5 months ago

@BobLd it should be the AcroFieldBase.Information.PartialName property. I currently have no Adobe Reader available on this machine. The property is called "Field Name" in XChange.

BobLd commented 5 months ago

@vanillaFriday this is what I see in your sample doc: [image]

Using the Winking Pdf Analyzer app (not related to PdfPig), it seems the values are the same. [image] [image]

Either Winking Pdf Analyzer has the same problem as PdfPig (quite possible given how PDFs work!), or the information is missing in your PDF.

Can you double check?

vanillaFriday commented 5 months ago

@BobLd I found a similar question/problem here with an appropriate answer (different project): https://github.com/empira/PDFsharp/issues/34

So maybe PDF-XChange is creating children and that's the reason for the "."? It looks as follows on my screen: [image]

BobLd commented 5 months ago

@vanillaFriday they are indeed children. If you want to see the names you have on screen, you might want to use something like:

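public void Issue848()
{
    using (var document = PdfDocument.Open(IntegrationHelpers.GetDocumentPath("test_doc_issue_848"), ParsingOptions.LenientParsingOff))
    {
        // Stop here if the document has no AcroForm, so 'form' is never null below.
        if (!document.TryGetForm(out var form))
        {
            return;
        }

        foreach (var field in form.Fields)
        {
            // Dotted names ("parent.child.grandchild") of the terminal fields under this root.
            var names = GetText(field).ToArray();
        }
    }
}

// Walks the field tree depth-first, joining the partial names with "." along the way.
private static IEnumerable<string> GetText(AcroFieldBase acro, string text = null)
{
    if (text is null)
    {
        text = acro.Information.PartialName;
    }
    else
    {
        text += "." + acro.Information.PartialName;
    }

    if (acro is AcroNonTerminalField nonTerminal)
    {
        foreach (var child in nonTerminal.Children)
        {
            foreach (var t in GetText(child, text))
            {
                yield return t;
            }
        }
    }
    else if (acro.Information.Parent.HasValue)
    {
        // Terminal field with a parent: 'text' now holds its full dotted name.
        // Terminal fields without a parent are not returned by this branch.
        yield return text; // final
    }
}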

Result for the first one: [image]
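For readers who want to go the other way, from a dotted name back to a field, the recursion above can be inverted by walking one level of children per name segment. In the PDF specification a field's fully qualified name is its ancestors' partial names joined with periods, which is why the copies show up as children rather than as fields with a "." in the partial name. The following is only a minimal sketch of that idea: `FieldNode`, `FieldNames`, `Flatten` and `Find` are hypothetical stand-ins written for this example, not PdfPig types or methods, and the field name "Text1" is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a form field: a partial name plus optional children.
// This is NOT a PdfPig type; it only models the parent/child structure.
public sealed class FieldNode
{
    public FieldNode(string partialName, params FieldNode[] children)
    {
        PartialName = partialName;
        Children = children;
    }

    public string PartialName { get; }

    public IReadOnlyList<FieldNode> Children { get; }
}

public static class FieldNames
{
    // Yields the dotted name of every leaf under 'node',
    // joining the partial names along the path with ".".
    public static IEnumerable<string> Flatten(FieldNode node, string prefix = null)
    {
        var name = prefix is null ? node.PartialName : prefix + "." + node.PartialName;

        if (node.Children.Count == 0)
        {
            yield return name;
            yield break;
        }

        foreach (var child in node.Children)
        {
            foreach (var full in Flatten(child, name))
            {
                yield return full;
            }
        }
    }

    // Resolves a dotted name such as "Text1.0.1" by descending one level per segment.
    // Returns null as soon as a segment has no matching field.
    public static FieldNode Find(IEnumerable<FieldNode> roots, string qualifiedName)
    {
        IEnumerable<FieldNode> level = roots;
        FieldNode current = null;

        foreach (var segment in qualifiedName.Split('.'))
        {
            current = level.FirstOrDefault(f => f.PartialName == segment);
            if (current is null)
            {
                return null;
            }

            level = current.Children;
        }

        return current;
    }
}
```

With a tree shaped like Text1 → 0 → {0, 1}, `Flatten` on the root yields Text1.0.0 and Text1.0.1, and `Find(roots, "Text1.0.1")` returns the leaf whose partial name is 1. The same walk could presumably be written against `AcroNonTerminalField.Children` and `Information.PartialName` as used in the snippet above, but that adaptation is untested here.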