Port the old CenoPDF Annual Census PDF Form

ghachey commented 3 weeks ago

CenoPDF is based on closed-source technology, is no longer supported, is clunky to use, and will become increasingly difficult to maintain. But the Annual Census PDF Survey is an important data collection tool of the Pacific EMIS project: it is still in use in at least one country, and new countries might want it. So it would be wise to port this tool to an open-source one that will integrate with the Pacific EMIS core app with as little change as possible. LaTeX was identified as the best tool to produce high-quality PDF forms, but it does come with some effort and challenges to overcome. This issue documents some of the early analysis on this work before diving right into it.

ghachey commented 3 weeks ago

It seems the LaTeX hyperref package supports the creation of PDF form fields. A small sample was created to assess the feasibility further. You don’t see the field IDs in Overleaf, but when you save the PDF and load it in Acrobat they are there.

The key thing is to export the data as XFDF (or XML at least), which I’ll try out with your sample, but it should not be an issue I think.

ghachey commented 3 weeks ago

Unfortunately I struck a problem with extracting data from the sample, which needs a bit of explaining.

In the PDF specification, form fields are hierarchical. Most PDF editors, it seems, allow you to express this by using dots in the form field name to separate the hierarchy steps (e.g. Survey.HtGiven is a field with partial name HtGiven, which is a child of the parent Survey). When you export this to XFDF format, the fields are nested in the resulting XML.
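To illustrate the nesting described above, here is a small Python sketch (not code from the project). The XFDF sample is hand-written and its values are made up; the only thing it is meant to show is that nested field elements correspond one-to-one with dotted full names.

```python
import xml.etree.ElementTree as ET

NS = "{http://ns.adobe.com/xfdf/}"  # standard XFDF namespace

# Hand-written nested XFDF; the values are invented for illustration.
SAMPLE = """<xfdf xmlns="http://ns.adobe.com/xfdf/">
  <fields>
    <field name="Survey">
      <field name="HtGiven"><value>Mary</value></field>
    </field>
    <field name="Enrol">
      <field name="D">
        <field name="00">
          <field name="04">
            <field name="M"><value>12</value></field>
          </field>
        </field>
      </field>
    </field>
  </fields>
</xfdf>"""


def flatten(node, prefix=""):
    """Yield (full dotted name, value) for every field under node that has a value."""
    for field in node.findall(f"{NS}field"):
        full_name = prefix + field.get("name")
        value = field.find(f"{NS}value")
        if value is not None:
            yield full_name, value.text
        # Recurse into child fields, extending the dotted prefix.
        yield from flatten(field, full_name + ".")


fields = ET.fromstring(SAMPLE).find(f"{NS}fields")
flat = dict(flatten(fields))
print(flat)
```

Each level of nesting contributes one dot-separated segment, so the leaf under Enrol comes back as Enrol.D.00.04.M.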

This can go as deep as you practically need. For example, in the grid of enrolments, the field name Enrol.D.00.04.M represents Enrolment > Data > Column 00 > Row 04 > Males. However, when I use Overleaf with hyperref on this:

\documentclass{article}
\usepackage{hyperref}

\begin{document}
  \begin{Form}
    \begin{tabular}{l}
      \TextField[name=me.name]{Name} \\
      \TextField[name=me.father]{FatherName} \\
      \TextField[name=h.First]{First} \TextField[name=h.Last, mappingname=h.Last]{Last} \TextField[name=h.Middle]{Middle} \\
    \end{tabular}
  \end{Form}
\end{document}

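When I download the PDF and extract the XFDF from it, the result is not nested: each field comes out as a top-level element carrying its full dotted name (me.name, h.First and so on).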

Why this matters: currently, when we get the hierarchical XFDF representing the entire survey, we open that XML and iterate through all the Field children of the Fields node. Each of these top-level children is identified by its name, and that node contains all the data collected in some logical part of the form (ignore the bookmarks).

In the C# code, these are iterated in PdfSurvey.Process, which, based on the name of the field node, selects the appropriate stored proc and passes it the entire nested node of fields.

So this processing is flexible and easy to extend or adapt, but it’s not going to work without modification if we don’t get the hierarchical field data.
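A rough Python sketch of that dispatch pattern may help show why flat XFDF breaks it. This is not the C# in PdfSurvey.Process: the handler functions and the HANDLERS table are hypothetical stand-ins for the stored procs, and the XFDF strings are invented.

```python
import xml.etree.ElementTree as ET

NS = "{http://ns.adobe.com/xfdf/}"


# Hypothetical stand-ins for stored procedures: each just reports
# which section it received and how many values that section holds.
def save_survey(section):
    return "Survey", len(section.findall(f".//{NS}value"))


def save_enrolments(section):
    return "Enrol", len(section.findall(f".//{NS}value"))


HANDLERS = {"Survey": save_survey, "Enrol": save_enrolments}


def process(xfdf_text):
    """Dispatch each top-level field node to a handler chosen by its name."""
    fields = ET.fromstring(xfdf_text).find(f"{NS}fields")
    handled, skipped = [], []
    for section in fields.findall(f"{NS}field"):
        name = section.get("name")
        handler = HANDLERS.get(name)
        if handler is None:
            skipped.append(name)  # no handler knows this top-level name
        else:
            handled.append(handler(section))
    return handled, skipped


NESTED = (
    '<xfdf xmlns="http://ns.adobe.com/xfdf/"><fields>'
    '<field name="Enrol"><field name="D"><field name="00"><field name="04">'
    '<field name="M"><value>12</value></field>'
    '<field name="F"><value>9</value></field>'
    "</field></field></field></field></fields></xfdf>"
)
FLAT = (
    '<xfdf xmlns="http://ns.adobe.com/xfdf/"><fields>'
    '<field name="Enrol.D.00.04.M"><value>12</value></field>'
    '<field name="Enrol.D.00.04.F"><value>9</value></field>'
    "</fields></xfdf>"
)

nested_result = process(NESTED)
flat_result = process(FLAT)
print(nested_result)
print(flat_result)
```

With the nested input, the single Enrol node reaches its handler with both values inside it. With the flat input, every top-level name is a full dotted name, so nothing matches and all the data is skipped.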

So the best solution would be to find some way to get Overleaf/hyperref to provide the right result. I played around with a few options (the mappingname parameter?) but with no luck. A brute-force option could be XSLT to fix up the errant XFDF file and re-establish the hierarchy. Anyway, if you have someone with expertise in this area, they may have struck this before and have an answer. Fingers crossed, because I still think it’s a great way to go with this.

ghachey commented 3 weeks ago

You may recall that, because we can’t use O2s PDF4Net in the open source project, I came upon and used the simpler free-to-use project PdfFileAnalyser (CodeProject licence); see class PdfForm in Pineapples.Models of the main Pacific EMIS backend. This does not have an intrinsic method for extracting XFDF, so it seems I have already built code for creating the hierarchical XFDF from a collection of Field code objects provided by PdfFileAnalyser.

ghachey commented 3 weeks ago

We do have a problem…

Consider the PDF for a form, decompressed to text markup. In the part covering the Enrol markup, field names for data look like Enrol.D.<col>.<row>.M|F, e.g. Enrol.D.02.01.M.

The element with text D (object 18888) has Parent 18874, which has text Enrol, and so on down the hierarchy. Note that the text property /T of each object is just the appropriate segment of the name.

This is all created when the PDF is created, and is driven by the presence of the dots (.) in the field name. In PdfFileAnalyser, the Parent is an exposed property of each field, so the hierarchy is easily available when I make the XFDF from the fields collection of the form.
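As a toy model of the /T and /Parent relationship just described (plain Python dictionaries, not the PdfFileAnalyser API), the full path of a field can be recovered by walking up the parent chain. Object numbers 18888 and 18874 are the ones mentioned above; the other object numbers are invented for the example.

```python
# Toy stand-in for PDF field objects: object number -> partial name (/T) and /Parent.
# 18874 and 18888 come from the markup discussed above; 18900-18902 are invented.
OBJECTS = {
    18874: {"T": "Enrol", "Parent": None},
    18888: {"T": "D", "Parent": 18874},
    18900: {"T": "02", "Parent": 18888},
    18901: {"T": "01", "Parent": 18900},
    18902: {"T": "M", "Parent": 18901},
}

# A field with no /Parent whose /T holds the whole dotted name, i.e. the
# situation where the name has not been sliced into a hierarchy.
UNSLICED = {1: {"T": "Enrol.D.02.01.M", "Parent": None}}


def path(obj_id, objects):
    """Return the list of /T segments from the top-level ancestor down to obj_id."""
    segments = []
    while obj_id is not None:
        obj = objects[obj_id]
        segments.append(obj["T"])
        obj_id = obj["Parent"]  # step up one level
    return list(reversed(segments))


sliced_path = path(18902, OBJECTS)
unsliced_path = path(1, UNSLICED)
print(".".join(sliced_path), len(sliced_path))
print(".".join(unsliced_path), len(unsliced_path))
```

Both cases join to the same full name, but only the first has a five-step hierarchy to walk; the second is a single top-level step.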

BUT it seems Overleaf/hyperref does not respect this convention of using dots to indicate hierarchy. When you look at the PDF created by Overleaf, no Parent node has been created, and the name has not been sliced up. Since no element has a parent, each element of the Fields collection appears as a top-level element, and so the generated XFDF is not hierarchical.

Possible solutions:

An expert on Overleaf/LaTeX/hyperref may know an answer to this, i.e. markup that will give the correct result (including all the /Parent objects) when the PDF is compiled.

OR

I use Wondershare PDFelement as a low-cost alternative to full Acrobat Pro.
Open in PDFelement a PDF generated by Overleaf (which is faulty: no /Parent node).
Go into form editing mode, and change the name of each form field.
Change the name of each form field back again to its original value.
Save the PDF.
Then the /Parent properties are created. That is, PDFelement respects the dot convention for hierarchical names and generates the hierarchy when saving an updated field name.

Which leaves open the possibility of finding some tool or operation that could “touch” the PDF and force a global recalculation of the fields hierarchy…

OR

Modify the XFDF generation in Pineapples to infer the hierarchy from the full names as they are presented in the /T node; e.g. understand that when processing the node Enrol.D.02.03.M we need to find or create XML nodes for each step in the hierarchy. Fiddly, but not too difficult.
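The find-or-create step could look something like the following Python sketch. It is only an illustration of the idea, not the Pineapples C# code: build_fields is a hypothetical helper, the input values are invented, and the XFDF namespace is left out to keep it short. It also assumes that every dot in a name is a hierarchy separator.

```python
import xml.etree.ElementTree as ET


def build_fields(flat_values):
    """Build a nested <fields> element from a {full dotted name: value} mapping."""
    fields = ET.Element("fields")
    for full_name, value in flat_values.items():
        node = fields
        for segment in full_name.split("."):
            # Find the child for this step of the hierarchy, or create it.
            child = next(
                (c for c in node.findall("field") if c.get("name") == segment),
                None,
            )
            if child is None:
                child = ET.SubElement(node, "field", name=segment)
            node = child
        # The last segment is the field that carries the value.
        ET.SubElement(node, "value").text = value
    return fields


# Invented values keyed by full names as they would appear in an unsliced /T.
flat = {
    "Enrol.D.02.03.M": "12",
    "Enrol.D.02.03.F": "9",
    "Survey.HtGiven": "Mary",
}
fields = build_fields(flat)
top_level = [f.get("name") for f in fields.findall("field")]
print(top_level)
print(ET.tostring(fields, encoding="unicode"))
```

The two Enrol entries share every intermediate node down to 03, so the top level ends up with just Enrol and Survey, which is the shape the existing per-section processing expects.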

ghachey commented 3 weeks ago

The third approach is the preferred one (i.e. modify the XFDF generation in Pineapples to infer the hierarchy from the full names as they are presented in the /T node).

PacificEMIS / pacific-emis-latexpdf-annual-survey-forms
