codeforamerica / pdfhook

A Python web application for converting PDF forms into PDF-filling APIs
https://pdfhook.herokuapp.com
MIT License

WIP: adding new pdftk wrapper + tests #41

Closed bengolder closed 8 years ago

bengolder commented 8 years ago

This branch is a work in progress

Changes:

bengolder commented 8 years ago

Now that all these tests are in place for parsing the form, I tried running the parser on a file with a broader set of fields and more variation.

It worked way better than I expected. It looks like we are only a short distance away from supporting many more field types! Check out the field data that was extracted:

https://gist.github.com/bengolder/5ddaa3fb553743b6ab70

Here is the PDF used: field_type_survey.pdf

I've written down some observations about how it handles fields here: https://github.com/codeforamerica/pdfhook/issues/28

Unicode handling has an interesting case. Note how "Jalapeño" comes out differently in the FDF ('/Jalape#f1o') than in the field data dump ('Jalapeño'):

 'Check Box4': {'FieldFlags': '0',
                'FieldJustification': 'Left',
                'FieldName': 'Check Box4',
                'FieldStateOption': ['Jalapeño', 'Off'],
                'FieldType': 'Button',
                'FieldValue': 'Jalapeño',
                'fdf': {'escaped_name': 'Check Box4',
                        'name': 'Check Box4',
                        'name_span': (68, 78),
                        'value_template': '/Jalape#f1o',
                        'value_template_span': (52, 63)}},
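
The `#f1` in the FDF value is consistent with how PDF name objects are escaped: `#` followed by two hex digits stands for one byte, and byte 0xF1 is "ñ" in Latin-1. The sketch below shows how such a name could be turned back into readable text. It assumes the escaped byte should be read as Latin-1, and the helper name `decode_pdf_name` is hypothetical (it is not part of pdfhook or pdftk).

```python
import re


def decode_pdf_name(raw: str, encoding: str = "latin-1") -> str:
    """Decode #xx hex escapes in a PDF name such as '/Jalape#f1o'.

    Assumption: each escaped byte is interpreted using `encoding`.
    """
    # Drop the leading slash that marks a PDF name object.
    name = raw[1:] if raw.startswith("/") else raw
    # Replace each #xx escape with the single byte it encodes.
    data = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        name.encode(encoding),
    )
    return data.decode(encoding)


print(decode_pdf_name("/Jalape#f1o"))  # Jalapeño
```

With a step like this, the FDF value and the 'FieldValue' from the data dump would compare equal for this checkbox.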

And though it had trouble with "Jalapeño", it seems to handle Unicode better in other places:

 'MötleyCrüe': {'FieldFlags': '0',
                'FieldJustification': 'Left',
                'FieldName': 'MötleyCrüe',
                'FieldType': 'Text',
                'FieldValue': 'Just another text field',
                'fdf': {'escaped_name': 'MötleyCrüe',
                        'name': 'MötleyCrüe',
                        'name_span': (358, 368),
                        'value_template': '(Just another text field)',
                        'value_template_span': (328, 353)}},
 'Text12': {'FieldFlags': '0',
            'FieldJustification': 'Left',
            'FieldName': 'Text12',
            'FieldType': 'Text',
            'FieldValue': '¡Ojalá!',
            'fdf': {'escaped_name': 'Text12',
                    'name': 'Text12',
                    'name_span': (573, 579),
                    'value_template': '(¡Ojalá!)',
                    'value_template_span': (559, 568)}}
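
In the dumps above, each 'value_template_span' has the same length as its 'value_template' (for example, 568 - 559 = 9 characters for '(¡Ojalá!)'), which suggests the spans are half-open character offsets into the FDF text. Under that assumption, filling a text field could be sketched as a string splice. The function names and the FDF fragment below are hypothetical and only for illustration; this is not necessarily how pdfhook does it.

```python
def escape_pdf_string(value: str) -> str:
    """Backslash-escape characters that are special in a PDF literal string."""
    return value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def fill_value(fdf_text: str, span: tuple, new_value: str) -> str:
    """Replace fdf_text[start:end] with a new parenthesized literal string."""
    start, end = span
    return fdf_text[:start] + "(" + escape_pdf_string(new_value) + ")" + fdf_text[end:]


# Hypothetical FDF fragment; the span is computed here, not taken from the real file.
fragment = "<< /V (Just another text field) /T (MötleyCrüe) >>"
template = "(Just another text field)"
start = fragment.index(template)
span = (start, start + len(template))

print(fill_value(fragment, span, "Dr. Feelgood (1989)"))
```

When several fields are filled in one pass, applying the replacements from the highest offset to the lowest keeps the earlier spans valid.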

Check out this example of a set of radio options:

 'Group9': {'FieldFlags': '49152',
            'FieldJustification': 'Left',
            'FieldName': 'Group9',
            'FieldStateOption': ['0', '1', '2', 'Choice1', 'Choice2', 'Off'],
            'FieldType': 'Button',
            'FieldValue': '1',
            'fdf': {'escaped_name': 'Group9',
                    'name': 'Group9',
                    'name_span': (279, 285),
                    'value_template': '(Choice1)',
                    'value_template_span': (265, 274)}},

And this example of a dropdown:

 'Dropdown9': {'FieldFlags': '393216',
               'FieldJustification': 'Left',
               'FieldName': 'Dropdown9',
               'FieldStateOption': ['apple',
                                    'apricot',
                                    'banana',
                                    'cranberry',
                                    'date',
                                    'fig',
                                    'grape',
                                    'lime',
                                    'mango',
                                    'orange',
                                    'peach',
                                    'raspberry',
                                    'tamarind'],
               'FieldType': 'Choice',
               'FieldValue': 'fig',
               'FieldValueDefault': 'apple',
               'fdf': {'escaped_name': 'Dropdown9',
                       'name': 'Dropdown9',
                       'name_span': (131, 140),
                       'value_template': '(fig)',
                       'value_template_span': (121, 126)}},
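
The 'FieldFlags' values in the radio and dropdown examples are bit fields. In the PDF field-flag tables, bit position n has the value 1 << (n - 1), so 49152 and 393216 can be broken down into named flags. The sketch below covers only a subset of the flags, and the function name `decode_field_flags` is hypothetical.

```python
# Flag values from the PDF field-flag tables (bit position n == 1 << (n - 1)).
COMMON_FLAGS = {1 << 0: "ReadOnly", 1 << 1: "Required", 1 << 2: "NoExport"}
BUTTON_FLAGS = {1 << 14: "NoToggleToOff", 1 << 15: "Radio", 1 << 16: "Pushbutton"}
CHOICE_FLAGS = {1 << 17: "Combo", 1 << 18: "Edit", 1 << 19: "Sort", 1 << 21: "MultiSelect"}


def decode_field_flags(field: dict) -> list:
    """Return the names of the flags set in a pdftk-style field dict."""
    flags = int(field["FieldFlags"])
    table = dict(COMMON_FLAGS)
    # Button and Choice fields reuse the same bit range with different meanings.
    if field["FieldType"] == "Button":
        table.update(BUTTON_FLAGS)
    elif field["FieldType"] == "Choice":
        table.update(CHOICE_FLAGS)
    return [name for bit, name in sorted(table.items()) if flags & bit]


print(decode_field_flags({"FieldType": "Button", "FieldFlags": "49152"}))
# ['NoToggleToOff', 'Radio']
print(decode_field_flags({"FieldType": "Choice", "FieldFlags": "393216"}))
# ['Combo', 'Edit']
```

Read this way, 'Group9' is a radio group that cannot be toggled off, and 'Dropdown9' is an editable combo box, which would let the two 'Button' variants (checkbox vs. radio) be told apart from the dump alone.
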
bengolder commented 8 years ago

Merging. But there are more tests to come, and eventually tasks.py will be deprecated.