allenai / mmda

multimodal document analysis
Apache License 2.0
158 stars 18 forks source link

Add fontinfo to tokens without requiring word split #182

Closed rauthur closed 1 year ago

rauthur commented 1 year ago

This PR appends font information (font name and size) as metadata to 'tokens' on a Document without requiring tokens to be split on that information (i.e., "best" effort if a token contains many font names or sizes). The code subclasses the WordExtractor provided by PDFPlumber.

Currently, in a default configuration of PDFPlumberParser, tokens are already extracted with font name and size information (although it is discarded and only used for splitting). This could be added to the metadata as-is, however, the method used is the extra_attrs argument of PDFPlumber's extract_words which forces token splitting if font name and size does not match. I believe this is a bad default and further is not required (users can override this argument). The approach here guarantees this metadata will be captured.

Re: "best effort" in case of many name/size options for a token: I have maintained the logic of just taking the font name and size from the first character in the token. Another approach provides the min, max and average font sizes (or others) as well as a set of font names. This could be provided in the future or by adapting the append_attrs argument. For current use cases (section nesting prediction) this approach is sufficient.