digital-preservation / pronom

Simple Maven Artifacts for PRONOM Signatures
http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

Better Whitespace Handling #18

Open gleporeNARA opened 3 years ago

gleporeNARA commented 3 years ago

When developing signatures for text-based formats, it would be useful to have a built-in ability to manage whitespace, and potentially line breaks as well.

Many programming languages are whitespace-agnostic: whitespace does not affect how the program is processed. Python is one exception.

Consider the following excerpts from files in the Smart Game Format (https://www.red-bean.com/sgf/):

(
;GM[1]FF[3]

(;GM[1]FF[3]

( ;GM[1]FF[3]

Each file contains the same code. However, the first example has a line break (and possibly other whitespace) after the opening parenthesis, the second example has no whitespace, and the third example has a single space between the opening parenthesis and the semicolon.

Functionally, all three excerpts are valid (as they would be with HTML, Perl, etc.), but the PRONOM signatures for all three would be different.
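To make the difference concrete, here is a small Python sketch that prints the three excerpts as hex, the way a fixed byte sequence would be written. It assumes the line break in the first excerpt is a single LF (0A); a CRLF file would differ again. The helper name `to_hex` is only for illustration.

```python
# The three excerpts as raw bytes.
# Assumption: the first excerpt uses a Unix-style "\n" line break.
excerpts = [
    b"(\n;GM[1]FF[3]",
    b"(;GM[1]FF[3]",
    b"( ;GM[1]FF[3]",
]


def to_hex(data: bytes) -> str:
    """Render bytes as an uppercase hex string."""
    return data.hex().upper()


hex_forms = [to_hex(e) for e in excerpts]
for h in hex_forms:
    print(h)
# 280A3B474D5B315D46465B335D
# 283B474D5B315D46465B335D
# 28203B474D5B315D46465B335D
```

The sequences diverge at the second byte, so a single fixed sequence anchored at the start of the file cannot cover all three.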

I'm thinking of a new signature value which indicates "some number of blank spaces, tabs, and/or linebreaks here".
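As a rough illustration of the behaviour I have in mind (this is a Python regular expression, not PRONOM signature syntax), the sketch below uses `\s*` to stand in for the proposed "any amount of whitespace" value. The names `SGF_START` and `matches_start` are hypothetical.

```python
import re

# \s* stands in for the proposed value: zero or more spaces, tabs,
# carriage returns, line feeds, vertical tabs or form feeds.
# For bytes patterns, \s matches ASCII whitespace only.
SGF_START = re.compile(rb"\(\s*;GM\[1\]FF\[3\]")


def matches_start(data: bytes) -> bool:
    """Return True if data begins with the whitespace-tolerant pattern."""
    return SGF_START.match(data) is not None


samples = [
    b"(\n;GM[1]FF[3]",   # line break after the parenthesis
    b"(;GM[1]FF[3]",     # no whitespace
    b"( ;GM[1]FF[3]",    # single space after the parenthesis
]
results = [matches_start(s) for s in samples]
print(results)
```

All three excerpts match this one pattern, while something like `(x;GM[1]FF[3]` does not, because only whitespace is allowed in the gap.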

Does this make sense, or am I missing some easier method of creating signatures that cover all of the above possibilities (plus all those allowed in many text-based formats)?