Closed bertvandepoel closed 4 years ago
During testing, the following problematic XML in the MASC corpus popped up:
<a xml:id="penn-N67107" label="tok" ref="penn-n147" as="anc"> <fs> <f name="base" value="**************************************************************************** *****"/> <f name="msd" value="NN"/> <f name="string" value="**************************************************************************** *****"/> </fs> </a>
As entities are decoded, the character references for newline and tab in these values become literal newline and tab characters, which results in invalid frequency lists (.snelslim files after preparsing). This should be prevented by handling these specific cases and, more generally, by validating the file format.
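As a rough illustration of both ideas, here is a minimal Python sketch. It is not the project's actual code: the function names `clean_token` and `validate_frequency_lines` are hypothetical, and the one-entry-per-line, tab-separated `token<TAB>count` layout is only an assumption, since the real .snelslim format is not shown in this issue.

```python
import re

# Any run of whitespace, including decoded newlines and tabs.
_WHITESPACE_RUN = re.compile(r"\s+")
# A count is assumed to be a plain run of ASCII digits.
_COUNT = re.compile(r"[0-9]+")


def clean_token(value: str) -> str:
    """Collapse every whitespace run to one space and trim the ends.

    Intended to run on attribute values after entity decoding, so that a
    decoded newline or tab can never end up inside a token.
    """
    return _WHITESPACE_RUN.sub(" ", value).strip()


def validate_frequency_lines(lines, separator="\t"):
    """Return (line_number, line) pairs for lines that are not valid.

    ASSUMPTION: a valid line is exactly `token<separator>count`. Adjust
    this check to whatever the real frequency list format requires.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(separator)
        if len(fields) != 2 or not fields[0] or not _COUNT.fullmatch(fields[1]):
            bad.append((number, line))
    return bad
```

Cleaning tokens at the point where entities are decoded addresses the specific case, while the validator is a general safety net that catches any other input that would break the line structure.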