bertvandepoel / snelSLiM

A linguistic set of tools in Go and web interface in PHP to do quick Stable Lexical Marker Analysis
GNU Affero General Public License v3.0

Prevent generation and processing of invalid frequency lists #38

Closed bertvandepoel closed 4 years ago

bertvandepoel commented 4 years ago

During testing, the following problematic XML in the MASC corpus popped up:

  <a xml:id="penn-N67107" label="tok" ref="penn-n147" as="anc">
    <fs>
      <f name="base" value="****************************************************************************&#10;      *****"/>
      <f name="msd" value="NN"/>
      <f name="string" value="****************************************************************************&#10;      *****"/>
    </fs>
  </a>

As entities are decoded, &#10; (newline) and &#9; (tab) end up as literal characters inside a token and will cause invalid frequency lists (.snelslim files after preparsing). This should be prevented by handling these specific cases and, more generally, by validating the file format.
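A minimal sketch of both measures in Go, not the actual snelSLiM code. It assumes, purely for illustration, that a frequency list stores one `token<TAB>count` entry per line; the real .snelslim layout may differ. The names `sanitizeToken`, `validateLine` and `validateFrequencyList` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// sanitizeToken collapses every run of whitespace (including the newlines
// and tabs produced by decoding &#10; and &#9;) into a single space, so a
// token can never span lines or contain the field separator.
func sanitizeToken(tok string) string {
	return strings.Join(strings.Fields(tok), " ")
}

// validateLine checks one entry of the ASSUMED format: token<TAB>count,
// with a non-empty token and a positive integer count.
func validateLine(line string) error {
	fields := strings.Split(line, "\t")
	if len(fields) != 2 || fields[0] == "" {
		return fmt.Errorf("expected token<TAB>count, got %q", line)
	}
	if n, err := strconv.Atoi(fields[1]); err != nil || n < 1 {
		return fmt.Errorf("invalid count %q", fields[1])
	}
	return nil
}

// validateFrequencyList validates every line and reports the first
// offending line number.
func validateFrequencyList(r io.Reader) error {
	sc := bufio.NewScanner(r)
	for lineNo := 1; sc.Scan(); lineNo++ {
		if err := validateLine(sc.Text()); err != nil {
			return fmt.Errorf("line %d: %w", lineNo, err)
		}
	}
	return sc.Err()
}

func main() {
	// Decoded attribute value shaped like the MASC example above (shortened).
	raw := "****\n      *****"
	clean := sanitizeToken(raw)

	list := fmt.Sprintf("%s\t%d\n", clean, 1)
	if err := validateFrequencyList(strings.NewReader(list)); err != nil {
		fmt.Println("invalid frequency list:", err)
		return
	}
	fmt.Printf("valid entry for token %q\n", clean)
}
```

Sanitizing at the point where a token is written keeps generation from producing broken files, while the validation pass guards processing against files that were produced elsewhere or by an older version.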