cksource-archive / quail-enhancement

Quail research ☠☠This repository has been deprecated ☠☠

documentAbbrIsUsed - better matching for abbreviations #1

Open mlewand opened 9 years ago

mlewand commented 9 years ago

Problem

There are a few matching problems.

Diacritic Characters

The assessment strips diacritic characters because of a RegExp replacement. As a result, BOŚ is recognized as BO, which might be a totally different thing.
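A minimal sketch of diacritic-aware matching, assuming the current pattern is ASCII-only (something like `[A-Z]`). The function name `findUppercaseAbbreviations` is hypothetical and is not part of Quail. It relies on Unicode property escapes, which need the `u` flag and a reasonably modern JavaScript engine.

```typescript
// Hypothetical helper, not Quail's actual implementation.
function findUppercaseAbbreviations(text: string): string[] {
  // \p{Lu} matches any Unicode uppercase letter, so "Ś" is kept.
  // \b only understands ASCII word characters, so lookarounds on \p{L}
  // are used as Unicode-aware word boundaries instead.
  const pattern = new RegExp("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})", "gu");
  return text.match(pattern) || [];
}

console.log(findUppercaseAbbreviations("Konto w BOŚ, nie w BO"));
```

This assumes the text uses precomposed characters (NFC). If the input may contain decomposed forms, normalizing it first would probably be needed.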

Abbreviation With Mixed Case

Currently this assessment matches only fully capitalized abbreviations. Often we're working with abbreviations that involve mixed casing. I'd love it to match abbreviations like GoF, DfE, PGNiG.

That might sound tricky, because there are some funny names like GitHub, oEmbed, MathJax, and we can't allow these to be considered abbreviations. A possible solution is to require that more than 50% of the characters in the "word" (abbreviation) are uppercase. That would eliminate the majority of false matches.
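The 50% rule could be sketched roughly like this. The name `isLikelyAbbreviation` is hypothetical, and the minimum length of two characters is an added assumption that keeps single capital letters out.

```typescript
// Hypothetical helper illustrating the "more than 50% uppercase" rule.
function isLikelyAbbreviation(word: string): boolean {
  let upper = 0;
  for (let i = 0; i < word.length; i++) {
    const ch = word.charAt(i);
    // A character counts as uppercase if lowercasing changes it.
    // This also covers letters with diacritics such as "Ś".
    if (ch !== ch.toLowerCase()) {
      upper++;
    }
  }
  // Assumption: one-letter words are never treated as abbreviations.
  return word.length > 1 && upper / word.length > 0.5;
}

for (const word of ["GoF", "DfE", "PGNiG", "GitHub", "oEmbed", "MathJax"]) {
  console.log(word, isLikelyAbbreviation(word));
}
```

With this rule GoF, DfE and PGNiG pass, while GitHub, oEmbed and MathJax are rejected. A capitalized two-letter word such as "It" sits exactly at 50% and is rejected as well.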

Matching Abbreviation With Numbers

I think that matching abbreviations with numbers is actually too much :) Numbers are most commonly used in l33tspeak shortcuts. Examples are:

Currently the abbreviation assessment checks the whole document and reports all the abbreviations found, without binding them to any particular element.

Instead we need to provide more precision and show each abbreviation occurrence. The end user will need to accept each occurrence of the abbreviation.
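One way to report each occurrence separately is sketched below. It works on a plain string and records the offset of every match. The `AbbreviationOccurrence` type and `findOccurrences` function are hypothetical, and binding an offset back to a DOM element is left out. For brevity the pattern only covers fully uppercase abbreviations.

```typescript
// Hypothetical shape of a single reported occurrence.
interface AbbreviationOccurrence {
  text: string;  // the matched abbreviation
  index: number; // offset of the match within the scanned text
}

function findOccurrences(text: string): AbbreviationOccurrence[] {
  // Assumed pattern: two or more uppercase letters not adjacent to other letters.
  const pattern = new RegExp("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})", "gu");
  const found: AbbreviationOccurrence[] = [];
  let match: RegExpExecArray | null;
  // exec() with the "g" flag advances through the text one match at a time,
  // so repeated abbreviations produce separate entries.
  while ((match = pattern.exec(text)) !== null) {
    found.push({ text: match[0], index: match.index });
  }
  return found;
}

console.log(findOccurrences("NASA and ESA, then NASA again"));
```

Running the same scan per text node instead of on the whole document would give each occurrence an element it can be attached to.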

More Resources:

mlewand commented 9 years ago

Possible traps: We need to watch out for cases where the whole text is written in uppercase.

Legal notes are often written in uppercase, so matching must not treat every word in them as an abbreviation.
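A possible guard against this trap is to skip blocks of text that are mostly uppercase. This is only a sketch: `isMostlyUppercase` is a hypothetical name and the 0.8 threshold is an arbitrary assumption that would need tuning.

```typescript
// Hypothetical guard: true when most cased letters in the text are uppercase.
function isMostlyUppercase(text: string, threshold: number = 0.8): boolean {
  let letters = 0;
  let upper = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const up = ch.toUpperCase();
    // Characters without distinct cases (digits, spaces, punctuation) are ignored.
    if (ch.toLowerCase() === up) {
      continue;
    }
    letters++;
    if (ch === up) {
      upper++;
    }
  }
  return letters > 0 && upper / letters >= threshold;
}

console.log(isMostlyUppercase("THIS SOFTWARE IS PROVIDED AS IS."));
console.log(isMostlyUppercase("The DfE published new guidance."));
```

The assessment could call such a check on a paragraph before looking for abbreviations in it and skip the paragraph when the check returns true.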