CIRCSE / LEMLAT3

Morphological analyzer and lemmatizer for Latin.
http://www.lemlat3.eu/

Treatment of punctuation #19

Open Stormur opened 5 years ago

Stormur commented 5 years ago

I have noticed that punctuation marks other than the hyphen "-" are not analyzed by LEMLAT and do not even appear as unknown wordforms in the unk file (where "-" lands). However, when feeding LEMLAT a list of wordforms, one per line, for example, it would be better to be able to retrieve every one of them in one of the two output files.
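To make the expectation concrete, here is a minimal sketch of a check for wordforms that end up in neither output. It assumes the analyzed and unknown wordforms have already been read into Python collections; the function name `missing_forms` and the sample data are hypothetical, and nothing here reflects LEMLAT's actual file formats.

```python
def missing_forms(input_forms, analyzed_forms, unknown_forms):
    """Return the input wordforms found in neither output, in input order.

    Assumption: both outputs have already been parsed into plain
    collections of strings; how to parse them is not covered here.
    """
    seen = set(analyzed_forms) | set(unknown_forms)
    return [form for form in input_forms if form not in seen]


# Illustrative data only, not real LEMLAT output.
dropped = missing_forms(
    input_forms=["rosa", ",", "-", "T."],
    analyzed_forms=["rosa"],
    unknown_forms=["-"],
)
print(dropped)
```

With the behaviour requested above, this list would always be empty.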

Also, LEMLAT automatically splits a string wherever a ' appears, creating two wordforms that are subsequently analyzed, without this being mentioned in the inline output message. There should be an option to change this behaviour and make LEMLAT analyze each token as it is. Since the same happens with ".", this is very relevant for the treatment of abbreviations, which are very often tokenized as "T." or "F." to distinguish them from occurrences of the isolated letters "T" or "F".

gersh0m commented 5 years ago

The provided application is NOT intended to be in ANY WAY a text analyzer; it is rather a batch processor of word-forms. A quick-and-dirty tokenization step (based on separation on any non-word character and '-') is actually performed, but just for convenience; a full lemmatization pipeline is always advisable! Still, you are right: it should be possible to analyze a string "as it is".
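For illustration, the difference between such a convenience split and an "as it is" mode can be sketched in Python. This is not LEMLAT's code. It assumes one reading of the description above, in which word characters and "-" are kept inside tokens and everything else is a separator, chosen because it matches the report that "-" reaches the unk file. The names `quick_split` and `as_is` are hypothetical, and "abc'def" is a placeholder string.

```python
import re

# Assumption: separators are runs of characters that are neither word
# characters (regex \w) nor the hyphen. LEMLAT's real rule may differ.
SEPARATORS = re.compile(r"[^\w\-]+")


def quick_split(line):
    """Convenience split: break the line on separators, drop empty pieces."""
    return [piece for piece in SEPARATORS.split(line) if piece]


def as_is(line):
    """Requested behaviour: treat the whole stripped line as one token."""
    token = line.strip()
    return [token] if token else []


for line in ["T.", "abc'def", "-", ","]:
    print(repr(line), quick_split(line), as_is(line))
```

Under this reading, "T." collapses to "T" and a lone "," produces no token at all, which is the loss described in the report; the "as it is" variant keeps both unchanged.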