CIRCSE / LEMLAT3

Morphological analyzer and lemmatizer for Latin.
http://www.lemlat3.eu/

Common words and forms missing from LEMLAT #6

Open nevenjovanovic opened 7 years ago

nevenjovanovic commented 7 years ago

We have tested LEMLAT on a corpus of classical Latin texts from a university reading list: Terence's Adelphoe, Horace's Odes Bk. 1, Tibullus Bk. 1, and Seneca's Letters Bk. 1 (all editions from the PerseusDL collection). The corpus contains some 23,700 words and 8,538 different word forms. Besides various forms of personal names (and some typos in our sources), there were 40 word forms not recognized by LEMLAT. That is a tiny share of all forms (under 0.5%), but the list is below. Some forms seem to go unrecognized for orthographical reasons (ë; omitted -p- in emta, demsi; oe in foeneraret; words joined instead of separated, as in illiusmodi). Others have to do with meter in comedy: the elided -n', from -ne, is regularly not recognized by LEMLAT. Some missing forms are fairly common: norimus, nosse.

I propose that the forms from the list below be added to the LEMLAT database.


adteruisse
audistin
coëmisse
demseris
demsi
egon
emta
emtae
emtam
foeneraret
haecine
hancine
hocine
hoscine
illan
illiusmodi
ipsus
lucu
men
norimus
nosse
nossem
nostin
numquidnam
poëta
poëtae
posthaec
propediem
quamobrem
quamprimum
quandoquidem
quorundam
quotannis
refrixerit
sumtuosa
tamdiu
tantummodo
tercentenas
tetigin
tun
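
The orthographical and metrical patterns described above could be tried as a pre-processing step before a form is reported as unrecognized. What follows is only a rough sketch under stated assumptions: the function names and the rule table are hypothetical and are not part of LEMLAT, the replacement spellings (for example `fen`/`faen` for `foen`) are my assumption about what the lexicon expects, and joined words such as illiusmodi are not handled.

```python
import unicodedata

# Hypothetical respelling rules drawn from the cases named in the report.
# They are illustrative only and are not taken from LEMLAT itself.
RESPELLINGS = [
    ("emt", "empt"),    # emta -> empta
    ("dems", "demps"),  # demsi -> dempsi
    ("sumt", "sumpt"),  # sumtuosa -> sumptuosa
    ("foen", "fen"),    # foeneraret -> feneraret
    ("foen", "faen"),   # foeneraret -> faeneraret
]


def strip_diaeresis(form: str) -> str:
    """Drop the combining diaeresis, e.g. poëta -> poeta."""
    decomposed = unicodedata.normalize("NFD", form)
    return unicodedata.normalize("NFC", decomposed.replace("\u0308", ""))


def candidates(form: str) -> list[str]:
    """Return alternative spellings to retry, in order, excluding the input."""
    base = strip_diaeresis(form.lower())
    out = [base]
    for old, new in RESPELLINGS:
        if old in base:
            out.append(base.replace(old, new))
    # Elided enclitic: egon -> ego (host word) and egone (full -ne).
    if base.endswith("n") and len(base) > 2:
        out.append(base[:-1])
        out.append(base[:-1] + "ne")
    # Deduplicate while preserving order, and skip the original form.
    seen: set[str] = set()
    result = []
    for cand in out:
        if cand != form and cand not in seen:
            seen.add(cand)
            result.append(cand)
    return result
```

The final-n rule overgenerates (any word ending in -n gets candidates), so it only makes sense as a fallback for forms that have already failed a normal lookup.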

passarom commented 7 years ago

Thank you, Neven. Very helpful, indeed.

The forms you propose to include in the db are mostly "exceptional forms" (in LEMLAT's terminology) of lemmas that are already recorded. See the documentation for the details of such forms. Basically, LEMLAT does not segment these forms: their analysis is fully hard-coded in a dedicated table of the db (called "forme_ecc").
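
To illustrate the idea of exceptional forms, here is a minimal sketch of a lookup that consults a table of hard-coded analyses before attempting segmentation. Everything in it is an assumption for illustration: the dictionary merely stands in for "forme_ecc" (whose real columns are not shown in this thread), the `analyze` function and the `segment` callback are hypothetical names, and the stored analysis strings are not copied from the LEMLAT db.

```python
from typing import Callable, Optional

Analysis = dict[str, str]

# Hypothetical stand-in for rows of the "forme_ecc" table:
# each unsegmented form maps directly to its stored analysis.
EXCEPTIONAL_FORMS: dict[str, Analysis] = {
    "nosse": {"lemma": "nosco", "features": "contracted perfect active infinitive"},
}


def analyze(
    form: str,
    segment: Callable[[str], Optional[Analysis]],
) -> Optional[Analysis]:
    """Look the form up as an exceptional form first, then fall back to segmentation."""
    hit = EXCEPTIONAL_FORMS.get(form)
    if hit is not None:
        # Exceptional forms are never segmented; the stored analysis is returned as is.
        return {"form": form, "route": "exceptional", **hit}
    segmented = segment(form)  # hypothetical segmentation-based analyzer
    if segmented is not None:
        return {"form": form, "route": "segmented", **segmented}
    return None  # unrecognized form
```

Under this reading, adding a form from the list means adding one row to the table, with no change to the segmentation rules.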

I will check each of these forms in the lexicographic sources of LEMLAT (Georges, OLD, Laterculi + Onomasticon of Forcellini). If they are there, they will be included in the db (this might be the case of "nosse"). If not, we will have to decide how to handle them, since in the "lessario" table we want to keep forms not reported by the sources of LEMLAT separate from the rest (there is a specific column for such information: src).

Thank you again!

Marco

passarom commented 4 years ago

Before making this change, bear in mind that it affects LEMLAT and only potentially the LiLa lemma bank (lemmario) as well.

Neven's list is a list of forms not recognized by LEMLAT. They are not lemmas and do not affect the LiLa lemma bank, except in the two cases described below. Any change made to resolve them has to be made in LEMLAT, and several kinds of change are possible, e.g. new exceptional forms, new les with inflectional codles, a_gra additions, etc.

Impact on the LiLa lemma bank:

Flavio is the best person to make changes to LEMLAT, because he has a clear overall picture of the lemlat_db tables. Together with him I will schedule a time (hopefully once the COVID-19 emergency is over) for a campaign of: (a) updating LEMLAT with the new lemmas added to lila_db (identified by a src code that finds them unambiguously); (b) updating LEMLAT with the changes needed to deal with Neven's list.

As a reminder, new lemmas/wr are added to LiLa only if one of these conditions holds: