flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Feature]: Latin NLP Model #3391

Open ch-sander opened 8 months ago

ch-sander commented 8 months ago

Problem statement

Classical languages such as Latin mostly take a back seat when it comes to NLP (for obvious reasons, though).

Solution

The spaCy model LatinCy has shown how well a Latin NLP model can perform. Is there any effort planned towards a Latin model within this project, or would there be any support if a third party set out to build such a model?

Additional Context

No response

stefan-it commented 8 months ago

Hi @ch-sander ,

I think this is a very useful feature request! After having a look at the spaCy model for Latin on the Model Hub, for PoS Tagging the following repos from Universal Dependencies are used:

As far as I can see, only UD_Latin-LLCT is directly supported in Flair:

https://github.com/flairNLP/flair/blob/ddf3bb3e44f2a68b32d532ae5438d71c4125e4ab/flair/datasets/treebanks.py#L542-L562

The other datasets can easily be added to Flair (I have assigned the issue to myself).
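For context on what adding a treebank involves: UD treebanks are distributed as CoNLL-U files, with one token per line and ten tab-separated columns. The following is a minimal plain-Python sketch of reading word/UPOS pairs from such a file. It is not Flair's own loader; the helper name `read_upos` and the three-token sample sentence are made up for illustration.

```python
from typing import List, Tuple

# Illustrative CoNLL-U snippet (hand-written sample, not taken from any treebank).
SAMPLE = (
    "# text = Gallia est divisa\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t3\tnsubj:pass\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t3\taux:pass\t_\t_\n"
    "3\tdivisa\tdivido\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def read_upos(conllu_text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (form, upos) pairs per sentence."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." carry no tokens.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            continue
        current.append((cols[1], cols[3]))  # column 2 = FORM, column 4 = UPOS
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    print(read_upos(SAMPLE))
```

A dataset class in Flair would wrap this kind of file with its download and caching logic, so the sketch only shows the data format that a PoS tagger is trained on.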

For NER, I was unfortunately not able to find the training dataset that was used for LatinCy. It should be located here, but it is currently not available. So I am pinging @diyclassics for help on NER :)

When these resources are available and integrated into Flair, it should be very easy to train models on them. E.g. PoS tagging and NER models can be trained with LMs like Latin BERT as the backbone.
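As a rough sketch of what such a training run could look like for PoS tagging, assuming a recent Flair release and the existing `UD_LATIN` corpus class: the function name, the output path, and the hyperparameters below are illustrative choices, and `model_name` has to be supplied by the caller (a Hugging Face hub ID or local path of a Latin transformer; none is assumed here).

```python
def train_latin_upos_tagger(model_name: str,
                            output_dir: str = "resources/taggers/latin-upos",
                            max_epochs: int = 10) -> None:
    """Fine-tune a transformer-based UPOS tagger on Flair's Latin UD corpus.

    `model_name` is whatever Latin language model the caller wants to use;
    this sketch does not assume any particular checkpoint exists.
    """
    # Imported lazily so that defining the function does not require Flair.
    from flair.datasets import UD_LATIN
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # Downloads the treebank on first use.
    corpus = UD_LATIN()

    label_type = "upos"
    label_dict = corpus.make_label_dictionary(label_type=label_type)

    # Fine-tunable transformer embeddings as the backbone.
    embeddings = TransformerWordEmbeddings(
        model=model_name,
        layers="-1",
        subtoken_pooling="first",
        fine_tune=True,
    )

    # Plain linear layer on top of the transformer (no RNN, no CRF).
    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=label_dict,
        tag_type=label_type,
        use_crf=False,
        use_rnn=False,
        reproject_embeddings=False,
    )

    trainer = ModelTrainer(tagger, corpus)
    trainer.fine_tune(
        output_dir,
        learning_rate=5e-5,   # illustrative value, not tuned
        mini_batch_size=4,    # illustrative value, not tuned
        max_epochs=max_epochs,
    )
```

An NER model would follow the same pattern with a different corpus and `label_type`, once a Latin NER dataset is available in Flair.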

ch-sander commented 8 months ago

This sounds awesome! Thanks!

It would be promising to also involve https://github.com/CIRCSE and their many efforts related to the LiLa project @passarom. If I'm right, they also included more Medieval Latin than @diyclassics's model.