UCREL / pymusas

Python Multilingual Ucrel Semantic Analysis System
https://ucrel.github.io/pymusas/
Apache License 2.0

Create a spaCy version of the rule-based tagger #4

Closed apmoore1 closed 2 years ago

apmoore1 commented 3 years ago

We want the spaCy version to support multiple languages, following spaCy's language-specific factory setup. Because the rule-based tagger requires a lexicon resource, we need to load in data. To do this we are going to follow option 2: save data with the pipeline and load it in once on initialization. Option 2 was chosen because it allows us to ship the models without the user having to specify where the data has come from. In the shipped models we can state where the data has come from, along with the data's license, which will match the license of the model.

With option 1 we would either have to create download functions that save data to the user's file system, or we would have to ship the data with the Python package. The latter would require us to have different licenses for the code and the data even though they are in the same repository.
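To make the "save data with the pipeline, load it once on initialization" idea concrete, here is a minimal sketch in plain Python with no spaCy dependency. The class `LexiconTagger`, the file name `lexicon.json`, and the example tags are all hypothetical and are not the PyMUSAS implementation. The method names `initialize`, `to_disk` and `from_disk` are chosen to mirror the hooks a spaCy component can define, and `Z99` is assumed here as the tag for unmatched tokens.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class LexiconTagger:
    """Hypothetical rule-based tagger that owns its lexicon data."""

    def __init__(self, default_tag: str = "Z99") -> None:
        # The lexicon starts empty; it is filled by initialize() or from_disk().
        self.lexicon: Dict[str, List[str]] = {}
        self.default_tag = default_tag

    def initialize(self, lexicon: Dict[str, List[str]]) -> None:
        # Called once, when the pipeline is first built from raw resources.
        self.lexicon = dict(lexicon)

    def tag(self, tokens: List[str]) -> List[List[str]]:
        # Look each token up in the lexicon, falling back to the default tag.
        return [self.lexicon.get(t.lower(), [self.default_tag]) for t in tokens]

    def to_disk(self, path: Path) -> None:
        # Save the lexicon next to the rest of the pipeline.
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        (path / "lexicon.json").write_text(
            json.dumps(self.lexicon), encoding="utf-8"
        )

    def from_disk(self, path: Path) -> "LexiconTagger":
        # Restore the lexicon; the user never points at the original source.
        text = (Path(path) / "lexicon.json").read_text(encoding="utf-8")
        self.lexicon = json.loads(text)
        return self


# Build once from raw data, save, then reload without the raw data.
tagger = LexiconTagger()
tagger.initialize({"bank": ["I1.1", "W3"]})  # illustrative tags only
with tempfile.TemporaryDirectory() as tmp:
    tagger.to_disk(Path(tmp))
    restored = LexiconTagger().from_disk(Path(tmp))

print(restored.tag(["Bank", "xyzzy"]))
```

Because the data travels with the saved component, whoever loads the shipped model only needs the model directory, which is the property that motivated choosing option 2.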

apmoore1 commented 3 years ago

Whether to add the data for each language through the pymusas package, or whether it should be downloaded automatically when creating a resource.

apmoore1 commented 2 years ago

> Whether to add the data for each language through the pymusas package, or whether it should be downloaded automatically when creating a resource.

We are going to let users supply their own resources; for example, they could download USAS lexicon files from the Multilingual USAS repository.
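As a rough illustration of what "supplying your own resource" could look like, the sketch below reads a user-provided lexicon into a dictionary. It assumes a tab-separated file with a header row containing `lemma` and `semantic_tags` columns and an optional `pos` column. The function name `load_lexicon` and the `lemma|pos` key format are hypothetical choices made for this example, not a confirmed PyMUSAS API.

```python
import csv
import io
from typing import Dict, IO, List


def load_lexicon(handle: IO[str]) -> Dict[str, List[str]]:
    """Read a tab-separated lexicon into {key: [semantic tags]}.

    Assumed columns: `lemma`, `semantic_tags`, and optionally `pos`.
    """
    # QUOTE_NONE keeps quote characters inside lemmas as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    lexicon: Dict[str, List[str]] = {}
    for row in reader:
        lemma = (row.get("lemma") or "").strip()
        tags = (row.get("semantic_tags") or "").split()
        if not lemma or not tags:
            continue  # skip incomplete rows rather than guess
        pos = (row.get("pos") or "").strip()
        # Hypothetical key format: include the POS only when it is present.
        key = f"{lemma}|{pos}" if pos else lemma
        lexicon[key] = tags
    return lexicon


# In practice the handle would come from open(path, encoding="utf-8")
# on a file the user has downloaded; a string stands in for it here.
sample = "lemma\tpos\tsemantic_tags\nbank\tnoun\tI1.1 W3\nrun\t\tM1\n"
lexicon = load_lexicon(io.StringIO(sample))
print(lexicon)
```

Taking an open file handle rather than a URL keeps downloading out of the package, which matches the decision to leave fetching the lexicon to the user.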