Use larger spaCy model

oliviadargel commented 2 years ago

User story

As an NLP component developer
I want to use a larger spaCy model
So that I can provide better information on locations

Acceptance criteria

A larger spaCy model is used
The Dockerfile is adjusted (it should download the model that is actually used)
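
One way to catch a mismatch between the model downloaded in the Dockerfile and the model the code loads is a small startup check. This is only a sketch: the constant `MODEL_NAME` and the helper `model_is_installed` are hypothetical names, not taken from the repository. It relies on the fact that spaCy models are installed as regular Python packages named after the model.

```python
import importlib.util

# Hypothetical single source of truth for the model name; the Dockerfile
# would have to download exactly this package.
MODEL_NAME = "de_core_news_sm"


def model_is_installed(name: str = MODEL_NAME) -> bool:
    """Return True if a top-level Python package with this name is importable."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if not model_is_installed():
        raise SystemExit(f"spaCy model '{MODEL_NAME}' is not installed in this image")
```

Running the check at container start would fail fast if the image was built with a different model than the one the NLP component expects.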

Definition of done

Code was reviewed by another person
GitHub CI runs successfully
Feature is merged into main branch

oliviadargel commented 2 years ago

This probably does not make sense, because simply switching to de_core_news_md or de_core_news_lg (instead of de_core_news_sm) causes NLP tests to fail. The md and lg models only perform correctly if the input is lowercase, which is undesirable because the result is then shown in lowercase in the frontend, too.

One workaround would be to use the built-in capitalize() function. A remaining disadvantage is that locations containing a "-" are not capitalized correctly (e.g. the input "Schleswig-Holstein" would result in "Schleswig-holstein").

Another solution would be to process the input twice with the md or lg model: once as the user entered it and once in lowercase. Locations are searched for in the lowercase input (generally as is done currently), but if a location is found, we use the token with the same index from the original user input. This would be more time-consuming and should therefore be discussed in the NLP team.

In any case, using the md or lg model results in at least one failing test, because one location that the sm model recognizes is no longer recognized.
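
To make the two workarounds concrete, here is a minimal sketch. The function names are hypothetical, and the second function deviates from the description above in one respect: it maps entities back via character offsets instead of token indices, so the model only has to run once on the lowercase text. It assumes that `nlp` behaves like a spaCy pipeline (entities expose `start_char`, `end_char` and `label_`) and that locations carry the label "LOC".

```python
def capitalize_location(name: str) -> str:
    """Capitalize every hyphen-separated part of a lowercase location name."""
    return "-".join(part.capitalize() for part in name.split("-"))


def locations_in_original_case(nlp, text: str, label: str = "LOC") -> list:
    """Detect locations on the lowercase text, return them as the user typed them.

    `nlp` is any callable that returns an object with an `ents` attribute,
    for example a loaded spaCy pipeline (assumption).
    """
    lowered = text.lower()
    # For a few non-German characters lower() changes the string length,
    # which would invalidate the offsets, so bail out in that case.
    if len(lowered) != len(text):
        raise ValueError("lowercasing changed the text length")
    doc = nlp(lowered)
    return [text[ent.start_char:ent.end_char]
            for ent in doc.ents if ent.label_ == label]


if __name__ == "__main__":
    print("schleswig-holstein".capitalize())          # hyphen problem
    print(capitalize_location("schleswig-holstein"))  # per-part capitalization
```

With a real pipeline, `nlp` would be the object returned by `spacy.load("de_core_news_md")`. The per-part capitalization still guesses the casing (multi-word names such as "frankfurt am main" are not handled), whereas the offset approach returns exactly what the user entered.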

[1] Link to all models
[2] Short explanation of spaCy models

oliviadargel commented 2 years ago

The NLP team has agreed to keep the small model de_core_news_sm for the time being.

amosproj / amos2021ws01-geo-data-search

Use larger spaCy model #208
