amosproj / amos2021ws01-geo-data-search

Natural language and buzzword search on routing and places
MIT License

Use larger spaCy model #208

Closed oliviadargel closed 2 years ago

oliviadargel commented 2 years ago

User story

  1. As an NLP component developer
  2. I want to use a larger spaCy model
  3. So that I can provide better information on locations

Acceptance criteria

Definition of done

oliviadargel commented 2 years ago

This probably does not make sense: simply switching from de_core_news_sm to de_core_news_md or de_core_news_lg results in failing NLP tests. The md and lg models only perform correctly if the input is lowercase, which is not ideal because the result is then also shown in lowercase in the frontend.

One workaround would be Python's built-in capitalize() method. A remaining disadvantage is that locations containing a "-" are not capitalized correctly (e.g. the input "Schleswig-Holstein" would result in "Schleswig-holstein").

Another solution would be to process the input twice with the md or lg model: once as the user entered it and once in lowercase. Locations are searched for in the lowercase input (much as is done currently), but when a location is found, we use the token with the same index from the original user input. This would be more time-consuming and should therefore be discussed in the NLP team.

In any case, using the md or lg model results in at least one failing test, because one location that the sm model recognizes is not recognized by the larger models.
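To make the two workarounds concrete, here is a minimal sketch in plain Python. It does not load spaCy: `toy_recognizer` and `KNOWN_LOCATIONS` are hypothetical stand-ins for the md/lg model's location detection, `capitalize_parts` and `find_locations` are illustrative names that do not come from the project, and whitespace splitting replaces the real tokenizer. It also assumes that lowercasing does not change how the input is tokenized, which would need to be checked against spaCy's tokenizer.

```python
from typing import Callable, List


def capitalize_parts(name: str) -> str:
    """Capitalize every hyphen-separated part of a single-word location name."""
    return "-".join(part.capitalize() for part in name.split("-"))


def find_locations(
    text: str,
    recognize: Callable[[List[str]], List[int]],
) -> List[str]:
    """Detect locations on lowercased tokens, but return the user's spelling.

    `recognize` receives the lowercased tokens and returns the indices of
    the tokens it considers locations.
    """
    original_tokens = text.split()
    lowered_tokens = [token.lower() for token in original_tokens]
    # Lowercasing keeps the number of whitespace tokens unchanged,
    # so an index found in the lowercase pass is valid in the original.
    return [original_tokens[i] for i in recognize(lowered_tokens)]


# Hypothetical stand-in for the model: a lookup in a fixed lowercase set.
KNOWN_LOCATIONS = {"berlin", "schleswig-holstein"}


def toy_recognizer(tokens: List[str]) -> List[int]:
    return [i for i, token in enumerate(tokens) if token in KNOWN_LOCATIONS]


if __name__ == "__main__":
    # Workaround 1: plain capitalize() lowercases everything after the first letter.
    print("schleswig-holstein".capitalize())
    # A hyphen-aware variant restores the expected spelling.
    print(capitalize_parts("schleswig-holstein"))
    # Workaround 2: detect in lowercase, report the original tokens.
    print(find_locations("Route von Berlin nach Schleswig-Holstein", toy_recognizer))
```

The hyphen-aware variant would still mis-case multi-word names such as "Frankfurt am Main", which is one reason the index-based approach may be preferable despite the second model pass.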

[1] Link to all models
[2] Short explanation of spaCy models

oliviadargel commented 2 years ago

The NLP team has agreed to keep the small model de_core_news_sm for the time being.