NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Add language detection to REST API #659

Open UnniKohonen opened 1 year ago

UnniKohonen commented 1 year ago

This PR adds the ability to detect the language of a text to the REST API. The language detection uses the simplemma python library.

A POST method is added at the endpoint /detect-language. It expects the request body to be a JSON object containing the text whose language is to be detected and a list of candidate languages, given as ISO 639-1 codes. For example:

{
  "candidates": ["en", "fi"],
  "text": "A quick brown fox jumped over the lazy dog."
}

The response is a JSON object of the form:

{
  "results": [
    {"language": "en", "score": 0.85},
    {"language": "fi", "score": 0.1},
    {"language": null, "score": 0.1} 
  ]
}

where the scores range from 0 to 1 and a null value is used for an unknown language.
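The mapping from detector output to this response body can be sketched as follows. This is an illustrative helper, not code from the PR; it assumes the detector reports an `"unk"` pseudo-language for text matching none of the candidates (as Simplemma-style detectors commonly do), which is mapped to the `null` value described above:

```python
def format_detection_results(raw: list[tuple[str, float]]) -> dict:
    """Convert (language, score) pairs into the endpoint's response body.

    The "unk" pseudo-language is mapped to None (serialized as JSON null),
    denoting an unknown language. Illustrative sketch only.
    """
    results = []
    for lang, score in raw:
        results.append({
            "language": None if lang == "unk" else lang,
            "score": round(score, 2),
        })
    return {"results": results}

# Example: raw scores as a detector might return them
body = format_detection_results([("en", 0.85), ("fi", 0.1), ("unk", 0.1)])
```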

Implements #631

codecov[bot] commented 1 year ago

Codecov Report

Base: 99.55% // Head: 99.53% // Decreases project coverage by 0.02% :warning:

Coverage data is based on head (34c2538) compared to base (45be34d). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #659      +/-   ##
==========================================
- Coverage   99.55%   99.53%   -0.02%
==========================================
  Files          87       87
  Lines        6006     6047      +41
==========================================
+ Hits         5979     6019      +40
- Misses         27       28       +1
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| annif/rest.py | `97.05% <100.00%> (+0.69%)` | :arrow_up: |
| tests/test_rest.py | `100.00% <100.00%> (ø)` | |
| annif/backend/omikuji.py | `96.15% <0.00%> (-1.29%)` | :arrow_down: |
| tests/test_config.py | `100.00% <0.00%> (ø)` | |


juhoinkinen commented 1 year ago

I see this is only a draft at the moment, but I took a glance and I think it would be better if the endpoint name had a hyphen instead of an underscore (/detect_language -> /detect-language); that seems to be the preferred convention.

osma commented 1 year ago

Thanks for adding tests. A few more things:

  1. There should be tests for special cases, e.g. empty input, no candidate languages, unknown or malformed candidate languages...
  2. What if there are 20 candidate languages? How much memory will it take to handle the query? Will the memory be released afterwards? (I think not, Simplemma keeps models in memory AFAIK)
UnniKohonen commented 1 year ago

Right now, when making a request with no candidates or with unknown candidates, the endpoint returns an empty list, and when making a request with no text, it returns { "language": null, "score": 1 }. Does this make sense? Or should it always return the unknown language with score 1 when the input is incorrect?

I also tested making a request with all 48 possible candidate languages. I had about 4 GB of free memory, which was almost completely used after making the request. The memory isn't released automatically afterwards, but it is freed if the endpoint is accessed again (simplemma is run again) or Annif is restarted. Other requests also slow down a lot after running simplemma with all candidates.

osma commented 1 year ago

Right now, when making a request with no candidates or unknown candidates, the endpoint returns an empty list

The good news is that it's not crashing! :grin:

My opinions on these cases:

  1. With no candidates given, I think it would be good to return a 400 Bad Request status code, with a descriptive error message.

  2. With unknown candidates (language codes that are unrecognized/unsupported by simplemma), I think a 400 Bad Request with a descriptive error message would also be appropriate.

There should be unit tests to check that these are indeed the results.

and when making a request with no text, it returns { "language": null, "score": 1 }. Does this make sense? Or should it always return the unknown language with score 1 when the input is incorrect?

I think this is OK, but there should also be a unit test for this special case.
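The two 400 Bad Request cases above could be caught by a small validation helper that runs before any models are loaded. This is an illustrative sketch, not code from the PR; the SUPPORTED set is a placeholder for whatever language codes simplemma actually supports:

```python
from typing import Optional

# Placeholder set; in practice this would come from simplemma itself
SUPPORTED = {"en", "fi", "sv", "de", "fr"}

def validate_detection_request(body: dict) -> Optional[str]:
    """Return an error message suitable for a 400 response,
    or None if the request body is valid."""
    candidates = body.get("candidates") or []
    if not candidates:
        return "no candidate languages given"
    unknown = [c for c in candidates if c not in SUPPORTED]
    if unknown:
        return "unsupported candidate languages: " + ", ".join(unknown)
    return None
```

Unit tests for the special cases would then simply assert on the returned message (or its absence).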

I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory which was almost completely used after making the request. The memory isn't released automatically afterwards but it is freed if the endpoint is accessed again (simplemma is run again) or annif is restarted. Making other requests also slows down a lot after running simplemma with all candidates.

Great, thanks for testing! This is mostly what I suspected, although it's a surprise that accessing the endpoint again will free the memory. (Maybe this has to do with Flask running in development mode?)

This has some potential for DoS situations (intended or not), but I guess it's hard for us to avoid that given how Simplemma works. We could, however, limit the number of candidate languages per request to, say, at most 5. What do others think? @juhoinkinen ?

We could also try to work with the Simplemma maintainer if we want to change the way Simplemma allocates and releases models. For example, it could be possible to ask Simplemma to release the memory immediately or after a set period like 60 seconds after use.
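The "release after a set period" idea could be prototyped on the caller's side with a small time-based cache. This is purely a sketch under the assumption that models can be loaded per language via some loader callable; Simplemma manages its models internally and offers no such hook as far as this thread knows:

```python
import time

class ExpiringModelCache:
    """Keep loaded models only for `ttl` seconds after last use.

    Hypothetical sketch of the time-based release idea; `loader` is
    any callable that loads a model for a given language code.
    """

    def __init__(self, loader, ttl: float = 60.0):
        self._loader = loader
        self._ttl = ttl
        self._cache: dict[str, tuple[object, float]] = {}

    def get(self, lang: str):
        self._evict()
        if lang in self._cache:
            model, _ = self._cache[lang]
        else:
            model = self._loader(lang)
        self._cache[lang] = (model, time.monotonic())  # refresh timestamp
        return model

    def _evict(self) -> None:
        now = time.monotonic()
        expired = [l for l, (_, t) in self._cache.items() if now - t > self._ttl]
        for lang in expired:
            del self._cache[lang]
```

Eviction here only happens on access; a background timer would be needed to free memory while the endpoint is idle.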

juhoinkinen commented 1 year ago

I also tested making a request with all 48 possible language candidates. I had about 4 GB of free memory which was almost completely used after making the request. The memory isn't released automatically afterwards but it is freed if the endpoint is accessed again (simplemma is run again) or annif is restarted. Making other requests also slows down a lot after running simplemma with all candidates.

Great, thanks for testing! This is mostly what I suspected, although it's a surprise that accessing the endpoint again will free the memory. (Maybe this has to do with Flask running in development mode?)

This has some potential for DoS situations (intended or not), but I guess it's hard for us to avoid that given how Simplemma works. We could, however, limit the number of candidate languages per request to, say, at most 5. What do others think? @juhoinkinen ?

Limiting the number of candidate languages seems reasonable. If there is no simple way to make the limit configurable, 5 could be a good number for that.
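A hard cap along these lines is a one-line check. The sketch below makes it configurable through a hypothetical ANNIF_MAX_DETECT_LANGUAGES environment variable (not an existing Annif setting), defaulting to the 5 suggested above:

```python
import os
from typing import Optional

# Hypothetical configuration knob; defaults to the limit of 5 discussed above
MAX_CANDIDATES = int(os.environ.get("ANNIF_MAX_DETECT_LANGUAGES", "5"))

def check_candidate_limit(candidates: list[str]) -> Optional[str]:
    """Return an error message for a 400 response if too many
    candidate languages were given, or None if within the limit."""
    if len(candidates) > MAX_CANDIDATES:
        return (f"too many candidate languages: {len(candidates)} "
                f"(at most {MAX_CANDIDATES} allowed)")
    return None
```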

We could also try to work with the Simplemma maintainer if we want to change the way Simplemma allocates and releases models. For example, it could be possible to ask Simplemma to release the memory immediately or after a set period like 60 seconds after use.

I noticed there is an issue in Simplemma repository about loading models to memory, which was opened just yesterday.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

- Bugs: A (0 bugs)
- Vulnerabilities: A (0 vulnerabilities)
- Security Hotspots: A (0 security hotspots)
- Code Smells: A (2 code smells)
- No coverage information
- Duplication: 0.0%

juhoinkinen commented 1 year ago

Just started to think whether some testing could also be performed in tests/test_swagger.py. I don't remember what additional functionality the tests in test_swagger.py cover compared to those in test_rest.py (if any).

juhoinkinen commented 1 year ago

Just started to think whether some testing could also be performed in tests/test_swagger.py. I don't remember what additional functionality the tests in test_swagger.py cover compared to those in test_rest.py (if any).

Background to the question of test_swagger.py vs test_rest.py: https://github.com/NatLibFi/Annif/pull/551#issuecomment-1011063687