NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Debug ValueError by Voikko analyzer #738

Open juhoinkinen opened 8 months ago

juhoinkinen commented 8 months ago

This PR adds logging to help debug why the Voikko analyzer raises a ValueError (#737).

When this is merged, the latest Docker image could be deployed to ai.dev.finto.fi, which could then be used to investigate the situation.

Is there anything else that could be logged, in addition to the triggering word, @osma?
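As a rough illustration of logging the triggering word, here is a minimal sketch. It is not the code in this PR: `normalize_with_logging` and `failing_analyze` are hypothetical names, and `failing_analyze` is only a stand-in for the real Voikko call, failing on an arbitrary trigger.

```python
import logging

logger = logging.getLogger("annif.sketch")


def normalize_with_logging(analyze, word):
    """Call analyze(word); if it raises ValueError, log the word and re-raise."""
    try:
        return analyze(word)
    except ValueError:
        # %r shows invisible or unusual characters that plain output would hide
        logger.warning("ValueError while analyzing word %r", word)
        raise


def failing_analyze(word):
    """Hypothetical stand-in for the analyzer call; the trigger is arbitrary."""
    if word == "boom":
        raise ValueError(f"cannot analyze {word!r}")
    return word.lower()
```

Re-raising keeps the existing behaviour unchanged, so the only effect is one extra log line per failure.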

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (ff1d32c) 99.67% compared to head (7560fff) 99.67%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #738   +/-   ##
=======================================
  Coverage   99.67%   99.67%
=======================================
  Files          89       89
  Lines        6404     6429   +25
=======================================
+ Hits         6383     6408   +25
  Misses         21       21
```

| [Files](https://app.codecov.io/gh/NatLibFi/Annif/pull/738?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | |
|---|---|---|
| [annif/analyzer/analyzer.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/738?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYW5hbHl6ZXIvYW5hbHl6ZXIucHk=) | `100.00% <100.00%> (ø)` | |
| [annif/analyzer/voikko.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/738?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYW5hbHl6ZXIvdm9pa2tvLnB5) | `100.00% <100.00%> (ø)` | |
| [tests/test\_analyzer.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/738?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9hbmFseXplci5weQ==) | `100.00% <100.00%> (ø)` | |
| [tests/test\_analyzer\_voikko.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/738?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9hbmFseXplcl92b2lra28ucHk=) | `100.00% <100.00%> (ø)` | |


osma commented 8 months ago

I think it would be helpful to be able to log more contextual information, but this might be a bit difficult since it would require accessing the call stack. At least in the traceback in #737 it seems that the error got triggered via the MLLM backend. So in that case it would be good to log the sentence that was being tokenized and maybe even the whole document text.
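One possible way to make the surrounding text available at the point of failure without inspecting the call stack is a context variable that the caller sets before tokenizing. This is only a sketch of that idea, not necessarily how this PR or the MLLM backend does it; `current_text`, `tokenize` and `analyze_word` are hypothetical names, and the failure trigger is arbitrary.

```python
import contextvars
import logging

logger = logging.getLogger("annif.sketch")

# Holds the text currently being tokenized; None when nothing is in progress
current_text = contextvars.ContextVar("current_text", default=None)


def analyze_word(word):
    """Hypothetical low-level analyzer; fails on an arbitrary trigger word."""
    if word == "boom":
        # The low-level code can read the context without any extra arguments
        logger.warning(
            "ValueError for word %r; text being tokenized: %r",
            word,
            current_text.get(),
        )
        raise ValueError(f"cannot analyze {word!r}")
    return word.lower()


def tokenize(text):
    """Hypothetical caller: publish the text, then analyze it word by word."""
    token = current_text.set(text)
    try:
        return [analyze_word(w) for w in text.split()]
    finally:
        # Always restore the previous value, even when an error propagates
        current_text.reset(token)
```

The trade-off is a small amount of hidden state in exchange for not changing the analyzer's function signatures.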

sonarcloud[bot] commented 8 months ago

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 1 (rating A)
- Coverage: no coverage information
- Duplication: 0.0%

juhoinkinen commented 8 months ago

Now the text being tokenized (full document) should be logged too.

juhoinkinen commented 8 months ago

I built a Docker image from this branch, deployed it to ai.dev.finto.fi and started sending suggest requests continuously with Annif-client (using the contents of documents from the Kirjastonhoitaja dataset). Hopefully the error occurs and we get some more clues about it.

I thought about using a fuzzing tool, but this was an easy way to get started.

Maybe this PR does not even need to be merged to main, if the root cause of the error and a fix for it can be found.
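The continuous suggest requests described above could be driven by a loop along these lines. This is a sketch rather than the script actually used: `replay_documents` is a hypothetical helper, and `suggest` is assumed to be any callable that sends one text to the service, for example a thin wrapper around the Annif-client suggest call for a given project.

```python
import itertools
import logging

logger = logging.getLogger("annif.sketch")


def replay_documents(suggest, texts, max_requests):
    """Send texts to suggest() in a cycle; return the texts whose request failed."""
    failures = []
    # cycle() repeats the documents; islice() bounds the total number of requests
    for text in itertools.islice(itertools.cycle(texts), max_requests):
        try:
            suggest(text)
        except Exception as err:
            # Keep going after a failure so that later errors are recorded too
            logger.warning("Request failed (%s) for text starting %r", err, text[:200])
            failures.append(text)
    return failures
```

Collecting the failing texts makes it possible to resend them individually and check whether the error is reproducible for a given document.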

osma commented 8 months ago

I think that text extracted from random PDFs, ideally using different text extraction tools, might be a better (more fuzzy) data source. But this is a good start, let's see if it will trigger the bug.
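If real PDF extractions are not at hand, a cheap approximation is to inject characters of the kind that text extraction tends to leave behind into clean documents. This swaps in synthetic noise for genuine extracted text, so it is only a rough substitute; the `ARTIFACTS` list and `add_extraction_noise` are illustrative assumptions, not part of Annif.

```python
import random

# Characters of the kind often left behind by text extraction: soft hyphen,
# zero-width space, "fi" ligature, replacement character, form feed
ARTIFACTS = ["\u00ad", "\u200b", "\ufb01", "\ufffd", "\x0c"]


def add_extraction_noise(text, rate=0.05, seed=None):
    """Insert a random artifact after roughly `rate` of the characters in text."""
    rng = random.Random(seed)  # seeded so a failing input can be regenerated
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(ARTIFACTS))
    return "".join(out)
```

Passing a fixed `seed` means that any input which triggers the error can be reproduced exactly later.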