NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org
Other
190 stars 41 forks source link

Fix crashing `index` command when targeted directory contains subject files #705

Closed juhoinkinen closed 1 year ago

juhoinkinen commented 1 year ago

The annif index command crashes when the targeted directory contains subject files (files with either *.tsv or *.key extensions):

annif index dummy-fi tests/corpora/archaeology/fulltext/
Traceback (most recent call last):
...
  File "/home/local/jmminkin/git/Annif/annif/corpus/subject.py", line 248, in from_string
    subject_ids.add(subject_index.by_uri(uri))
AttributeError: 'NoneType' object has no attribute 'by_uri'

The bug is due to this change done in PR #663.

I made the change because it seemed unnecessary and confusing to pass the subject index and language to DocumentDirectory when it is used only for obtaining documents to give suggestions to.

Reverting the change, i.e. passing subject index and language to DocumentDirectory is the simplest way to fix the bug, and on second thought maybe it also is better conceptually: corpora should have subject index and language defined, even if they are not used.

Also renames vars to better correspond their origin/usage: keyfile(name) -> subjfile(name).

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (2d1d3ce) 99.66% compared to head (b602ea2) 99.66%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #705 +/- ## ======================================= Coverage 99.66% 99.66% ======================================= Files 89 89 Lines 6288 6295 +7 ======================================= + Hits 6267 6274 +7 Misses 21 21 ``` | [Impacted Files](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | | |---|---|---| | [annif/cli.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvY2xpLnB5) | `100.00% <100.00%> (ø)` | | | [annif/corpus/document.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvY29ycHVzL2RvY3VtZW50LnB5) | `100.00% <100.00%> (ø)` | | | [tests/conftest.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvY29uZnRlc3QucHk=) | `100.00% <100.00%> (ø)` | | | [tests/test\_backend.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9iYWNrZW5kLnB5) | `100.00% <100.00%> (ø)` | | | [tests/test\_cli.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9jbGkucHk=) | `100.00% <100.00%> (ø)` | | | [tests/test\_corpus.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9jb3JwdXMucHk=) | `100.00% <100.00%> (ø)` | | | [tests/test\_project.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/705?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9wcm9qZWN0LnB5) | `100.00% <100.00%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/NatLibFi/Annif/pull/705/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

sonarcloud[bot] commented 1 year ago

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
18.2% 18.2% Duplication

san-uh commented 1 year ago

Dear Annif colleagues, thank you for already working on this problem. We in the DNB have recently started using Annif 0.61.0 and have now noticed that we cannot currently use an "index" command with this version, so we have to switch back to 0.60.0. Just one question: Is a bugfix release 0.61.1 planned or will the solution come with 1.0? Thanks and greetings, Sandro

juhoinkinen commented 1 year ago

Hi @san-uh,

This bug realized only when the targeted directory contains files with either .tsv or .key extension (as far as I'm aware). Does index command work for you if you remove these files from the targeted directory (if there are them)?

san-uh commented 1 year ago

Hello @juhoinkinen, thank you very much for the explanation. I deleted the *.tsv files that were previously in the directory and the index command now works. So with this workaround annif 0.61.0 including index is usable. Thanks for the feedback! Sandro