NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Automatically add metadata to Hugging Face Hub repos when uploading projects #793

Open juhoinkinen opened 3 weeks ago

juhoinkinen commented 3 weeks ago

With this PR, when running annif upload:

Closes #790.

The metadata includes these:

language:
- <language-code tags automatically obtained from the uploaded projects>
tags:
- annif   # custom tag
pipeline_tag: text-classification  # HFH tag

The Model Card text content is very minimal: it has just the repo name as the heading and instructions for downloading projects from the repo. See an example at https://huggingface.co/juhoinkinen/Annif-models-upload-testing.
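
For illustration, the metadata block listed above could be assembled roughly as follows. This is only a sketch in plain Python, not the code in this PR: the helper names `build_card_metadata` and `render_front_matter` are hypothetical, and the YAML is rendered by hand so that the example has no dependencies.

```python
def build_card_metadata(project_languages):
    """Collect the metadata fields listed above into a plain dict.

    `project_languages` is assumed to be an iterable of language codes
    taken from the uploaded projects (hypothetical input).
    """
    # Sort the unique language codes so the output is deterministic.
    languages = sorted(set(project_languages))
    return {
        "language": languages,
        "tags": ["annif"],  # custom tag
        "pipeline_tag": "text-classification",  # HFH tag
    }


def render_front_matter(metadata):
    """Render the dict as a YAML front-matter block (hand-rolled, no PyYAML)."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_front_matter(build_card_metadata(["fi", "en", "fi"])))
```

Sorting the deduplicated codes keeps the rendered block stable across repeated uploads, which makes diffs of the Model Card easier to read.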

juhoinkinen commented 3 weeks ago

About @osma's suggestions in https://github.com/NatLibFi/Annif/issues/790#issuecomment-2137376118:

For example it could include the Annif version used for training, the backend, vocabulary name and size, possibly some of the hyperparameters / configuration settings as well.

sonarcloud[bot] commented 3 weeks ago

Quality Gate passed

Issues
6 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.65%. Comparing base (3b5f7a1) to head (125565e).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #793   +/-   ##
=======================================
  Coverage   99.64%   99.65%
=======================================
  Files          91       91
  Lines        6817     6886   +69
=======================================
+ Hits         6793     6862   +69
  Misses         24       24
```


juhoinkinen commented 3 weeks ago

@CodiumAI-Agent /review

CodiumAI-Agent commented 3 weeks ago

PR Reviewer Guide 🔍

⏱️ Estimated effort to review [1-5] 3
🧪 Relevant tests Yes
🔒 Security concerns No
⚡ Key issues to review

Possible Bug: Ensure that the upsert_modelcard function handles cases where project language data might be missing or malformed. The current implementation assumes that proj.vocab_lang is always available and valid.

Data Integrity: The merging of languages in upsert_modelcard should handle duplicates and potential case sensitivity issues to avoid incorrect language tags in the Model Card.

juhoinkinen commented 3 weeks ago

> Possible Bug: Ensure that the upsert_modelcard function handles cases where project language data might be missing or malformed. The current implementation assumes that proj.vocab_lang is always available and valid.

Good point by the AI, but I think the project language is always set if this point is reached...?
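
For reference, the duplicate and case-sensitivity concern from the review could be handled along these lines. This is a minimal sketch, not the actual upsert_modelcard code: `merge_languages` is a hypothetical helper, and lowercasing the codes is an assumption about how the tags should be normalized.

```python
def merge_languages(existing, new):
    """Merge two lists of language codes, skipping blanks and duplicates.

    Both arguments may be None or contain malformed (non-string) entries;
    those are ignored rather than raising an error.
    """
    merged = []
    seen = set()
    for code in list(existing or []) + list(new or []):
        if not isinstance(code, str):
            continue  # ignore malformed entries such as None
        # Assumption: codes are compared case-insensitively and stored lowercase.
        normalized = code.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            merged.append(normalized)
    return sorted(merged)


if __name__ == "__main__":
    # Languages already in the Model Card merged with those of new projects.
    print(merge_languages(["fi", "EN"], ["en", None, " sv "]))
```

If the project language really is always set at that point, the guards against missing values are redundant but harmless.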