arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

A new look on Handrit.is data
https://arbeitsgruppe-digitale-altnordistik.github.io/Sammlung-Toole/
MIT License
0 stars 0 forks source link

feat: unify manuscript data #121

Closed BalduinLandolt closed 2 years ago

BalduinLandolt commented 2 years ago

resolves #38

kraus-s commented 2 years ago

Awesome! I'll have a closer look tomorrow.

BalduinLandolt commented 2 years ago

Awesome! I'll have a closer look tomorrow.

not yet there, I'm afraid... just wanted to see the test runs, so I drafted the PR. But it's taking shape... :)

codecov-commenter commented 2 years ago

Codecov Report

Base: 44.23% // Head: 47.91% // Increases project coverage by +3.67% :tada:

Coverage data is based on head (738737b) compared to base (36665dd). Patch coverage: 28.41% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #121 +/- ## ========================================== + Coverage 44.23% 47.91% +3.67% ========================================== Files 25 26 +1 Lines 1519 1509 -10 ========================================== + Hits 672 723 +51 + Misses 847 786 -61 ``` | [Impacted Files](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik) | Coverage Δ | | |---|---|---| | [tests/unit/lib/xml/test\_metadata.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-dGVzdHMvdW5pdC9saWIveG1sL3Rlc3RfbWV0YWRhdGEucHk=) | `100.00% <ø> (ø)` | | | [src/lib/xml/metadata.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi94bWwvbWV0YWRhdGEucHk=) | `8.12% <8.51%> (-0.81%)` | :arrow_down: | | [src/lib/xml/tamer.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi94bWwvdGFtZXIucHk=) | `18.54% <18.42%> (+5.37%)` | :arrow_up: | | [src/lib/datahandler.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi9kYXRhaGFuZGxlci5weQ==) | `21.38% <19.04%> (-1.24%)` | :arrow_down: | | [src/lib/database/database.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi9kYXRhYmFzZS9kYXRhYmFzZS5weQ==) | `35.13% <25.80%> (+0.46%)` | :arrow_up: | | [src/lib/database/db\_init.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi9kYXRhYmFzZS9kYl9pbml0LnB5) | `25.00% <33.33%> (-5.91%)` | :arrow_down: | | [src/lib/database/deduplicate.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-c3JjL2xpYi9kYXRhYmFzZS9kZWR1cGxpY2F0ZS5weQ==) | `41.81% <41.81%> (ø)` | | | [tests/unit/lib/xml/test\_tamer.py](https://codecov.io/gh/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik#diff-dGVzdHMvdW5pdC9saWIveG1sL3Rlc3RfdGFtZXIucHk=) | `95.00% <95.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=arbeitsgruppe-digitale-altnordistik)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

BalduinLandolt commented 2 years ago

@kraus-s the core feature is implemented, though it surly can still be improved. But the unification should be - at least to some extent - reasonable.
There is probably some clean-up still to do, just ignore that for now. ;-)

What I have not done at all, is seeing if we need junction tables for the unified manuscript data; and implementing ways to query that data. (If we should even still be able to query the un-unified data in the first place? maybe unified data is enough?)
I'm wondering if I should postpone that to a next PR, so nothing is blocked. What do you think?

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 26 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

BalduinLandolt commented 2 years ago

I now changed it, so that the search functions actually use the unified data.
Additionally I triaged a bug in the new XML processing, which meant that only 1/30 texts were found in the manuscript descriptions... but we'll have to have a very close look at the xml data extraction. I'm under the impression that considerable regressions have happened there.

kraus-s commented 2 years ago

Regressions in the xml extraction as in the new way we implemented lxml instead of beautiful soup? If so, I'll have a look

Balduin Landolt @.***> schrieb am So., 2. Okt. 2022, 16:52:

I now changed it, so that the search functions actually use the unified data. Additionally I triaged a bug in the new XML processing, which meant that only 1/30 texts were found in the manuscript descriptions... but we'll have to have a very close look at the xml data extraction. I'm under the impression that considerable regressions have happened there.

— Reply to this email directly, view it on GitHub https://github.com/arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121#issuecomment-1264661739, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANZXABDMPDNCB7BMR5BGSBLWBGOSJANCNFSM6AAAAAAQYECEJM . You are receiving this because you were mentioned.Message ID: <arbeitsgruppe-digitale-altnordistik/Sammlung-Toole/pull/121/c1264661739@ github.com>

BalduinLandolt commented 2 years ago

Regressions in the xml extraction as in the new way we implemented lxml instead of beautiful soup? If so, I'll have a look

yes... I can't pip point it yet... but I think we're skipping extremely many searches (sometimes with, sometimes without logging a warning) and accept 0 or "N/A" where previously we got real results.
The instance I actually found, was that due to some logic being placed in a wrong if-statement, we only found ca. 2k texts in manuscripts, instead of 60k, which resulted in a tiny junction table and thus not finding any results for most "search text by manuscript" searches. But that one is fixed.

But I don't think this is the highest priority right now.