Open fmendezh opened 3 months ago
Client libraries need to be Java 11.
https://docs.stackable.tech/home/stable/spark-k8s/index.html
v3.5.0 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11)
3.4.1 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11)
3.4.0 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11) (deprecated)
Work in progress on this branch (matching-ws module):
https://github.com/CatalogueOfLife/backend/tree/matching-ws
Notes so far:
Rank and other vocab enums)
For the mapDB classification store, look into UsageCache and its implementations. That might be immediately reusable.
A first-pass version of the ported matching is running on the backbonebuild machine, visible through the GBIF impact tool here:
https://www.dev.checklistbank.org/tools/gbif-impact?csv=all-new-53147.txt&colKey=53147
Excellent! Is the code on some branch to look at? There are some changes reported by the impact tool. Should there be none? Is the current one being compared the gbif prod live nub matching service?
Yes! The WIP code is on this branch:
https://github.com/CatalogueOfLife/backend/tree/matching-ws
I still need to bring across the test suite
Should there be none?
Yes, I think so. Most of the differences (about 80-90%) seem to be in authorships, but the matching itself differs in around 500 cases. If you filter by changes in family, you can see some oddities that I think are bugs in the port.
Is the current one being compared the gbif prod live nub matching service?
Yes. The comparison is with prod occurrence data, which I'm assuming has been matched to the data in https://www.checklistbank.org/dataset/53147/about
I've switched to using a new pipeline for comparison (see #1313). Ignoring matches that differ only in authorship and focusing on cases where the higher taxonomy (i.e. family) changes, there are approximately 350 differences:
https://www.dev.checklistbank.org/tools/gbif-impact?csv=all-services-53147.txt&colKey=53147
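For anyone wanting to reproduce this kind of triage locally, here is a minimal sketch of how comparison rows could be bucketed into authorship-only differences versus changes in family. The column names (`old_name`, `new_family`, etc.) and the tab-delimited layout are assumptions for illustration; the actual format of the impact tool's input file is not shown in this thread, and the sample rows use placeholder names, not real results.

```python
import csv
import io
from collections import Counter


def classify(row):
    """Label one comparison row by the most significant kind of change.

    Column names are hypothetical; adapt them to the real file layout.
    """
    if row["old_family"] != row["new_family"]:
        return "family_changed"
    if row["old_name"] != row["new_name"]:
        return "name_changed"
    if row["old_authorship"] != row["new_authorship"]:
        return "authorship_only"
    return "unchanged"


def summarize(tsv_text):
    """Count rows per change category in a tab-delimited comparison export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(classify(row) for row in reader)


# Placeholder rows (invented for the example, not taken from the impact tool).
SAMPLE = (
    "old_name\told_authorship\told_family\tnew_name\tnew_authorship\tnew_family\n"
    "Aus bus\tL.\tAidae\tAus bus\tLinnaeus\tAidae\n"
    "Aus cus\tSmith\tAidae\tAus cus\tSmith\tBidae\n"
    "Aus dus\tSmith\tAidae\tAus dus\tSmith\tAidae\n"
)

if __name__ == "__main__":
    for category, count in summarize(SAMPLE).items():
        print(category, count)
```

Checking family before authorship means a row that differs in both is counted once, as a family change, which matches the idea of ignoring authorship noise and reviewing the higher-taxonomy changes first.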
Working through these issues now.
Some of those might even be better results. The Typhlocybinae example is odd. It was matched correctly before, but to something without a classification. Now it only hits the family, but it matches the input classification well. Sometimes it's hard to decide what result we actually want...
The checklistbank-nub-ws must be ported to Catalogue Of Life/Checklistbank following these considerations: