CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Port the GBIF nub-ws to CoL as a module #1302

Open fmendezh opened 3 months ago

fmendezh commented 3 months ago

The checklistbank-nub-ws must be ported to Catalogue Of Life/Checklistbank following these considerations:

djtfmartin commented 3 months ago

Client libraries need to target Java 11, the Java version in the Stackable Spark images listed below:

https://docs.stackable.tech/home/stable/spark-k8s/index.html
3.5.0 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11)
3.4.1 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11)
3.4.0 (Hadoop 3.3.4, Scala 2.12, Python 3.11, Java 11) (deprecated)

djtfmartin commented 3 months ago

Work in progress on this branch (matching-ws module):

https://github.com/CatalogueOfLife/backend/tree/matching-ws

Notes so far:

mdoering commented 3 months ago

For the MapDB classification store, look into UsageCache and its implementations. They might be immediately reusable.
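To make the idea of a classification store concrete, here is a minimal sketch of a keyed store that resolves a usage's classification by walking parent keys. It is not the actual UsageCache API: the class `ClassificationStoreSketch`, the `CachedUsage` fields and the `classification` method are all hypothetical, and a plain `HashMap` stands in for a disk-backed MapDB map. The syntax stays Java 11 compatible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: names and fields are assumptions, not the real UsageCache API.
class ClassificationStoreSketch {

  /** Minimal cached usage: its own key, the parent's key (null at the root), rank and name. */
  static final class CachedUsage {
    final String id;
    final String parentId;
    final String rank;
    final String name;

    CachedUsage(String id, String parentId, String rank, String name) {
      this.id = id;
      this.parentId = parentId;
      this.rank = rank;
      this.name = name;
    }
  }

  // Assumption: an in-memory HashMap stands in for a persistent, disk-backed map.
  private final Map<String, CachedUsage> usages = new HashMap<>();

  void put(CachedUsage usage) {
    usages.put(usage.id, usage);
  }

  /** Returns the classification from the root down to the usage itself, or an empty list if the key is unknown. */
  List<CachedUsage> classification(String id) {
    List<CachedUsage> chain = new ArrayList<>();
    CachedUsage current = usages.get(id);
    // Cap the walk at the store size so a parent cycle in bad data cannot loop forever.
    int guard = usages.size();
    while (current != null && guard-- > 0) {
      chain.add(0, current); // prepend so the root ends up first
      current = current.parentId == null ? null : usages.get(current.parentId);
    }
    return chain;
  }
}
```

Keeping only the parent key per usage keeps each entry small; the full classification is then assembled on demand at lookup time.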

djtfmartin commented 2 months ago

A first-pass version of the ported matching service is running on the backbonebuild machine. The results are visible through the GBIF impact tool here:

https://www.dev.checklistbank.org/tools/gbif-impact?csv=all-new-53147.txt&colKey=53147

mdoering commented 2 months ago

Excellent! Is the code on some branch to look at? There are some changes reported by the impact tool. Should there be none? Is the current one being compared the gbif prod live nub matching service?

djtfmartin commented 2 months ago

Yes! WIP code is on this branch:

https://github.com/CatalogueOfLife/backend/tree/matching-ws

I still need to bring the test suite across.

Should there be none?

Yes, I think so. Most of the differences (about 80-90%) seem to be in authorships, but the actual match differs in around 500 cases. If you filter by changes in family, you can see some oddities that I think are bugs in the port.

Is the current one being compared the gbif prod live nub matching service?

Yes. The comparison is with prod occurrence data, which I'm assuming has been matched to the data in https://www.checklistbank.org/dataset/53147/about

djtfmartin commented 2 months ago

I've switched to a new pipeline for the comparison (see #1313). Ignoring matches that differ only in authorship and focusing on changes in higher taxonomy (i.e. family), there are approximately 350 differences:

https://www.dev.checklistbank.org/tools/gbif-impact?csv=all-services-53147.txt&colKey=53147

Working through these issues now.
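The filtering described above could look roughly like the sketch below: it sets aside rows where only the authorship differs and keeps rows where the family changed. The `Comparison` class and its field names are hypothetical; they are not the actual columns of the impact tool's CSV or of the comparison pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: field names are assumptions, not the real comparison output format.
class FamilyChangeFilterSketch {

  /** One compared record: the match from the current service and the match from the ported one. */
  static final class Comparison {
    final String oldName;
    final String oldAuthorship;
    final String oldFamily;
    final String newName;
    final String newAuthorship;
    final String newFamily;

    Comparison(String oldName, String oldAuthorship, String oldFamily,
               String newName, String newAuthorship, String newFamily) {
      this.oldName = oldName;
      this.oldAuthorship = oldAuthorship;
      this.oldFamily = oldFamily;
      this.newName = newName;
      this.newAuthorship = newAuthorship;
      this.newFamily = newFamily;
    }
  }

  /** True if name and family agree and only the authorship differs. */
  static boolean authorshipOnly(Comparison c) {
    return Objects.equals(c.oldName, c.newName)
        && Objects.equals(c.oldFamily, c.newFamily)
        && !Objects.equals(c.oldAuthorship, c.newAuthorship);
  }

  /** True if the two services place the record in different families (null-safe). */
  static boolean familyChanged(Comparison c) {
    return !Objects.equals(c.oldFamily, c.newFamily);
  }

  /** Keeps only the rows whose family differs between the two services. */
  static List<Comparison> familyChanges(List<Comparison> rows) {
    List<Comparison> changed = new ArrayList<>();
    for (Comparison c : rows) {
      if (familyChanged(c)) {
        changed.add(c);
      }
    }
    return changed;
  }
}
```

Using `Objects.equals` keeps the comparison null-safe, which matters when one of the services returns a match without a classification.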


mdoering commented 2 months ago

Some of those might even be better results. The Typhlocybinae example is odd. It was matched correctly before, but to something without a classification. Now it only hits the family, but it matches the input classification well. Sometimes it's hard to decide what result we actually want...