Open metazool opened 4 years ago
We should add a local cache of the data collected from the SPARQL endpoint, as it's slow and inefficient to collect and discard it every time the name matching is run from a script.
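A minimal sketch of what such a cache could look like, assuming the endpoint results can be serialised as JSON. The function name `load_or_fetch`, the `refresh` flag and the cache path are hypothetical and not part of the existing code; `fetch` stands in for whatever function currently queries the SPARQL endpoint.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], list], refresh: bool = False) -> list:
    """Return cached results if present, otherwise call fetch() and store them.

    `fetch` is a placeholder for the existing SPARQL query function (assumption).
    """
    if cache_path.exists() and not refresh:
        # Reuse the results saved by an earlier run
        return json.loads(cache_path.read_text(encoding="utf-8"))

    data = fetch()
    # Create the cache directory on first use, then save the results
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With something like this, a script would only hit the endpoint on its first run (or when called with `refresh=True`); deciding when the cached data is stale is left open here.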
Comment from @rachelheaven via Teams after review of missing names (linked in the comments on #12): we can tweak the similarity matcher to ignore the phrase starting "[Obsolete Name And Code", to be case insensitive, to ignore whitespace (I'll check that wouldn't cause any unwanted matches), and to ignore things in brackets.
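The tweaks listed above could be sketched as a normalisation step applied to both names before they are compared. This is only an illustration: `normalise_name` is a hypothetical helper, it assumes everything from "[Obsolete Name And Code" to the end of the string should be dropped, and it assumes "things in brackets" covers both square and round brackets. The names in the usage example are made up.

```python
import re

# Assumption: everything from the obsolete marker to the end of the name is ignored
OBSOLETE_MARKER = re.compile(r"\[obsolete name and code.*", re.IGNORECASE | re.DOTALL)
# Assumption: "things in brackets" means both [square] and (round) bracketed text
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")
WHITESPACE = re.compile(r"\s+")


def normalise_name(name: str) -> str:
    """Reduce a name to a form suitable for comparison."""
    name = OBSOLETE_MARKER.sub("", name)   # drop the obsolete-name phrase
    name = BRACKETED.sub("", name)         # drop remaining bracketed text
    name = WHITESPACE.sub("", name)        # ignore all whitespace
    return name.casefold()                 # case-insensitive comparison
```

For example, `normalise_name("Example Formation [Obsolete Name And Code: see other entry]")` and `normalise_name("EXAMPLE  formation (Member)")` both give `"exampleformation"`. Removing all whitespace is the step most likely to cause unwanted matches, so it may be worth making that optional.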
We'll move some of the workings of the internal entity-resolver project to this one. This offered links between extracted names, essentially by fuzzy string matching on stemmed names. It was a straight port of the Java original, but there are potentially better libraries and techniques in Python we could be using instead.
https://github.com/BritishGeologicalSurvey/geo-ner-model/releases/tag/v0.3 - this should now be a Docker image with a CoreNLP server which does our custom named entity extraction, and which we can now run in a pipeline with GitHub Actions. It doesn't do the linking, though.