BritishGeologicalSurvey / stratigraph

Network stratigraphy through text mining
GNU Lesser General Public License v3.0

Add entity linking & improve the name -> URL matching #2

Open metazool opened 4 years ago

metazool commented 4 years ago

We'll move some of the workings of the internal entity-resolver project into this one. It offered links between extracted names, essentially by fuzzy string matching on stemmed names. It was a straight port of the Java original, but there are potentially better libraries and techniques in Python that we could use instead.
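For illustration, a minimal sketch of fuzzy matching on stemmed names using only the standard library. This is not the entity-resolver code: `crude_stem`, `stem_name`, `best_match` and the 0.85 threshold are hypothetical, and the "stemmer" only strips a trailing plural "s" (a real stemmer could be swapped in).

```python
from difflib import SequenceMatcher


def crude_stem(word):
    """Very rough stand-in for a stemmer: drop a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_name(name):
    """Lower-case a name and stem each whitespace-separated word."""
    return " ".join(crude_stem(word) for word in name.lower().split())


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None below the threshold."""
    target = stem_name(name)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, target, stem_name(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

With this sketch, `best_match("Mercia Mudstones", ["Lias Group", "Mercia Mudstone"])` picks `"Mercia Mudstone"`, because both names stem to the same string.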

https://github.com/BritishGeologicalSurvey/geo-ner-model/releases/tag/v0.3 - this should now be a Docker image with a CoreNLP server that does our custom named entity extraction, and which we can now run in a pipeline with GitHub Actions. It doesn't do the linking, though.

metazool commented 3 years ago

#5 does most of this and has had a surface review. Rather than storing a file of name-to-link mappings locally, it uses the SPARQL endpoint on data.bgs.ac.uk to collect the data before running. The name matching techniques could still be improved.
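As a rough sketch of collecting name-to-link mappings from a SPARQL endpoint (not the code under review): the request follows the standard SPARQL protocol and JSON results format, but the query, the `?name` / `?url` variable names and the use of `rdfs:label` are assumptions, and the endpoint URL is left as a parameter because its exact path isn't given here.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical query: assumes names are exposed as rdfs:label values.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?url ?name WHERE { ?url rdfs:label ?name }
"""


def fetch_results(endpoint_url, query=QUERY):
    """Run a SELECT query over HTTP GET and return the parsed JSON results."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        f"{endpoint_url}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def bindings_to_lookup(results, name_var="name", url_var="url"):
    """Turn SPARQL JSON results into a {name: url} dictionary."""
    lookup = {}
    for row in results["results"]["bindings"]:
        # Skip rows where either variable is unbound.
        if name_var in row and url_var in row:
            lookup[row[name_var]["value"]] = row[url_var]["value"]
    return lookup
```

Keeping the parsing separate from the HTTP call means the name-to-link step can be tested against a saved response without touching the network.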

metazool commented 3 years ago

We should add a local cache of the data collected from the SPARQL endpoint, as it's slow and inefficient to collect and discard it every time the name matching is run from a script.
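One possible shape for that cache, as a sketch only: `load_cached`, the JSON file format and the one-day default expiry are all assumptions, and `fetch` stands for whatever function currently collects the data from the endpoint.

```python
import json
import time
from pathlib import Path


def load_cached(cache_path, fetch, max_age_seconds=24 * 60 * 60):
    """Return cached data if the file is fresh enough, else call fetch() and save it.

    `fetch` is any zero-argument callable returning JSON-serialisable data.
    """
    path = Path(cache_path)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Using the file's modification time for expiry keeps the cache to a single file; deleting the file forces a fresh collection on the next run.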

metazool commented 3 years ago

Comment from @rachelheaven via Teams after review of missing names (linked in the comments on #12): We can tweak the similarity matcher to ignore the phrase starting "[Obsolete Name And Code", to be case insensitive, to ignore whitespace (I'll check that wouldn't cause any unwanted matches), and to ignore things in brackets.
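A minimal sketch of those tweaks as a normalisation step applied to both names before comparing them. `normalise_name` is hypothetical, not the existing matcher, and it assumes the obsolete-name phrase runs to the end of the string and that "brackets" covers both round and square ones.

```python
import re

# Assumption: everything from "[Obsolete Name And Code" onwards is dropped.
OBSOLETE_RE = re.compile(r"\[obsolete name and code.*", re.IGNORECASE | re.DOTALL)
# Non-nested (...) and [...] groups.
BRACKETED_RE = re.compile(r"\([^)]*\)|\[[^\]]*\]")
WHITESPACE_RE = re.compile(r"\s+")


def normalise_name(name):
    """Strip obsolete-name notes, bracketed text and whitespace, then casefold."""
    name = OBSOLETE_RE.sub("", name)
    name = BRACKETED_RE.sub("", name)
    name = WHITESPACE_RE.sub("", name)
    return name.casefold()
```

For a made-up input such as `"Example Clay Formation [Obsolete Name And Code: see XYZ]"` this gives `"exampleclayformation"`. Removing all whitespace is the most aggressive of the four tweaks, so it is the one most likely to produce unwanted matches.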