BritishGeologicalSurvey / stratigraph

Network stratigraphy through text mining
GNU Lesser General Public License v3.0

Filter by Formation when returning rock units by geochronology #22

Closed metazool closed 3 years ago

metazool commented 3 years ago

I added a SPARQL query against a Fuseki store to return linked rock units corresponding to a geological age, optionally filtered to include only those that have Formation rank.

This includes an integration test which depends on having the current BGS Lexicon Linked Data and our Jurassic sample both loaded into a local Fuseki database named stratigraph. This test also runs in CI, collecting the Jurassic Lexicon data from data.bgs.ac.uk and adding the .ttl file of text-mined relations in this project.

HOWEVER the query is clearly off: the test shows it is returning fewer subjects when filtering harder, and I worry I'm misunderstanding the data. Any feedback before we go any further, and especially improvements to the queries, would be appreciated.
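For anyone following along, here is a minimal sketch of how an optional Formation filter could be toggled when building such a query. It is not the code in stratigraph/store.py: the `ex:` prefix, the `ex:hasAge` and `ex:hasRank` predicates, and the `build_units_query` name are all placeholders I made up, not the real BGS Lexicon vocabulary.

```python
from string import Template

# Hypothetical vocabulary: the ex: terms below are placeholders,
# not the real BGS Lexicon predicates.
_QUERY = Template("""\
PREFIX ex: <http://example.org/placeholder#>
SELECT DISTINCT ?unit WHERE {
  ?unit ex:hasAge <$age> .
$rank_filter}
""")

# Extra triple pattern added only when restricting to Formation rank.
_FORMATION_ONLY = "  ?unit ex:hasRank ex:Formation .\n"


def build_units_query(age_uri: str, formations_only: bool = False) -> str:
    """Build a SPARQL query string for rock units of a given age."""
    rank_filter = _FORMATION_ONLY if formations_only else ""
    return _QUERY.substitute(age=age_uri, rank_filter=rank_filter)


if __name__ == "__main__":
    print(build_units_query("http://example.org/age/Jurassic", formations_only=True))
```

Because the filter only adds a triple pattern, the filtered result should always be a subset of the unfiltered one, which is one cheap invariant an integration test could assert.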

metazool commented 3 years ago

https://github.com/BritishGeologicalSurvey/stratigraph/runs/1536862159?check_suite_focus=true - you can finally see the failing test on the SPARQL queries here (the queries are in stratigraph/store.py)

rachelheaven commented 3 years ago

Note that the query will need to be amended slightly so that the objects of the upper and lower relationships are generalised to their parent formation units if they are members. That is a separate issue from the failing test though, which I will continue to look into.

metazool commented 3 years ago

I could understand if we were missing data because we're only querying for subjects which have both upper and lower boundary relations, and there will be missing cases where there is only one. It might make more sense to collect all the subjects in the given era, optionally filtered by Formation type, regardless of whether they have any upper/lower links, and then optionally filter out the detached ones when we construct the networkx graph. The query in the data-loading script alongside the integration test effectively does this (e.g. "give me all triples for everything in the Jurassic, no matter what it is").
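A rough sketch of the "collect everything, then drop the detached ones" idea, in plain Python so it runs without dependencies. The `split_connected` helper and the unit names are made up for illustration; with networkx itself, `networkx.isolates` yields the degree-zero nodes of a graph and would do the same job.

```python
def split_connected(units, edges):
    """Return (connected, detached) units, given (source, target) edge pairs."""
    linked = set()
    for source, target in edges:
        linked.add(source)
        linked.add(target)
    # Preserve the input order of units in both result lists.
    connected = [u for u in units if u in linked]
    detached = [u for u in units if u not in linked]
    return connected, detached


# Made-up unit names, for illustration only.
units = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]  # e.g. upper/lower boundary links
connected, detached = split_connected(units, edges)
```

Keeping the detached list around, rather than discarding it in the query, would also make it easy to report which units have no boundary relations at all.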

rachelheaven commented 3 years ago

test_store.py has invalid SPARQL syntax - will push a fix for that shortly

metazool commented 3 years ago

I am happy to merge this if you are, @rachelheaven and @kerberpolis.

The dotfile output from the API based on this is returning a large collection of nodes but no edges. There's quite a lot in here already, though, and I would also like to do a small overhaul collecting up all the data.bgs.ac.uk namespace references to keep them all in stratigraph/ns.py.
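As a sketch of what that overhaul might look like, one module could own every namespace string so queries never hard-code them. Only the data.bgs.ac.uk host comes from this thread; the `http` scheme, the path segments, and the `NAMESPACES` and `uri` names are placeholders, not the real Linked Data layout.

```python
# Sketch of a possible stratigraph/ns.py. Only the host data.bgs.ac.uk is
# taken from the thread; the scheme and every path segment are placeholders.
BGS_BASE = "http://data.bgs.ac.uk/"

NAMESPACES = {
    "lexicon": BGS_BASE + "placeholder/lexicon/",
    "geochron": BGS_BASE + "placeholder/geochronology/",
}


def uri(prefix: str, local_name: str) -> str:
    """Expand a short prefix and a local name into a full URI string."""
    return NAMESPACES[prefix] + local_name
```

An unknown prefix raises `KeyError`, so a typo in a namespace name would fail loudly instead of producing a query that silently matches nothing.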