CSIRO-enviro-informatics / loci-cache-scripts

A collection of tools to assist in the building of the loci-cache

Test linkset cc<>mb integration with an instance of loci cache #11

Closed jyucsiro closed 4 years ago

benjaminleighton commented 5 years ago

Spun up a new local test graphdb on internal machine using loci-cache-scripts/docker/cache modified to only load the cc<>mb linkset. No obvious errors on ingest. Counts like

    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX o: <http://www.w3.org/1999/02/22-rdf-syntax-ns#object>
    PREFIX p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate>
    PREFIX s: <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>
    select (count(distinct(?s)) as ?count)
    where {
        ?stmt dct:isPartOf <http://linked.data.gov.au/dataset/mb16cc> .
        ?stmt o: ?o .
        ?stmt p: ?p .
        ?stmt s: ?s .
        ?s a <http://www.opengis.net/ont/geosparql#Feature>
    }

are significantly different between the two caches: "old cache": 505948, "new cache": 204177

dumped linkset triples like

    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX o: <http://www.w3.org/1999/02/22-rdf-syntax-ns#object>
    PREFIX p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate>
    PREFIX s: <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>
    select ?s ?p ?o
    where {
        ?stmt dct:isPartOf <http://linked.data.gov.au/dataset/mb16cc> .
        ?stmt o: ?o .
        ?stmt p: ?p .
        ?stmt s: ?s .
        ?s a <http://www.opengis.net/ont/geosparql#Feature>
    } group by ?s ?p ?o

from both caches

investigations show that

$ grep -e '1000010000' sort-old.csv 
http://linked.data.gov.au/dataset/asgs2016/meshblock/10000100000,http://www.opengis.net/ont/geosparql#sfWithin,http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12102056
http://linked.data.gov.au/dataset/asgs2016/meshblock/11000010000,http://www.opengis.net/ont/geosparql#sfWithin,http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12107924
http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12102056,http://www.opengis.net/ont/geosparql#sfContains,http://linked.data.gov.au/dataset/asgs2016/meshblock/10000100000
http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12107924,http://www.opengis.net/ont/geosparql#sfContains,http://linked.data.gov.au/dataset/asgs2016/meshblock/11000010000
$ grep -e '1000010000' sort-new.csv 
http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12102056,http://www.opengis.net/ont/geosparql#sfContains,http://linked.data.gov.au/dataset/asgs2016/meshblock/10000100000
http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12107924,http://www.opengis.net/ont/geosparql#sfContains,http://linked.data.gov.au/dataset/asgs2016/meshblock/11000010000

Some mb sfWithin cc triples in the old cache are not present in the new cache.
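The grep above spot-checks one ID pattern. A minimal sketch of the same check across a whole dump, assuming the headerless `subject,predicate,object` CSV layout shown in the grep output: for every sfContains row it looks for the matching inverse sfWithin row. The function names and the `NEW_SAMPLE` constant are hypothetical and are not part of loci-cache-scripts; the sample rows are the two lines grep returned from sort-new.csv.

```python
import csv
import io

SF_CONTAINS = "http://www.opengis.net/ont/geosparql#sfContains"
SF_WITHIN = "http://www.opengis.net/ont/geosparql#sfWithin"

# The two rows returned by grep on sort-new.csv in this thread.
NEW_SAMPLE = (
    "http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12102056,"
    "http://www.opengis.net/ont/geosparql#sfContains,"
    "http://linked.data.gov.au/dataset/asgs2016/meshblock/10000100000\n"
    "http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12107924,"
    "http://www.opengis.net/ont/geosparql#sfContains,"
    "http://linked.data.gov.au/dataset/asgs2016/meshblock/11000010000\n"
)


def read_triples(handle):
    """Read (subject, predicate, object) rows from a headerless CSV dump."""
    return {tuple(row) for row in csv.reader(handle) if len(row) == 3}


def missing_inverses(triples):
    """Return sfWithin triples implied by sfContains rows but absent from the dump."""
    expected = {(o, SF_WITHIN, s) for s, p, o in triples if p == SF_CONTAINS}
    return sorted(expected - triples)


if __name__ == "__main__":
    # Swap io.StringIO(NEW_SAMPLE) for open("sort-new.csv", newline="") to scan a full dump.
    for triple in missing_inverses(read_triples(io.StringIO(NEW_SAMPLE))):
        print(",".join(triple))
```

Run against the sample, this reports both meshblock sfWithin catchment rows as missing, which matches the grep comparison above.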

benjaminleighton commented 5 years ago

I believe this is because the meshblocks dataset wasn't loaded, because an environment variable wasn't set. As a result, the preconditioning statements that build the inverse sfWithin relationships did not create new triples. I've modified docker-compose to pass through the environment variables in my local test environment and I'm trying to recreate the cache.

benjaminleighton commented 5 years ago

    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX o: <http://www.w3.org/1999/02/22-rdf-syntax-ns#object>
    PREFIX p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate>
    PREFIX s: <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>
    select ?s ?p ?o
    where {
        ?stmt dct:isPartOf <http://linked.data.gov.au/dataset/mb16cc> .
        ?stmt o: ?o .
        ?stmt p: ?p .
        ?stmt s: ?s .
        ?s a <http://www.opengis.net/ont/geosparql#Feature>
    } group by ?s ?p ?o

now showing identical number of rows from new and old caches.

$ grep -e '1000010000' sort-new.csv

now returning same 4 results as in old file

benjaminleighton commented 5 years ago

Found a bug with https://github.com/CSIRO-enviro-informatics/loci-testdata/wiki/Test-case-B via the https://github.com/CSIRO-enviro-informatics/loci-scripts tests: meshblocks are always considered to be inside catchments, even when in reality the relationship is the inverse and catchments are in meshblocks.

jyucsiro commented 5 years ago

@benjaminleighton could you clarify where the bug is in Test case B? or maybe provide some concrete examples?

benjaminleighton commented 5 years ago

Yes, the particular problem is that tests/integration/test_loci_reapportioning_logic.py::test_loci_list_of_contains_matches_for_cc[loci-test-case-B-http:/linked.data.gov.au/dataset/asgs2016/meshblock/80044260000-matchingdata2] and other similar test_loci_list_of_contains_matches_for_cc tests fail with incorrect expected counts. This in turn is because, for a meshblock like http://linked.data.gov.au/dataset/asgs2016/meshblock/80044260000, there are no relationships in the current linkset that specify the contained catchments. @ashleysommer and I have been chatting about it, and this is likely an error in the modifications I made to the type conversion when querying out the PostGIS intersections. I'm testing a better type conversion, and there should be a new commit on the existing pull request if the fix works.

benjaminleighton commented 5 years ago

pyloci tests are passing with broad failures equivalent to db.loci.cat

benjaminleighton commented 5 years ago

I'm not convinced I tested the right linkset. I think I confused things and uploaded the wrong old linkset, so I'm reopening this.

benjaminleighton commented 5 years ago

pyloci tests are looking reasonable when comparing db.loci.cat to an internal GraphDB populated with the linkset produced from automation; however, 1401514 statements are expected but 1401590 are present.

I've queried all meshblock to catchment triples, dumped them to N-Triples, then sorted and diffed them across the new and old data. A small number of mb predicate cc relationships are present in the new data and not in the old, and a smaller number are present in the old and not in the new. Looking at a few of these, I suspect they are all edge cases where minor differences in geometry or thresholding will include or exclude relationships between mb and cc.

For example

http://linked.data.gov.au/dataset/asgs2016/meshblock/10073060000,http://linked.data.gov.au/def/geox#transitiveSfOverlap,http://linked.data.gov.au/dataset/geofabric/contractedcatchment/12106406

is present in the new cache but not the old. It is a small meshblock just offshore of Merimbula, near the border of the contracted catchment.

In the old cache but not the new cache is

http://linked.data.gov.au/dataset/asgs2016/meshblock/70017200000,http://www.opengis.net/ont/geosparql#sfContains,http://linked.data.gov.au/dataset/geofabric/contractedcatchment/9559071

where the meshblock is an offshore island in NT and the contracted catchment is marginally within it and has likely been thresholded out (i.e. not within) in the new cache.

Overall there seem to be about 13 mb <> cc contains/within/transitiveOverlaps relationships present in the new cache and not in the old. There is only 1 relationship of this kind (see above) present in the old cache but not in the new.
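The sort-and-diff comparison described above can be sketched as a two-way set difference over the dumped lines. This is only an illustration of the idea, not the commands actually used in this thread; `diff_dumps`, `diff_files` and any file paths passed to them are hypothetical.

```python
def diff_dumps(old_lines, new_lines):
    """Compare two iterables of serialized triples, ignoring order and blank lines."""
    old = {line.strip() for line in old_lines if line.strip()}
    new = {line.strip() for line in new_lines if line.strip()}
    return {
        "only_in_old": sorted(old - new),
        "only_in_new": sorted(new - old),
    }


def diff_files(old_path, new_path):
    """Apply diff_dumps to two dump files, one triple per line (paths are placeholders)."""
    with open(old_path, encoding="utf-8") as old_handle, \
            open(new_path, encoding="utf-8") as new_handle:
        return diff_dumps(old_handle, new_handle)
```

The lengths of `only_in_new` and `only_in_old` correspond to the two counts quoted above, and the entries themselves are the candidates to inspect as geometry or thresholding edge cases.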

Possible explanations are: differences in source data geometry (version changes), thresholding differences between the automated scripts and those Ashley originally used, and differences in PostGIS functionality between the versions Ashley used and those used in the automated scripts.

All in all this isn't too worrying and I think we can close this and move on.

jyucsiro commented 5 years ago

Thanks @benjaminleighton - given that there are minor differences, possibly because of versions of the data, let's go with the automated scripts as a repeatable point of truth. Suggest we refactor the tests and/or loci-testdata counts to match the automated scripts.