Fix SPARQL required for GO-CAM website resource files

dustine32 commented 2 years ago

Carrying on with the work to overcome timeout issues with SPARQL queries called by the GO-CAM API. Similar to how we improved the models-by-GP query in https://github.com/geneontology/api-gorest/issues/3, we've still got two queries that are essential for the GO-CAM website to function but are currently timing out after 30 seconds:

QUERY 1: This one is meant to get a GO-CAM-to-GP lookup file: https://github.com/geneontology/api-gorest-2021/blob/480092ba3b74c31f30b27358f7069f4a9417f743/queries/sparql-models.js#L262-L293 The actual query after resolving the separator from the config.json:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT ?gocam   (GROUP_CONCAT(distinct ?identifier;separator="@@") as ?gpids)
             (GROUP_CONCAT(distinct ?name;separator="@@") as ?gpnames)
WHERE 
{
    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
        ?s enabled_by: ?gpnode .    
        ?gpnode rdf:type ?identifier .
        FILTER(?identifier != owl:NamedIndividual) .
        FILTER(!contains(str(?gocam), "_inferred"))
    }
    optional {
        ?identifier rdfs:label ?name
    }
    BIND(IF(bound(?name), ?name, ?identifier) as ?name)
}
GROUP BY ?gocam

QUERY 2: This one is meant to get a GO-CAM-to-GO-term lookup file: https://github.com/geneontology/api-gorest-2021/blob/480092ba3b74c31f30b27358f7069f4a9417f743/queries/sparql-models.js#L219-L259 Raw query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{
    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
        ?entity rdf:type owl:NamedIndividual .
        ?entity rdf:type ?goids
    }

    VALUES ?goclasses { BP: MF: CC: } .
    # rdf:type is faster than subClassOf+ but requires a filter
    # ?goids rdfs:subClassOf+ ?goclasses .
    ?entity rdf:type ?goclasses .

    # Filtering out the root BP, MF & CC terms
    filter(?goids != MF: )
    filter(?goids != BP: )
    filter(?goids != CC: )

    # then getting their definitions
    ?goids rdfs:label ?gonames .
    ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

@balhoff @kltm Any ideas how we can speed these up to return results in under 30 seconds? They don't need to run crazy fast as they typically only execute when triggered by a GO release (so ~once a month).

balhoff commented 2 years ago

@dustine32 for Query 1, if you can do the grouping on the client side, this will complete in 18 seconds:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT DISTINCT ?gocam ?identifier ?name
WHERE 
{
  GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?s enabled_by: ?gpnode .    
    ?gpnode rdf:type ?identifier .
    FILTER(?identifier != owl:NamedIndividual) .
  }
  OPTIONAL {
    ?identifier rdfs:label ?label
  }
  BIND(COALESCE(?label, ?identifier) AS ?name)
}

dustine32 commented 2 years ago

@balhoff Oh sweet! We can probably add a step here to handle the grouping of the results. Thanks!
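
A minimal sketch of what such a client-side grouping step could look like, assuming the endpoint returns the standard SPARQL JSON results shape (`results.bindings`, with each variable as an object carrying a `value`). The function name `groupByGocam` and the sample rows are hypothetical and are not taken from the api-gorest code; the "@@" separator is the one used in the original Query 1.

```javascript
// Hypothetical helper: collapse flat rows (one per gocam/identifier pair)
// into one record per GO-CAM, mirroring the two GROUP_CONCAT(DISTINCT ...)
// columns of the original Query 1.
function groupByGocam(bindings, separator = "@@") {
  const groups = new Map();
  for (const row of bindings) {
    const gocam = row.gocam.value;
    if (!groups.has(gocam)) {
      groups.set(gocam, { gpids: new Set(), gpnames: new Set() });
    }
    const group = groups.get(gocam);
    // Sets drop duplicates, which is what DISTINCT did inside GROUP_CONCAT.
    group.gpids.add(row.identifier.value);
    group.gpnames.add(row.name.value);
  }
  return Array.from(groups, ([gocam, { gpids, gpnames }]) => ({
    gocam,
    gpids: [...gpids].join(separator),
    gpnames: [...gpnames].join(separator),
  }));
}

// Made-up rows for illustration only (not real GO-CAM data).
const sampleBindings = [
  { gocam: { value: "http://model.geneontology.org/m1" }, identifier: { value: "http://example.org/gp/A" }, name: { value: "geneA" } },
  { gocam: { value: "http://model.geneontology.org/m1" }, identifier: { value: "http://example.org/gp/B" }, name: { value: "geneB" } },
  { gocam: { value: "http://model.geneontology.org/m1" }, identifier: { value: "http://example.org/gp/A" }, name: { value: "geneA" } },
  { gocam: { value: "http://model.geneontology.org/m2" }, identifier: { value: "http://example.org/gp/C" }, name: { value: "geneC" } },
];

const grouped = groupByGocam(sampleBindings);
console.log(JSON.stringify(grouped, null, 2));
```

Because the query no longer aggregates, the triplestore only has to stream distinct rows, and the grouping cost moves to a single pass over the result in the API process.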

balhoff commented 2 years ago

@dustine32 here is an 11.5 sec version of query 2:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{
  GRAPH ?gocam {
    ?gocam metago:graphType metago:noctuaCam  .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?entity rdf:type owl:NamedIndividual .
    ?entity rdf:type ?goids
  }
  VALUES ?goclasses { BP: MF: CC:  } . 
  # rdf:type is faster than subClassOf+ but requires a filter
  # ?goids rdfs:subClassOf+ ?goclasses .
  ?entity rdf:type ?goclasses .

  # Filtering out the root BP, MF & CC terms
  filter(?goids != MF: )
  filter(?goids != BP: )
  filter(?goids != CC: )

  # then getting their definitions
  ?goids rdfs:label ?gonames .
  ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

lpalbou commented 2 years ago

Query2

I tried both versions of query 2 and found no speed improvement (old query: 27s, then 34s on the second run; new query: 29s, then 39s). As these numbers show, the timing varies greatly depending on when the server receives the query.

It seems that the timeout of the RDF endpoint was increased (raising it to 60s would probably be a good idea for now?), so I also increased the timeout of the GO-CAM API itself: https://github.com/geneontology/api-gorest-2021/pull/3 . If you merge this PR, query 2 seems to run, and that should fix the cache created on AWS Lambda, since it uses https://api.geneontology.xyz/models/go.

Query1

For query 1, I still get a server timeout at 30s. I'm unsure why this is not the case for query 2; maybe there is some config to check on the RDF server? Removing the grouping on the RDF side does indeed give a much faster query (10s). Note that this is the query used to create the GPs cache on AWS Lambda: https://api.geneontology.xyz/models/gp . @dustine32 remember cloud9?

Notes

If you want to keep using a cache and avoid those timeouts, I would suggest running blazegraph-runner during the release and storing the files on the GO S3. Improving query performance is only a temporary fix, since more GO-CAMs will keep being created. Just be sure to compress the object in S3 (gocam-goterms.json is 10.4 MB, or 1.4 MB compressed) - example
Indexing GO-CAMs would solve that issue and many others, but for it to be used by third-party sites (e.g. Alliance), GOLr would need to be served over https. You could still use the GO API or GO-CAM API (https) as a proxy to deliver https responses from GOLr.
More to the point, @kltm, those caches were created in the first place to enable client-side GO-CAM search at a time when Ben's API didn't exist (which is why we are loading all terms and all GPs for all GO-CAMs). Now that we have proper server-side search, these caches could/should probably be deprecated in favor of server-side search, @tmushayahama. If you do that, the rest of the page only needs data for 10 models, which was always extremely fast through pagination.
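
The size difference mentioned in the first note comes from compressing the JSON before storing it. A minimal sketch using Node's built-in `zlib` module; the `cache` array below is placeholder data standing in for a file such as gocam-goterms.json, and `gzipJson` / `gunzipJson` are hypothetical helper names.

```javascript
const zlib = require("node:zlib");

// Placeholder records for illustration only (not real GO-CAM data).
const cache = Array.from({ length: 1000 }, (_, i) => ({
  gocam: `http://model.geneontology.org/example${i}`,
  goids: ["http://purl.obolibrary.org/obo/GO_0000000"],
}));

// Serialize to JSON and gzip the UTF-8 bytes.
function gzipJson(data) {
  return zlib.gzipSync(Buffer.from(JSON.stringify(data), "utf8"));
}

// Reverse of gzipJson: gunzip, then parse the JSON text.
function gunzipJson(buffer) {
  return JSON.parse(zlib.gunzipSync(buffer).toString("utf8"));
}

const rawBytes = Buffer.byteLength(JSON.stringify(cache), "utf8");
const compressed = gzipJson(cache);
console.log(`raw: ${rawBytes} bytes, gzipped: ${compressed.length} bytes`);
```

If the gzipped bytes are what gets uploaded, the object would normally also need `Content-Encoding: gzip` set in its metadata so that HTTP clients decompress it transparently.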

Happy holiday season to all ! 🎄🎉

dustine32 commented 2 years ago

Whoa, thanks again @lpalbou for all the advice! I'm now leaning towards your first note suggestion (using blazegraph runner during the release) but of course I also have to try the easy way out short-term.

Commit 5fe0a4b applies @balhoff's fix for Query1 (/models/gp) and moves handling of "group by gocam" results outside of the query, reusing @lpalbou's super-handy mergeResults function that was just lying there. Results are returned from the API in around 10 seconds.

I tried applying @balhoff's new Query2 but still ran into a timeout issue while testing the lambda locally:

Function 'GOREST' timed out after 30 seconds

Then I bumped this timeout from 30 to 60 sec in the template.yml: https://github.com/geneontology/api-gorest-2021/blob/3f2e9958286a63c24cd8ce599c762e56b40f5eff/template.yml#L21 This change at least got me to the next error:

Response payload size (10785926 bytes) exceeded maximum allowed payload size (6291556 bytes).

Looks like this 6MB limit is tied to an unchangeable AWS Lambda limit. There are some workarounds such as having the API immediately store the response payload in S3 then returning an S3 URL. This miiight work for us since our goal is to get it into S3 anyway, but it probably won't work for external users (then again, this route has been broken for a while so...). Also, the effort to implement this workaround might as well be spent coding blazegraph-runner calls into the release pipeline. Tagging @kltm.
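
A rough sketch of the store-then-return-a-URL workaround described above, not the actual GOREST handler. The limit constant is copied from the error message; `uploadToStore` is a hypothetical async function (for example, a thin wrapper around an S3 upload) that is injected so the sketch stays self-contained, and the `location` field name is likewise an assumption.

```javascript
// Limit copied from the Lambda error message quoted above.
const MAX_PAYLOAD_BYTES = 6291556;

// `uploadToStore` is hypothetical: it takes the serialized body, saves it
// somewhere (e.g. S3), and resolves to a URL where it can be fetched.
async function buildResponse(data, uploadToStore) {
  const body = JSON.stringify(data);
  const size = Buffer.byteLength(body, "utf8");

  // Small enough: return the payload inline as usual.
  if (size <= MAX_PAYLOAD_BYTES) {
    return { statusCode: 200, body };
  }

  // Too large for a Lambda response: store it and return a pointer instead.
  const location = await uploadToStore(body);
  return { statusCode: 200, body: JSON.stringify({ location, size }) };
}
```

External callers would have to follow the returned URL themselves, which is the compatibility concern raised above. The comparison here also ignores any overhead added by the response envelope, so a real check would likely need some margin below the limit.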

lpalbou commented 2 years ago

Glad if it helps, Dustin 🙂. I do think the longer-term solution would be blazegraph-runner, but in the meantime this may/should work. What really puzzles me is: how come we reach a 6 MB payload limit? That’s a lot; what are we sending? From memory, /models/go or /models/gp would, in the worst case, send a list of GO-CAM IDs, and by default they already do so for all of them, so am I missing something here?

PS: the “Winston” article was a lot of fun. I love AWS, but sometimes there are hard constraints that can really cause issues (e.g. CodePipeline can’t target an existing GH repo 😅).

kltm commented 2 years ago

[Note: documentation for manual hack of file update/upload while we work things out: https://docs.google.com/document/d/18vYy9sZq-dyjYWW0mnw3XpXRJjlI7pbQWvMlSSdXdjA/edit#heading=h.tzx1g6nhmgtd .]

kltm commented 2 years ago

Closing in favor of https://github.com/geneontology/pipeline/issues/265
