factsmission / synospecies

Using Plazi Data to find currently accepted scientific names
https://synospecies.plazi.org/
MIT License

Handling of failed requests #128

Closed nleanba closed 5 months ago

nleanba commented 6 months ago

Currently, if a SPARQL request fails, synospecies just ignores it, leading to inconsistent/missing results. This happens frequently for species for which many requests need to be made (due to lots of synonyms), e.g. T. rex.

Synospecies should handle this better.

nleanba commented 6 months ago

Indicate failures to the user (if no retry/retry failed)

Not easily possible, due to the separation of synospecies and synogroup

Retry requests (after a short delay)

Experimental findings in synogroup:

I will add the relevant code to synogroup so that it retries (after a 502) up to 4 times, starting with a 50 ms gap and doubling the wait for each retry.
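A minimal sketch of the retry schedule described above (up to 4 retries, first gap 50 ms, doubling each time). All names here (`withRetries`, `backoffDelays`, `is502`) are hypothetical and are not taken from synogroup; the request itself is passed in as a function that is assumed to throw an object carrying the HTTP status on failure.

```typescript
// Hypothetical helper names; this is an illustration, not synogroup's actual code.
type Attempt<T> = () => Promise<T>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Waits before retry 1, 2, ...: 50, 100, 200, 400 ms with the defaults. */
function backoffDelays(retries = 4, firstDelayMs = 50): number[] {
  return Array.from({ length: retries }, (_, i) => firstDelayMs * 2 ** i);
}

/** Assumption: a failed attempt throws an object with a numeric `status`. */
const is502 = (error: unknown): boolean =>
  typeof error === "object" &&
  error !== null &&
  (error as { status?: number }).status === 502;

async function withRetries<T>(
  attempt: Attempt<T>,
  shouldRetry: (error: unknown) => boolean = is502,
  delays: number[] = backoffDelays(),
): Promise<T> {
  for (const delay of delays) {
    try {
      return await attempt();
    } catch (error) {
      // Errors that are not retryable are passed on immediately.
      if (!shouldRetry(error)) throw error;
      await sleep(delay);
    }
  }
  // Last attempt after all waits are used up; its error reaches the caller.
  return attempt();
}
```

With the defaults this makes at most five attempts in total (one initial attempt plus four retries), and the waits add up to 750 ms in the worst case.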

The downside of this approach is that it increases the number of requests made. Given that the core issue seems to be that the server struggles with receiving too many requests in too short a time frame, this is not ideal.

@retog ideas on the last point, i.e. reducing the number of queries sent, are very welcome.

nleanba commented 6 months ago

Hmm. The delays don't work for synospecies because the 502 errors have no CORS headers, and thus the JS code only gets an opaque error.

This should be fixable by removing the check for the status code. I don't think this check was really necessary anyway.
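To illustrate why the status check gets in the way: in a browser, a cross-origin response without CORS headers makes `fetch` reject with a `TypeError`, so the code never sees the 502. The sketch below compares a status-based retry decision with one that retries on any failure. The `Outcome` type and both function names are made up for this example.

```typescript
// Hypothetical model of what browser code can observe for one request attempt.
type Outcome =
  | { kind: "response"; status: number }
  | { kind: "rejected"; error: unknown }; // e.g. the TypeError from a CORS failure

/** Retries only on a visible 502; misses failures that arrive without CORS headers. */
function shouldRetryOn502(outcome: Outcome): boolean {
  return outcome.kind === "response" && outcome.status === 502;
}

/** Retries on any failure: a rejected request or any non-2xx status. */
function shouldRetryOnAnyFailure(outcome: Outcome): boolean {
  return (
    outcome.kind === "rejected" ||
    outcome.status < 200 ||
    outcome.status >= 300
  );
}
```

The trade-off is that the second variant also retries requests that will never succeed (for example a malformed query), which costs a few extra requests before giving up.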

nleanba commented 5 months ago

Potential idea to reduce the number of requests made:

Currently, for each taxon concept, there is one request gathering all relevant treatments. I think it should be possible to reduce this:

  1. We could combine them into one big query which gets all treatments of all synonyms (and which taxon concepts and taxon names, i.e. tcs & tns, they define/augment/deprecate/treat/cite), and then assign the treatments to the synonyms in JS.

    • number of requests is reduced by (Number of Synonyms) - 1
    • For any given treatment, either all or none of the info is present
    • need to check if this is significantly slower (might depend on the endpoint)
  2. Alternatively, we might combine the “get Treatments” step into the “get next round of synonyms” step, gathering all the treatments of a synonym together with the synonym itself.

    • maybe just get three collated properties containing a list of treatment URIs, and gather the treatment metadata (authors, date, title, images) together with the material citations
    • number of requests is reduced by (Number of Synonyms)
    • at least the number of treatments per synonym is known sooner, which would allow some skeleton UI to appear faster, reducing the amount that content moves around as new data is loaded (each synonym can already reserve the vertical space needed for its treatments).
    • need to check if this is significantly slower (might depend on the endpoint)
nleanba commented 5 months ago

E.g. SPARQL for 1.:

PREFIX cito: <http://purl.org/spar/cito/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ?tc (group_concat(DISTINCT ?aug;separator="|") as ?augs) (group_concat(DISTINCT ?def;separator="|") as ?defs) (group_concat(DISTINCT ?dpr;separator="|") as ?dprs) (group_concat(DISTINCT ?cite;separator="|") as ?cites) WHERE {
  <http://taxon-concept.plazi.org/id/Animalia/Sadayoshia_miyakei_Baba_1969> ((^treat:deprecates/(treat:augmentsTaxonConcept|treat:definesTaxonConcept))|((^treat:augmentsTaxonConcept|^treat:definesTaxonConcept)/treat:deprecates))* ?tc .
  OPTIONAL { ?aug treat:augmentsTaxonConcept ?tc . }
  OPTIONAL { ?def treat:definesTaxonConcept ?tc . }
  OPTIONAL { ?dpr treat:deprecates ?tc . }
  OPTIONAL { ?cite cito:cites ?tc . }
}
GROUP BY ?tc

(The variant without the group_concats seems significantly faster; this needs investigation.)
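A sketch of the "assign the treatments to the synonyms in JS" step for the query above, assuming the endpoint returns the standard SPARQL 1.1 JSON results format and that the `|`-separated columns keep the names `augs`, `defs`, `dprs` and `cites`. The function and type names are hypothetical.

```typescript
// One cell / one row of `results.bindings` in the SPARQL JSON results format.
type Cell = { type: string; value: string };
type Binding = Record<string, Cell | undefined>;

interface TreatmentsOfConcept {
  augs: string[];
  defs: string[];
  dprs: string[];
  cites: string[];
}

// An empty group_concat yields "", which must become [] rather than [""].
const splitUris = (cell?: Cell): string[] =>
  cell && cell.value !== "" ? cell.value.split("|") : [];

/** Maps each taxon concept URI to the treatment URIs found for it. */
function treatmentsByConcept(
  bindings: Binding[],
): Map<string, TreatmentsOfConcept> {
  const result = new Map<string, TreatmentsOfConcept>();
  for (const row of bindings) {
    const tc = row.tc?.value;
    if (!tc) continue; // skip rows without a taxon concept
    result.set(tc, {
      augs: splitUris(row.augs),
      defs: splitUris(row.defs),
      dprs: splitUris(row.dprs),
      cites: splitUris(row.cites),
    });
  }
  return result;
}
```

Because the query groups by `?tc`, each taxon concept appears in at most one row, so a plain `set` per row is enough here.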

or

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT ?treat ?how ?tc ?date (group_concat(DISTINCT ?creator;separator="; ") as ?creators) WHERE {
  <http://taxon-concept.plazi.org/id/Animalia/Sadayoshia_edwardsii_Miers_1884> ((^treat:deprecates/(treat:augmentsTaxonConcept|treat:definesTaxonConcept))|((^treat:augmentsTaxonConcept|^treat:definesTaxonConcept)/treat:deprecates))* ?tc .
  ?treat (treat:augmentsTaxonConcept|treat:definesTaxonConcept|treat:deprecates|cito:cites) ?tc ;
          dc:creator ?creator ;
          ?how ?tc .
  OPTIONAL {
    ?treat treat:publishedIn/dc:date ?date .
  }
}
GROUP BY ?treat ?how ?tc ?date
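For this second variant, each result row is one (treatment, relation, taxon concept) combination, so the client has to group rows by `?tc`. A possible sketch, again assuming the SPARQL 1.1 JSON results format and the variable names used in the query; `groupByConcept`, `TreatmentRef` and `localName` are hypothetical names.

```typescript
type Cell = { type: string; value: string };
type Row = Record<string, Cell | undefined>;

interface TreatmentRef {
  uri: string;
  how: string; // local name of the linking predicate, e.g. "deprecates"
  date?: string;
  creators: string[];
}

// Strips everything up to the last "#" or "/" of a predicate URI.
const localName = (uri: string): string => uri.replace(/^.*[#\/]/, "");

/** Collects, per taxon concept URI, the treatments that refer to it. */
function groupByConcept(rows: Row[]): Map<string, TreatmentRef[]> {
  const byConcept = new Map<string, TreatmentRef[]>();
  for (const row of rows) {
    const tc = row.tc?.value;
    const uri = row.treat?.value;
    const how = row.how?.value;
    if (!tc || !uri || !how) continue; // incomplete row, nothing to assign
    const list = byConcept.get(tc) ?? [];
    list.push({
      uri,
      how: localName(how),
      date: row.date?.value, // stays undefined when the OPTIONAL did not match
      // The query joins creators with "; ".
      creators: (row.creators?.value ?? "").split("; ").filter((c) => c !== ""),
    });
    byConcept.set(tc, list);
  }
  return byConcept;
}
```

Note that splitting on `"; "` would mis-split a creator name that itself contains that sequence; a separator that cannot occur in names would be more robust.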
nleanba commented 5 months ago

E.g. SPARQL for 2., here for the deprecating synonyms:

# Get synonyms deprecating taxon and all relevant treatments
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT
  ?tn ?tc ?justification (group_concat(DISTINCT ?auth; separator=" / ") as ?authority) (group_concat(DISTINCT ?treat; separator="|") as ?treats)
WHERE {
  ?justification treat:deprecates <http://taxon-concept.plazi.org/id/Animalia/Sadayoshia_edwardsii_Miers_1884> ;
         (treat:augmentsTaxonConcept|treat:definesTaxonConcept) ?tc .
  ?tc <http://plazi.org/vocab/treatment#hasTaxonName> ?tn .
  OPTIONAL { ?tc dwc:scientificNameAuthorship ?auth }
  OPTIONAL {
    ?treat (treat:augmentsTaxonConcept|treat:definesTaxonConcept|treat:deprecates|cito:cites) ?tc .
  }
  OPTIONAL {
    ?treat (treat:citesTaxonName|treat:treatsTaxonName) ?tn .
  }
}
GROUP BY ?tn ?tc ?justification

or with distinguishing types of treatments:

# Get synonyms deprecating taxon and all relevant treatments
PREFIX cito: <http://purl.org/spar/cito/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX treat: <http://plazi.org/vocab/treatment#>
SELECT DISTINCT
  ?tn ?tc (group_concat(DISTINCT ?auth; separator=" / ") as ?authority) (group_concat(DISTINCT ?justification; separator="|") as ?justs) (group_concat(DISTINCT ?aug;separator="|") as ?augs) (group_concat(DISTINCT ?def;separator="|") as ?defs) (group_concat(DISTINCT ?dpr;separator="|") as ?dprs) (group_concat(DISTINCT ?cite;separator="|") as ?cites) (group_concat(DISTINCT ?trtn;separator="|") as ?trtns) (group_concat(DISTINCT ?citetn;separator="|") as ?citetns)
WHERE {
  ?justification treat:deprecates <http://taxon-concept.plazi.org/id/Animalia/Munida_edwardsii_Miers_1884> ;
         (treat:augmentsTaxonConcept|treat:definesTaxonConcept) ?tc .
  ?tc <http://plazi.org/vocab/treatment#hasTaxonName> ?tn .
  OPTIONAL { ?tc dwc:scientificNameAuthorship ?auth }
  OPTIONAL { ?aug treat:augmentsTaxonConcept ?tc . }
  OPTIONAL { ?def treat:definesTaxonConcept ?tc . }
  OPTIONAL { ?dpr treat:deprecates ?tc . }
  OPTIONAL { ?cite cito:cites ?tc . }
  OPTIONAL { ?trtn treat:treatsTaxonName ?tn . }
  OPTIONAL { ?citetn treat:citesTaxonName ?tn . }
}
GROUP BY ?tn ?tc

The latter is quite fast on either endpoint.

nleanba commented 5 months ago

Also, it turns out that synogroup sends some requests multiple times because it mismanages which synonyms are already being handled. This is fixed in https://github.com/plazi/synolib/pull/9, but for some context:

https://synospecies.plazi.org/#Doryphoribius+zyxiglobus makes 966 SPARQL requests to the selected backend (for a total of 14 taxon concepts), many of which are duplicates (I found one that was sent 45 times!)

The consolidation of queries will reduce the number of queries sent, but being smarter about not sending duplicates will probably have a much bigger impact.

(https://github.com/plazi/synolib/pull/9 reduces the total number of queries sent for Doryphoribius zyxiglobus to 60.)
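One generic way to avoid sending the same query twice is to remember the promise of each query that has already been started and hand it out again. The sketch below shows that idea only; it is not a description of what the linked pull request does, and `dedupe` is a hypothetical name.

```typescript
/**
 * Wraps a query function so that identical query strings share one request.
 * `run` stands for whatever function actually sends the SPARQL query.
 */
function dedupe<T>(
  run: (query: string) => Promise<T>,
): (query: string) => Promise<T> {
  const started = new Map<string, Promise<T>>();
  return (query: string): Promise<T> => {
    let pending = started.get(query);
    if (!pending) {
      pending = run(query);
      started.set(query, pending);
      // Forget failed requests so that a later call can try again.
      pending.catch(() => started.delete(query));
    }
    return pending;
  };
}
```

Successful results stay in the map for the lifetime of the wrapper, so this doubles as a simple cache; for a single synonym-group lookup that is probably what is wanted, while a long-lived page might need some eviction.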