AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/
Other
9 stars 26 forks source link

Configurably reject all-records queries #350

Closed ansell closed 9 months ago

ansell commented 5 years ago

We cannot support many concurrent all-records queries because of infrastructure budget limitations. One source of long queries are queries for everything text:*, q=*, q=*:*. It would be useful to optionally blacklist these queries in biocache-service to prevent them being executed instead of letting them get to the timeout and users fail to get results in those cases anyway. User interfaces could be configured to detect these calls and reject them preemptively, but there are many UIs directly accessing biocache-service, so the blacklisting needs to be done at this level to be effective.

ansell commented 5 years ago

These queries also have the potential to affect bie-index performance through calls to

2019-01-04 11:19:46,389 [http-bio-8080-exec-373] INFO au.org.ala.biocache.service.SpeciesLookupRestService (SpeciesLookupRestService.java:396) - Requesting: https://bie.ala.org.au/ws/guid/batch?q=*

ansell commented 5 years ago

Another outage today was caused by this qid, which might as well be requesting everything even though it has two basic filters on it that don't cut it down very far https://biocache-ws.ala.org.au/ws/qid/1559327224101

ansell commented 5 years ago

Another short outage today caused by a similar query https://biocache-ws.ala.org.au/ws/qid/1567141555695

ansell commented 4 years ago

Another outage today caused by text:*

ansell commented 4 years ago

Also need to configurably block all of these generic spatial portal filters unless other more specific filters are included as biocache-service/solr cannot handle them and people need to be taught to be more specific in their queries:

https://biocache-ws.ala.org.au/ws/qid/1585004472884

{
  "rowKey": "1585004472884",
  "q": "geospatial_kosher:true",
  "displayString": "Spatial validity:Spatially valid",
  "wkt": "",
  "lastUse": 1585101191084,
  "fqs": [
    "-occurrence_status_s:absent",
    "longitude:[112 TO 154]",
    "latitude:[-44 TO -9]"
  ],
  "maxAge": -1
}
ansell commented 4 years ago

Also need a way of configurably blocking bogus queries caused by other ALA products that have coding bugs which are not being fixed in the relevant products such as the following:

text:undefined
text:null
qid:undefined
qid:null
nickdos commented 3 years ago

@alexhuang091 this issue is relevant to my post on Slack about NOT failing-over to *:* when no q param is detected.

I think a search of code for *:* might be needed to find all cases. Also note the configurably element to this change - needs to be implemented via feature flag so LA users can turn it off or we can back it out via config.

alexhuang091 commented 3 years ago

Is q='*:*'the only way to search for all so we only need to handle this case? I mean what else can be in q to search for all

apart from

text:undefined
text:null
qid:undefined
qid:null

which effectively returns large amount of records

nickdos commented 3 years ago

I think a first change could be to change the way biocache currently handles an empty q param - which is to substitute in *:* for it. I think we do want allow users to enter *:* as there are reports on the dashboard which use this and blocking it would break those.

This would need changes in both biocache-service and biocache-hubs, as they both have code that does that substitution. Searching GH code for *:* is a good place to start.

So those cases above - I think we'd want to invalidate those requests and push back a BAD_REQUEST http code.

alexhuang091 commented 3 years ago

I think a first change could be to change the way biocache currently handles an empty q param - which is to substitute in *:* for it. I think we do want allow users to enter *:* as there are reports on the dashboard which use this and blocking it would break those.

This would need changes in both biocache-service and biocache-hubs, as they both have code that does that substitution. Searching GH code for *:* is a good place to start.

So those cases above - I think we'd want to invalidate those requests and push back a BAD_REQUEST http code.

Hi Nick, biocache-service doesn't substitute empty q with *:*,

Commented out this line in biocache-hubs causing q='' in searchParams

    if (!requestParams.q) {
       // requestParams.q = "*:*"
    }

The result is this:

截屏2021-04-15 下午4 24 20
nickdos commented 3 years ago

Bai found an issue where the q param was getting stripped off some POST calls to biocache-service (between SP sending it and arriving at biocache-service, via Nginx) and the results from biocache-service was to show all records, so it must be happening there somewhere. Note: hubs not involved at all in that case.

nickdos commented 3 years ago

Here we go https://biocache-ws.ala.org.au/ws/occurrences/search?pageSize=9

returns all records ad shows query: "?q=*%3A*", in response.

EDIT: I think SOLR might be doing this via the defaultQuery option... in ansible

alexhuang091 commented 3 years ago

Hi Nick,

https://github.com/AtlasOfLivingAustralia/biocache-service/blob/cf41784f1f57e39f7af7232ebd360ee966c34f6d/src/main/java/au/org/ala/biocache/dto/SearchRequestParams.java#L44

    @RequestMapping(value = {"/occurrences/search.json*","/occurrences/search*"}, method = RequestMethod.GET)
    public @ResponseBody SearchResultDTO occurrenceSearch(SpatialSearchRequestParams requestParams,
                                                          @RequestParam(value="apiKey", required=false) String apiKey,

If there's no q in the url, default *:* will be used. This happens upon service receiving the request.

Try this https://biocache-ws.ala.org.au/ws/occurrences/search?pageSize=9&q=


Biocache-hubs is a bit special, it has its own definition of SearchRequestParams and q='' is default, and it replaces any '' with *:*, so Biocache-hubs never sends empty q,


So the conclusion is:

  1. if no q, *:* will be used.
  2. if have a q, even if it's empty, it will be used.
nickdos commented 3 years ago

Agree. So I think we need 2 changes:

  1. Remove the substitution for '' to *:* in biocache-hubs
  2. Change the SOLR schema so there is no defaultQuery, so if no q, then we DON'T substitute *:*.
adam-collins commented 9 months ago

No longer an issue.