AtlasOfLivingAustralia / alerts

Alerts services
https://alerts.ala.org.au

Alert list timing out #319

Open turley85 opened 3 weeks ago

turley85 commented 3 weeks ago

I'm getting a 504 Gateway Time-out when previewing the alert:

BioSecurity alert for NSW_NPWS_Western_Weeds_list

nickdos commented 3 weeks ago

That URL is working for me now. Is it still showing 504 for you?

turley85 commented 3 weeks ago

Sorry, that URL was for the list.

It's the alert associated with that list that is failing... I can't get a URL directly to the alert itself to work sorry.

nickdos commented 3 weeks ago

Looking at logs, these errors are resulting:

2024-10-24 08:09:33.812 ERROR --- [.1-8080-exec-26] au.org.ala.alerts.BiosecurityService     : Server returned HTTP response code: 504 for URL: https://biocache.ala.org.au/ws/occurrences/search?q=%28genus%3A%22Alternanthera+philoxeroides%22%29+OR+%28species%3A%22Alternanthera+philoxeroides%22%29+OR+%28subspecies%3A%22Alternanthera+philoxeroides%22%29+OR+%28scientificName%3A%22Alternanthera+philoxeroides%22%29+OR+%28raw_scientificName%3A%22Alternanthera+philoxeroides%22%29&fq=-data_resource_uid%3A%22dr27665%22+AND+spatialObject%3A9433219+OR+spatialObject%3A9433227&fq=eventDate%3A%5B2024-05-23T14%3A00%3A00Z+TO+2024-10-23T21%3A08%3A33Z+%5D&fq=firstLoadedDate%3A%5B2024-10-20T13%3A00%3A00Z+TO+2024-10-23T21%3A08%3A33Z+%5D&pageSize=10000

Testing that URL manually resulted in a 504 Gateway Time-out, not the usual SOLR error you see when the spatial_object is too long.

I'm guessing the spatial_object is still to blame (too complex) and resulting in SOLR timing out or running out of memory.
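The logged URL is hard to read while percent-encoded. As a minimal sketch (standard library only, and using a shortened copy of the query string rather than the full URL), the parameters can be decoded like this to see exactly what was sent as fq:

```python
from urllib.parse import parse_qs

# Shortened copy of the query string from the log line above:
# only the first fq and the pageSize parameter are kept for readability.
logged_query = (
    "fq=-data_resource_uid%3A%22dr27665%22+AND+spatialObject%3A9433219"
    "+OR+spatialObject%3A9433227&pageSize=10000"
)


def decode_params(query_string: str) -> dict[str, list[str]]:
    """Decode a percent-encoded query string into {name: [values]}."""
    return parse_qs(query_string)


params = decode_params(logged_query)
for fq in params["fq"]:
    print(fq)
```

Repeated parameters (the real URL has three fq values) come back as a list under the same key, so each filter can be inspected on its own.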

UPDATE: I think the fq column might not be written correctly too. E.g. -data_resource_uid:"dr27665" AND spatialObject:9433219 OR spatialObject:9433227 - Boolean precedence means that the AND will take precedence over the OR, resulting in (effectively) (-data_resource_uid:"dr27665" AND spatialObject:9433219) OR spatialObject:9433227. So it will (effectively) return all the results that match spatialObject:9433227 due to the last OR.

I think the intended result should use: -data_resource_uid:"dr27665" AND (spatialObject:9433219 OR spatialObject:9433227).
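To illustrate the grouping, here is a rough sketch of building that fq with explicit parentheses so the result does not depend on how the query parser resolves AND/OR. The helper name `build_fq` is hypothetical and not part of the alerts or lists code:

```python
from urllib.parse import quote_plus


def build_fq(excluded_data_resource: str, spatial_object_ids: list[int]) -> str:
    """Build an fq that excludes one data resource and groups the
    spatialObject terms explicitly (hypothetical helper for illustration)."""
    if not spatial_object_ids:
        raise ValueError("at least one spatialObject id is required")
    exclusion = f'-data_resource_uid:"{excluded_data_resource}"'
    spatial_terms = [f"spatialObject:{pid}" for pid in spatial_object_ids]
    if len(spatial_terms) == 1:
        # A single term needs no parentheses.
        return f"{exclusion} AND {spatial_terms[0]}"
    return f"{exclusion} AND ({' OR '.join(spatial_terms)})"


fq = build_fq("dr27665", [9433219, 9433227])
print(fq)              # the readable form
print(quote_plus(fq))  # the form that would appear in a request URL
```

This only fixes the grouping. It does not make a complex spatialObject any cheaper for SOLR to evaluate.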

UPDATE 2: Reminder: the spatial object should be tested independently before using in a fq.

https://biocache.ala.org.au/ws/occurrences/search?q=spatialObject:9433227

results in

{
message: "Error from server at null: Expected mime type application/octet-stream but got application/json. {  "error":{    "metadata":[      "error-class","org.apache.solr.common.SolrException",      "root-error-class","org.apache.solr.common.SolrException"],    "msg":"application/x-www-form-urlencoded content length (74308530 bytes) exceeds upload limit of 32768 KB",    "code":400}}",
errorType: "Query syntax invalid",
statusCode: 400
}

FYI, we also advise that you do not combine spatialObject terms in a fq. By combining two spatialObjects you are, in effect, causing the same error shown above (internally it's like using one combined object).
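A small sketch of the "test the spatial object on its own" step, using only the Python standard library. The function names and the 60 second timeout are assumptions for illustration. The URL pattern is the one used above in this thread:

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

BIOCACHE_SEARCH = "https://biocache.ala.org.au/ws/occurrences/search"


def spatial_object_test_url(pid: int) -> str:
    """Build the standalone test URL for one spatialObject id."""
    query = urlencode({"q": f"spatialObject:{pid}"}, safe=":")
    return f"{BIOCACHE_SEARCH}?{query}"


def check_spatial_object(pid: int, timeout_seconds: float = 60.0) -> int:
    """Return the HTTP status code for the standalone query.

    HTTP error responses (e.g. 400 or 504) are returned as codes.
    Connection failures and client-side timeouts are not caught here
    and will raise.
    """
    try:
        with urllib.request.urlopen(
            spatial_object_test_url(pid), timeout=timeout_seconds
        ) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


print(spatial_object_test_url(9433227))
```

Calling `check_spatial_object(9433227)` makes a real request to biocache, so a 400 or 504 from it suggests the object itself is the problem before any fq is involved.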

kylie-m commented 3 weeks ago

Thanks for investigating Nick! Adding some other relevant background here, spatialObject:9433227 was one of the original shapefiles that was too complex and needed optimising. There's more info on ticket #246 but essentially there is an optimised version of it that I created, spatialObject:9439588. So that should be the one used in alerts.

Can be viewed and tested at: https://spatial.ala.org.au/?pid=9439588 https://biocache.ala.org.au/ws/occurrences/search?q=spatialObject:9439588

turley85 commented 3 weeks ago

@kylie-m @nickdos I just updated that spatial object from spatialObject:9433227 to https://spatial.ala.org.au/?pid=9439588. However, the alert still failed.

I note that the list actually runs off 3 shapefiles, so is one of the other two causing this issue too? Or is having multiple shapefiles itself causing the issue?

nickdos commented 3 weeks ago

Hi @turley85 - I saw this in the logs:

2024-10-24 12:56:59.648 ERROR --- [.1-8080-exec-30] au.org.ala.alerts.NotificationService : User or query not found for userId: null, queryId: BioSecurity alert for NSW_NPWS_Western_Weeds_list

Note the userId: null.

So I think you had the page loaded from earlier and then clicked the "Preview" or "Notify" but your login had expired. So try reloading the page and see if you're prompted to login again. And then try running it again.

turley85 commented 3 weeks ago

@nickdos Hmm, I just tried again. Closed the windows, logged out and then back into ALA and used "Preview" to test the alert again.

It failed again sorry :(

Let me know if there's something else I should have done to test!

kylie-m commented 3 weeks ago

I just tested https://biocache.ala.org.au/ws/occurrences/search?q=spatialObject:9433219 as well, so that spatialObject should be ok. I didn't spot a third one on the list though?

nickdos commented 3 weeks ago

Same error again: 2024-10-24 12:56:59.648 ERROR --- [.1-8080-exec-30] au.org.ala.alerts.NotificationService : User or query not found for userId: null, queryId: BioSecurity alert for NSW_NPWS_Western_Weeds_list.

Will look into it more.

nickdos commented 2 weeks ago

Seems the timeouts are causing the DB to error (as described in other ticket), so the DB lookup for the query ID subsequently fails.

So fix is to remove all but one spatialObject in the list fq column, and re-try.
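As a quick way to spot list entries that still need this fix, a sketch like the following could flag any fq value that references more than one spatialObject. The function names are hypothetical, and the check is a plain regular expression, not something the alerts service does itself:

```python
import re

# Matches terms such as spatialObject:9433219 and captures the numeric id.
SPATIAL_OBJECT = re.compile(r"spatialObject:(\d+)")


def spatial_object_ids(fq: str) -> list[str]:
    """Return every spatialObject id referenced in an fq string."""
    return SPATIAL_OBJECT.findall(fq)


def has_single_spatial_object(fq: str) -> bool:
    """True when the fq references at most one spatialObject."""
    return len(spatial_object_ids(fq)) <= 1


fq = '-data_resource_uid:"dr27665" AND spatialObject:9433219 OR spatialObject:9433227'
print(spatial_object_ids(fq))
print(has_single_spatial_object(fq))
```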

@turley85, we strongly recommend you take a copy of the list over to lists-test.ala.org.au and do the testing on alerts-test.ala.org.au before making changes on production servers.

kylie-m commented 2 weeks ago

@nickdos would a good additional workaround here be to combine the 2 spatial layers into one layer in QGIS first? No guarantees but I can give that a try, have done so for other work previously

nickdos commented 2 weeks ago

@nickdos would a good additional workaround here be to combine the 2 spatial layers into one layer in QGIS first?

@kylie-m - I think so. Combining spatialObjects adds an extra level of complexity and depending on how they are combined, could be worse than a single object. So simpler/safer to stick with a single spatialObject, as recommended by Adam.

kylie-m commented 2 weeks ago

Thanks Nick!

@turley85 I have merged the 2 layers in QGIS, then uploaded to Spatial portal.

In ala-test:
resulting object: spatialObject:21643483
test: https://api.test.ala.org.au/occurrences/occurrences/search?q=Acacia%20longifolia&fq=spatialObject%3A21643483
test in UI to view on map (will be quite slow): https://biocache-test.ala.org.au/occurrences/search?q=taxa&fq=spatialObject:21643483

is returning records within the new spatial object above, though the equivalent alert on test is not yet working - I'll keep trying, but @nickdos if you have any ideas, let me know! (https://lists-test.ala.org.au/speciesListItem/list/dr22890)

In production:
resulting object: spatialObject:9478102
test: https://biocache.ala.org.au/ws/occurrences/search?q=spatialObject:9478102

Alert is working for this test list: https://lists.ala.org.au/speciesListItem/list/dr28737 (Alert name: "wattle")


turley85 commented 2 weeks ago

@kylie-m @nickdos I've updated the NPWS Western list with spatialObject:9478102 in production and still getting 504 Gateway timeout sorry.

However, I did replicate @kylie-m's result with the wattle list:

76 new records for wattle, dr28737 since 16 Oct 2024

kylie-m commented 2 weeks ago

hmm I wonder if the spatialObject is too complex when in combination with a more complex query, but just ok with a simpler query, @nickdos ?

nickdos commented 2 weeks ago

@kylie-m I wondered the same thing - I think the additional terms for the OR'ed names might be pushing us over some threshold value. Only way to know is run the alert and look at logs, I think.

A 504 Gateway Time-out is usually an indicator that the biocache requests are timing out or erroring.