IATI / D-Portal

http://d-portal.org/

d-portal is sometimes down for a few minutes #630

Closed andreaszenasidi closed 2 years ago

andreaszenasidi commented 2 years ago

A publisher reported that d-portal was not working this morning (October 26). I have also noticed that on some days, mainly in the morning, the tool is down for a few minutes: when you click any search option, it just shows 'Searching'.

Would be great to understand what might cause this issue.

notshi commented 2 years ago

Thanks, @andreaszenasidi for letting us know.

Do you know how long "a few minutes" was and at what time exactly, if you can remember? Has this been happening daily or just on a cluster of days? Precise information would help us track the intermittent downtime so we can investigate or replicate it.

One reason could be that the OCHA COVID-19 Funding Dashboard has been using d-query as its API, or it could be someone else. This is especially likely when the queries are complex and return large datasets.

Such queries can clog up the back end, and block any queries made on the front page. This sounds like what you are experiencing.

We have been providing many services out there, over the years, with our robust and reliable API, and with growing data needs, we will probably need some support for d-portal to maintain this.

andreaszenasidi commented 2 years ago

@notshi thanks for sharing the potential reason; the need for support to handle the high traffic is understandable. I have shared this with my team.

As for when the issue happened: yesterday it was reported to us at 09:16 am BST, and they said it was working again within half an hour. When I experienced this myself a few days ago, I did not track the exact time and duration. If I experience it again, I will update this ticket.

andreaszenasidi commented 2 years ago

@notshi This just happened again, the timeframe is: 8:36-8:50am GMT.
image

notshi commented 2 years ago

Thanks, @andreaszenasidi for reporting. Will see if anything comes up in the server logs for those timings.

notshi commented 2 years ago

Hi @andreaszenasidi, we've done a brief investigation and these are our findings.

The requests come from IP addresses belonging to an Amazon cloud service, which is downloading large datasets using Python scripts between 7 and 9 GMT.

They seem to be downloading all IATI data after a certain date (30 Nov 2017), split into requests by country, and all the downloads are in CSV format.

The problem is that some of these queries return a lot of data, and that is not what d-portal is designed for (i.e. we are not a datastore), so d-portal is unable to accommodate them.

We have a couple of suggestions - we could block that python script but it might block other people. And it looks like this is a genuine data need so we would rather not do that.

The other option is to switch to a streaming response which would vastly reduce the problems of large datasets being returned by queries.

This is our preferred option and would make it possible for us to cancel the data request part way through if it seems too large.
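To illustrate the idea, here is a minimal sketch of a streaming response with a size cap. It is not d-portal's actual code: the names `stream_csv`, `ResultTooLarge` and `max_rows` are hypothetical, and Python is used only for illustration. Rows are written out one at a time, so memory use stays flat, and the stream is cancelled part way through once the cap is exceeded.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


class ResultTooLarge(Exception):
    """Raised when a streamed result exceeds the configured row limit."""


def stream_csv(rows: Iterable[Sequence[object]], max_rows: int) -> Iterator[str]:
    """Yield CSV text one row at a time instead of building the whole file.

    `rows` can be any lazy iterable (for example a database cursor).
    `max_rows` is a hypothetical cap: once it is exceeded, the stream is
    cancelled part way through by raising ResultTooLarge.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for count, row in enumerate(rows, start=1):
        if count > max_rows:
            raise ResultTooLarge(f"cancelled after {max_rows} rows")
        writer.writerow(row)
        yield buffer.getvalue()
        # Reset the buffer so only one row is ever held in memory.
        buffer.seek(0)
        buffer.truncate(0)
```

A web handler would pass each yielded chunk straight to the client, and close the connection if `ResultTooLarge` is raised.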

andreaszenasidi commented 2 years ago

@notshi thank you for looking into this issue. I am looping @amy-silcock in.

amy-silcock commented 2 years ago

Thank you for investigating, Shi.

What would the streaming option involve, and how much resource would be needed to get this implemented? It would also be great if we could find out who the user is and what they need. As mentioned, dquery is not the ideal place to get this data from. Is there a way we could get a message to them asking them to contact us?

notshi commented 2 years ago

@amy-silcock The streaming option would just be a couple of days of development and allow us to slow down or cancel a request halfway through if it's too large or taking too long. This should help alleviate the server burden.

The only way to get a message to them is to add something to the data, which is not ideal.

At present, they are already facing errors and they haven't gotten in touch so it might be that they are not really paying attention.

In any case, the streaming option should stop it being a problem on our end when users do their data queries.

amy-silcock commented 2 years ago

Sounds good, yes to working on the streaming option now. We can implement this before the end of the year. Do send us an e-mail if you need resources for this.

siemvaessen commented 2 years ago

fyi, this provides an overview of a monitor we have in place for D-Portal (threshold set to 1 min) - https://stats.uptimerobot.com/rEyxoh8gVm/788866697

notshi commented 2 years ago

Thanks, @siemvaessen - interesting to see that uptime didn't seem to catch when d-portal was down on 5th Nov for 20 min but it seems to record the small reboots we do nightly.

notshi commented 2 years ago

d-portal now supports streaming so should not have issues when it comes to large results.

We are keeping an eye on the server and doing some minor tests, just in case. There are a lot of small changes made to the database and dQuery but these should not be noticeable to anyone using the front end.

However, do let us know if you're still facing any downtime issues or the like @andreaszenasidi @amy-silcock

AudreyIATI commented 2 years ago

Thanks for this Shi, we've had a message on Zendesk just now from a user who isn't getting any results:

Just to let you know that d-portal.org seems to be down at the moment and doesn’t return any activities. It’s been like that for roughly an hour now.

notshi commented 2 years ago

Thanks for letting us know, @AudreyIATI - will investigate further and see if we can find out why this is still happening.

notshi commented 2 years ago

@AudreyIATI We are not seeing any errors on the server or via various uptimerobots so not sure what is happening there. It could be DNS problems since we did switch recently.

When you say that d-portal.org was down, was that just for yourself or for other people too? As for the user on Zendesk, could we get more information about how they were not getting any results?

Does that mean they were able to access the website but nothing happens when they use the search filters?

Technically, there should not be any problems accessing the website unless the internet is down for the user or the server.

Currently the problem is catching the exact moment when the site is down and that is proving to be difficult.

ahokelsey commented 2 years ago

General user here. The error, where the site searches endlessly when I access it (see photos above by @andreaszenasidi), has been occurring for the last 3 hours. Appreciate that you're working on it!

notshi commented 2 years ago

Thanks @ahokelsey that's very helpful. I think we may have found out what it was.

It looks like server connections were being left hanging so eventually they would all fill up and it would wait a while before you were allowed to connect again.
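As an illustration of this kind of bug (a sketch under assumptions, not the actual d-portal fix), the example below models a server with a fixed number of connection slots. The names `ConnectionSlots` and `handle_request` are hypothetical. If a slot is not handed back when a client goes away mid-stream, the slots eventually all fill up; releasing in a `finally` block avoids that.

```python
import threading
from contextlib import contextmanager


class ConnectionSlots:
    """A fixed number of connection slots, similar to a server or database pool."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    @contextmanager
    def connection(self, timeout: float = 0.1):
        # Wait briefly for a free slot; give up rather than hang forever.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slots")
        try:
            yield
        finally:
            # Always hand the slot back, even if the client disconnects
            # or the handler raises part way through a streamed response.
            self._slots.release()


def handle_request(pool: ConnectionSlots, client_disconnects: bool) -> str:
    """Hypothetical request handler that may be interrupted mid-stream."""
    with pool.connection():
        if client_disconnects:
            raise ConnectionResetError("client went away mid-stream")
        return "ok"
```

With this pattern, any number of interrupted requests leaves the pool usable. Without the `finally`, a pool of two slots would be exhausted after two dropped connections.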

The fix is now live so hopefully, d-portal should be up and running again for everyone. And more importantly, hammering the refresh button should not affect the site.

@AudreyIATI Many thanks for your patience and reports. This bug has been triggered by the streaming change so is a different one from the original issue.

AudreyIATI commented 2 years ago

Thanks for the update, @notshi. So is the original issue resolved, or does it require further investigation?

notshi commented 2 years ago

Unless there are still issues with intermittent downtime, I'd say it is ok to close this.

AudreyIATI commented 2 years ago

Ok great - I'll raise another issue if anything comes up.

Thanks again!

andreaszenasidi commented 2 years ago

Thanks @notshi!