dersteppenwolf opened this issue 4 years ago
Thanks @dersteppenwolf. Confirmed. There are multiple factors impacting performance, mainly:

The health checks themselves run multi-threaded, and may even run multi-process plus multi-threaded (synced via the DB).

But the main bottleneck, also affecting the webapp, is IMO the database. Though it may help to set some indexes like you pointed out, it has been acknowledged in longstanding issue #76 that the use of SQLAlchemy together with a huge `Run` table automatically issues multiple very inefficient SQL queries, mainly for all (summary) statistics. Some improvements have already been made, but #76 is still open. A summary (cache) table is one suggested solution, but not optimal. The `Run` table is in effect time-series data, so it may need special treatment or proper indexing. TimescaleDB may be another option.

Other related refs: #43 #94 #149 #232.
@dersteppenwolf tested your index settings. They indeed make a huge performance difference!

Apparently they can be applied to both SQLite and Postgres. I think we can apply these using SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/constraints.html#index-api, possibly by updating the tables defined in `models.py` (setting `index=True`). Hopefully that is enough; otherwise an Alembic migration also needs to be done for existing DBs.
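A minimal sketch of what such index definitions could look like with SQLAlchemy's Index API. The table and column names here are assumptions based on this thread, not the actual `models.py`:

```python
from sqlalchemy import (Boolean, Column, DateTime, Index, Integer,
                        MetaData, Table, create_engine, inspect)

metadata = MetaData()

# Hypothetical stand-in for the real Run model; column names follow
# the discussion in this thread, not the actual schema.
run = Table(
    'run', metadata,
    Column('identifier', Integer, primary_key=True),
    # index=True creates a single-column index on table creation
    Column('resource_identifier', Integer, index=True),
    Column('checked_datetime', DateTime, index=True),
    Column('success', Boolean),
)

# Composite indexes matching the common filter patterns
Index('ix_run_resource_success', run.c.resource_identifier, run.c.success)
Index('ix_run_resource_checked', run.c.resource_identifier, run.c.checked_datetime)

engine = create_engine('sqlite://')
metadata.create_all(engine)
names = sorted(ix['name'] for ix in inspect(engine).get_indexes('run'))
print(names)
```

For existing databases the same `Index` objects would go into an Alembic migration (`op.create_index`) rather than relying on `create_all`.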
I've reported a similar issue, not noticing that it was actually a duplicate (see #391, thanks @bart-v). With some preliminary testing I found that creating indexes improved the page speed by a factor of (at least) 5, and caching calculated results (while still using SQLAlchemy for filtering and counts) in the `Resource` class by another factor of 3-4.
This can probably be improved further by storing the runs in a Python list, but then `runs.count()`, `.filter_by()` and `.order_by()` need to be replaced by plain Python: `len()`, a list comprehension with `if`, and `sorted()`, respectively.
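A rough sketch of that mapping, using a hypothetical in-memory list of run records (the record class and data are made up for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for rows of the Run table.
@dataclass
class RunRecord:
    success: bool
    checked_datetime: datetime

runs = [RunRecord(i % 3 != 0, datetime(2021, 1, 1) + timedelta(hours=i))
        for i in range(10)]

total = len(runs)                                  # replaces runs.count()
failed = [r for r in runs if not r.success]        # replaces .filter_by(success=False)
by_time = sorted(runs, key=lambda r: r.checked_datetime)  # replaces .order_by(...)

print(total, len(failed), by_time[0].checked_datetime)
```

The trade-off is memory: all runs must be loaded once, after which filtering and sorting happen without further round-trips to the database.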
Regarding SQLite vs Postgres: without indexes I didn't perceive a notable difference (with several thousands of runs), but creating indexes in SQLite didn't seem to make as much of a difference as it did in Postgres. I'd recommend Postgres though. I also found that running the runner inside the web app made the webapp perform poorly, even though the runner can be seen to stop working while the page is being created. I'd also recommend setting the number of workers for the webapp to 4 (the default), especially if multiple users can view the web app.
Here is part of the change to the `Resource` model as an improvement:

```python
@property
def first_run(self):
    if not hasattr(self, '_first_run'):
        self._first_run = self.runs.order_by(
            Run.checked_datetime.asc()).first()
    return self._first_run
```
The new property:

```python
@property
def run_count(self):
    if not hasattr(self, '_run_count'):
        self._run_count = self.runs.count()
    return self._run_count
```
There is no need for a single resource to query the database more often than necessary. This code still uses SQLAlchemy, but its usage is minimized, thanks to the caching of the properties.
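A minimal, self-contained sketch of this caching pattern, with a hypothetical stand-in for the SQLAlchemy query object (the real `Resource` uses relationships), showing that the underlying query runs only once per instance:

```python
# Fake query object that counts how often it hits the "database".
class FakeQuery:
    def __init__(self):
        self.calls = 0

    def count(self):
        self.calls += 1
        return 42

class Resource:
    def __init__(self):
        self.runs = FakeQuery()

    @property
    def run_count(self):
        # Query the database only once, then reuse the cached value.
        if not hasattr(self, '_run_count'):
            self._run_count = self.runs.count()
        return self._run_count

r = Resource()
r.run_count
r.run_count
print(r.runs.calls)  # the query ran only once
```

On Python 3.8+, `functools.cached_property` achieves the same effect with less boilerplate.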
I've created two PRs (#393 and #394) addressing the issues mentioned above. I deliberately chose to create two separate PRs because the changes do not depend on each other: if one PR is not accepted in its current state, it does not block merging the other.
Regarding the improvement: the speed of generating a page containing a resource list also increases about five-fold, so the combined improvement is about 25 times. Of course this was measured very unscientifically, but with a large number of results and runs both improvements are obvious.
Just thinking out loud here: what happens in a 500-concurrent-user scenario? Would creating "static" PNG (e.g. Munin-style) or SVG files, automatically updated after a defined time period (the default harvesting time?), help instead of creating them on each page request?
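A minimal sketch of that idea, independent of any image library: regenerate the expensive artifact at most once per period and serve the cached copy in between. All names here are hypothetical:

```python
import time

class PeriodicCache:
    """Serve a cached artifact, re-rendering it at most once per period."""

    def __init__(self, render, period_seconds):
        self.render = render          # expensive callable, e.g. chart rendering
        self.period = period_seconds
        self._value = None
        self._stamp = float('-inf')   # force a render on first access

    def get(self):
        now = time.monotonic()
        if now - self._stamp >= self.period:
            self._value = self.render()  # runs at most once per period
            self._stamp = now
        return self._value

# Demo: three requests within one period trigger only one render.
calls = []
cache = PeriodicCache(lambda: calls.append(1) or b'<svg/>', period_seconds=60)
cache.get()
cache.get()
cache.get()
print(len(calls))  # rendered only once
```

Tying the period to the harvesting interval, as suggested above, would mean pages never show data staler than the checks themselves.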
FWIW, I added these indexes to Postgres, on top of the indexes suggested by @dersteppenwolf. This more than doubled our performance, but it's still not great.
```sql
CREATE INDEX ON run(resource_identifier, success);
CREATE INDEX ON run(resource_identifier, checked_datetime);
```
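A quick way to check that such composite indexes are actually picked up by the query planner, here shown in SQLite via Python's stdlib `sqlite3` (table and column names follow this thread; Postgres has `EXPLAIN` for the same purpose):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE run (resource_identifier INTEGER, '
            'success INTEGER, checked_datetime TEXT)')
con.execute('CREATE INDEX ix_rs ON run(resource_identifier, success)')
con.execute('CREATE INDEX ix_rc ON run(resource_identifier, checked_datetime)')

# The plan should report a SEARCH using one of the composite indexes
# instead of a full table scan.
plan = con.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT COUNT(*) FROM run WHERE resource_identifier = ? AND success = ?',
    (1, 1)).fetchall()
print(plan)
```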
**Describe the bug**
The application is really slow, or even stops working, if there is a high number of services configured and/or there are too many historical records in the database.

**To Reproduce**
Publish a high number of services (e.g. 800), keep 90 days of history and check services every hour. It looks like the home page tries to load too many records from the database while showing the general stats of the instance, and the web app fails.

**Expected Behavior**
The web app loads successfully.

**Suggestion**
In addition to reducing the number of days of history and the frequency of checks, it could help to include some indexes in the database.

For example: