dersteppenwolf opened this issue 4 years ago
Thanks @dersteppenwolf. Confirmed. There are multiple factors impacting performance, mainly:

The health checks themselves run multi-threaded, and may even run multi-process plus multi-threaded (synced via the DB).

But the main bottleneck, also affecting the webapp, is IMO the database. Though it may help to set some indexes like you pointed out, it has been acknowledged in longstanding issue #76 that the use of SQLAlchemy together with a huge `Run` table automatically issues multiple very inefficient SQL queries, mainly for all (summary) statistics. Some improvements have already been made, but #76 is still open. A summary (cache) table is one suggested solution, but not optimal. The `Run` table is in effect time-series data, so it may need special treatment or proper indexing. TimescaleDB may be another option.

Other related refs: #43 #94 #149 #232.
@dersteppenwolf tested your index settings. They indeed make a huge performance difference!

Apparently they can be applied to both SQLite and Postgres. I think we can apply these using SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/constraints.html#index-api, possibly by updating the tables defined in `models.py` (setting `index=True`). Hopefully that is enough; otherwise an Alembic migration also needs to be done for existing DBs.
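A minimal sketch of what such index definitions could look like with SQLAlchemy's Index API. The table and column names here are assumptions based on this thread, not the actual `models.py`:

```python
from sqlalchemy import (Boolean, Column, DateTime, Index, Integer,
                        MetaData, Table, create_engine, inspect)

metadata = MetaData()

# Hypothetical stand-in for the real Run model; column names follow
# the discussion in this thread, not the actual schema.
run = Table(
    'run', metadata,
    Column('identifier', Integer, primary_key=True),
    # index=True creates a single-column index on table creation
    Column('resource_identifier', Integer, index=True),
    Column('checked_datetime', DateTime, index=True),
    Column('success', Boolean),
)

# Composite indexes matching the common filter patterns
Index('ix_run_resource_success', run.c.resource_identifier, run.c.success)
Index('ix_run_resource_checked', run.c.resource_identifier, run.c.checked_datetime)

engine = create_engine('sqlite://')
metadata.create_all(engine)
names = sorted(ix['name'] for ix in inspect(engine).get_indexes('run'))
print(names)
```

For existing databases the same `Index` objects would go into an Alembic migration (`op.create_index`) rather than relying on `create_all`.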
I've reported a similar issue, not noticing that it was actually a duplicate (see #391, thanks @bart-v). With some preliminary testing I found that creating indexes improved the page speed by a factor of (at least) 5, and caching calculated results (while still using SQLAlchemy for filtering and counts) in the `Resource` class by another factor of 3-4.
This can probably be improved further by storing the runs in a Python list, but then `runs.count()`, `.filter_by()` and `.order_by()` need to be replaced by plain Python: `len()`, a list comprehension with `if`, and `sorted()`, respectively.
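A rough sketch of that mapping, using a hypothetical in-memory list of run records (the record class and data are made up for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for rows of the Run table.
@dataclass
class RunRecord:
    success: bool
    checked_datetime: datetime

runs = [RunRecord(i % 3 != 0, datetime(2021, 1, 1) + timedelta(hours=i))
        for i in range(10)]

total = len(runs)                                  # replaces runs.count()
failed = [r for r in runs if not r.success]        # replaces .filter_by(success=False)
by_time = sorted(runs, key=lambda r: r.checked_datetime)  # replaces .order_by(...)

print(total, len(failed), by_time[0].checked_datetime)
```

The trade-off is memory: all runs must be loaded once, after which filtering and sorting happen without further round-trips to the database.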
Regarding SQLite vs Postgres: without indexes I didn't perceive a notable difference (with several thousands of runs), but creating indexes in SQLite didn't seem to make as much of a difference as it did in Postgres. I'd recommend Postgres though. I also found that running the runner inside the web app made the webapp perform poorly, even though the runner can be seen to stop working while the page is being created. I'd also recommend setting the number of workers for the webapp to 4 (the default), especially if multiple users can view the web app.
Here is part of the change to the `Resource` model as an improvement:

```python
@property
def first_run(self):
    if not hasattr(self, '_first_run'):
        self._first_run = self.runs.order_by(
            Run.checked_datetime.asc()).first()
    return self._first_run
```
The new property:

```python
@property
def run_count(self):
    if not hasattr(self, '_run_count'):
        self._run_count = self.runs.count()
    return self._run_count
```
There is no need for a single resource to query the database more often than necessary. This code still uses SQLAlchemy, but its usage is minimized, thanks to the caching of the properties.
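A minimal, self-contained sketch of this caching pattern, with a hypothetical stand-in for the SQLAlchemy query object (the real `Resource` uses relationships), showing that the underlying query runs only once per instance:

```python
# Fake query object that counts how often it hits the "database".
class FakeQuery:
    def __init__(self):
        self.calls = 0

    def count(self):
        self.calls += 1
        return 42

class Resource:
    def __init__(self):
        self.runs = FakeQuery()

    @property
    def run_count(self):
        # Query the database only once, then reuse the cached value.
        if not hasattr(self, '_run_count'):
            self._run_count = self.runs.count()
        return self._run_count

r = Resource()
r.run_count
r.run_count
print(r.runs.calls)  # the query ran only once
```

On Python 3.8+, `functools.cached_property` achieves the same effect with less boilerplate.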
I've created two PRs (#393 and #394) addressing the issues mentioned above. I deliberately chose to create two separate PRs because the changes do not depend on each other: if one PR is not accepted in its current state, it does not block merging the other.
Regarding the improvement: the speed of generating a page containing a resource list also increases about five-fold, so the combined improvement is about 25 times. Of course this was measured very unscientifically, but with a large number of results and runs both improvements are obvious.
Just thinking out loud here: what happens in a 500-concurrent-user scenario? Would creating "static" PNG (e.g. Munin-style) or SVG files, automatically updated after a defined time period (the default harvesting time?), help instead of creating them on each page request?
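A minimal sketch of that idea, independent of any image library: regenerate the expensive artifact at most once per period and serve the cached copy in between. All names here are hypothetical:

```python
import time

class PeriodicCache:
    """Serve a cached artifact, re-rendering it at most once per period."""

    def __init__(self, render, period_seconds):
        self.render = render          # expensive callable, e.g. chart rendering
        self.period = period_seconds
        self._value = None
        self._stamp = float('-inf')   # force a render on first access

    def get(self):
        now = time.monotonic()
        if now - self._stamp >= self.period:
            self._value = self.render()  # runs at most once per period
            self._stamp = now
        return self._value

# Demo: three requests within one period trigger only one render.
calls = []
cache = PeriodicCache(lambda: calls.append(1) or b'<svg/>', period_seconds=60)
cache.get()
cache.get()
cache.get()
print(len(calls))  # rendered only once
```

Tying the period to the harvesting interval, as suggested above, would mean pages never show data staler than the checks themselves.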
FWIW, I added these indexes to Postgres, on top of the indexes suggested by @dersteppenwolf. This more than doubled our performance, but it's still not great.
```sql
CREATE INDEX ON run(resource_identifier, success);
CREATE INDEX ON run(resource_identifier, checked_datetime);
```
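A quick way to check that such composite indexes are actually picked up by the query planner, here shown in SQLite via Python's stdlib `sqlite3` (table and column names follow this thread; Postgres has `EXPLAIN` for the same purpose):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE run (resource_identifier INTEGER, '
            'success INTEGER, checked_datetime TEXT)')
con.execute('CREATE INDEX ix_rs ON run(resource_identifier, success)')
con.execute('CREATE INDEX ix_rc ON run(resource_identifier, checked_datetime)')

# The plan should report a SEARCH using one of the composite indexes
# instead of a full table scan.
plan = con.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT COUNT(*) FROM run WHERE resource_identifier = ? AND success = ?',
    (1, 1)).fetchall()
print(plan)
```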
**Describe the bug**
The application is really slow, or even stops working, if there is a high number of services configured and/or there are too many historical records in the database.

**To Reproduce**
Publish a high number of services (e.g. 800), keep 90 days of history and check services every hour. It looks like the home page tries to load too many records from the database while showing the general stats of the instance, and the web app fails.

**Expected Behavior**
The web app loads successfully.

**Suggestion**
In addition to reducing the number of days of history and the frequency of checks, it could help to include some indexes in the database.

For example: