akvo / akvo-lumen

Make sense of your data
https://akvo.org/akvo-lumen
GNU Affero General Public License v3.0

Dashboard not loading #2763

Closed janagombitova closed 4 years ago

janagombitova commented 4 years ago

From support: https://akvoo.reamaze.com/admin/conversations/akvolumen-support-request-number-366224-from-boroma-at-akvo-dot-org

[Screenshot attached, 2020-07-15 09:03:51]

On mdi.akvolumen.org the dashboard: "Riposte Au COVID-19 Au Mali - Effets À Court Terme Des Activités De Communication" is not loading in the editor or when shared. On the Console I get errors pointing to maps and then a 504.

janagombitova commented 4 years ago

More details here: https://docs.google.com/document/d/19TW6KwhqkMi48DCR2u-FMYvrjjq3As7IQYAM8CgHrIE/edit#

dlebrero commented 4 years ago

The issue is not the number of visualisations, but the number of map visualisations, because of the "stampede" of requests needed to load those maps.

For example, https://demo.akvolumen.org/dashboard/5dfbb7ff-e627-4cfc-afcf-c87021d9ed41 has 35 visualisations, of which 8 are maps.

Loading the public dashboard, which contains all the data for all visualisations except the maps, takes 500 ms. This is reasonable.

But after loading the dashboard, the 8 maps make ~240 concurrent requests to the backend to draw the maps. This is what is causing the DB to choke. Note that this is just for one single user.
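
If we need to see the "stampede" from the DB side, something like this (a sketch against the built-in pg_stat_activity view) run while the dashboard loads should show the burst of connections per tenant database:

    -- Sketch: count backends per database and state while the dashboard loads.
    -- A burst of 'active' connections on the tenant DB is the "stampede".
    SELECT datname, state, count(*) AS connections
    FROM pg_stat_activity
    GROUP BY datname, state
    ORDER BY connections DESC;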

I will look at what the bottleneck on those requests is, but it would be good to change the FE code to lazily load the maps as the user scrolls down the dashboard.

janagombitova commented 4 years ago

@marvinkome can you take a look at Dan's suggestion?

dlebrero commented 4 years ago

Looking further into this, I think I have been looking at the wrong problem, as there seem to be a bunch of conflated issues.

I see nothing in the Lumen-Maps metrics and logs that could explain why the DB was at 100% CPU for several hours.

After broadening my search, I see something very suspicious in the Lumen-Backend usage of the DB connections:

[Screenshot of DB connection usage, 2020-07-28 09:32:27]

So basically, mostly for the mdi tenant, sometimes a DB connection is in use for more than 2 hours. My expectation for a healthy service would be at most a few seconds.

This means that the Lumen backend starts some work and that work hammers the DB for several hours. During those hours, I see that the Lumen-Backend runs out of DB connections for the mdi DB, which means that there is probably not just one job hammering the DB but several.
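
One way to confirm this next time it happens is to list the sessions by how long their transaction or current query has been running (a sketch using the standard pg_stat_activity view, not something we have run yet):

    -- Sketch: find the backends that have been working for hours.
    SELECT pid, datname, usename, state,
           now() - xact_start  AS xact_age,
           now() - query_start AS query_age,
           left(query, 80)     AS current_query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY xact_start NULLS LAST;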

Looking at the GC logs:

[Screenshot of GC activity, 2020-07-28 09:32:37]

This reinforces the hypothesis of the Lumen backend doing a lot of work.

To debug the issue further, we need to trigger the error again and see the Lumen-Backend thread logs and what queries the DB is executing.

dlebrero commented 4 years ago

The offending dashboard only has one map, so it is unlikely that the issue is related to many maps.

dlebrero commented 4 years ago

The issue was that the dataset table needed to be vacuumed.

Found the issue by:

  1. Visit the offending dashboard again.
  2. Look in the ElephantSQL console for the slow queries.
  3. Run an "explain analyze" of one of the slow queries (roughly as sketched below).
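
Step 3 is roughly this shape (the real dashboard query is longer; SELECT * is just a stand-in):

    -- Sketch of step 3: run the slow statement under EXPLAIN ANALYZE to see
    -- where the time goes.
    EXPLAIN ANALYZE
    SELECT * FROM ds_6ee63ed8_95fa_4100_89bc_eba4b053fe3e;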

The "explain analyze" showed:

            ->  Seq Scan on ds_6ee63ed8_95fa_4100_89bc_eba4b053fe3e  (cost=0.00..116647.15 rows=3435 width=11) (actual time=901.432..12333.207 rows=3064 loops=1)

So a table scan of ~3k rows was taking more than 10 secs. The expectation was a few milliseconds.

A Google search for "postgres seq scan slow small table" suggested doing a vacuum. I did a full+analyze vacuum and the "explain analyze" now shows:

           Seq Scan on ds_6ee63ed8_95fa_4100_89bc_eba4b053fe3e  (cost=0.00..826.26 rows=3364 width=11) (actual time=0.061..2.291 rows=3064 loops=1)

So a couple of milliseconds.
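
For reference, the fix was along these lines (the exact invocation may have differed; VERBOSE just prints the page counts):

    -- Full + analyze vacuum of the dataset table.
    VACUUM (FULL, VERBOSE, ANALYZE) ds_6ee63ed8_95fa_4100_89bc_eba4b053fe3e;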

Dashboard now loads.

I don't know why a manual vacuum was needed, as it should be performed automatically. Maybe it has something to do with the other issue regarding the duplicated dataset id that @tangrammer fixed.
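
To check whether autovacuum ever ran on the table, something like this should tell us (standard pg_stat_user_tables columns, the same view used further down this thread):

    -- Sketch: when did (auto)vacuum last touch the table, and how many dead
    -- tuples does Postgres currently think it has?
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'ds_6ee63ed8_95fa_4100_89bc_eba4b053fe3e';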

dlebrero commented 4 years ago

It happened again today.

Upon looking into the issue further, we noticed that the offending dataset has 170 transformations.

This means that after an update, the dataset table ends up with 170 transformations * 3k rows ≈ 500k dead rows. This would explain the performance issue and why the table scan is so slow.

Unfortunately we could not reproduce the issue. Updating the dataset works and we see the dead rows, but Postgres runs the vacuuming automatically within seconds of loading the dashboard.

We noticed that this morning the dataset update failed. Maybe this failure is what is causing Postgres not to do the vacuuming.
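
For the record, with default settings autovacuum should kick in long before ~500k dead rows pile up on a ~3k-row table, so the settings are worth double-checking next time (a sketch against the built-in pg_settings view):

    -- Autovacuum triggers roughly when dead rows exceed
    --   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * live rows,
    -- i.e. with the defaults (50 + 0.2 * 3000) only ~650 rows for this table.
    SELECT name, setting
    FROM pg_settings
    WHERE name LIKE 'autovacuum%';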

janagombitova commented 4 years ago

MDI reported another dashboard that is not loading: https://mdi.akvolumen.org/s/JACB8a7l3iU. From what I remember, it has 1 map and, according to Abdoulaye and Olivier, under 100 visualisations.

I tried it last night and it loaded fine but this morning I got a 504.

[Screenshot of the 504 error, 2020-08-04 08:32:06]

There are 3 client improvements we need to make as a result of this issue; I will create separate issues for them.

janagombitova commented 4 years ago

And just now I see the comment from @dlebrero from last night. Is there anything we can do to address this?

dlebrero commented 4 years ago

Looking at the issue with the last dashboard mentioned, we see again that a full scan of the ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746 table is very slow.

Doing a SELECT * FROM pg_stat_user_tables where relname = 'ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746' order by n_tup_upd DESC; shows that there are 0 dead rows.
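
Even with 0 dead rows reported, the table can still be physically bloated; comparing the on-disk size with the row count makes that visible (a sketch using pg_class and pg_relation_size):

    -- Sketch: ~3.6k rows should not need 126719 pages (~1 GB at 8 kB/page).
    SELECT relname, relpages, reltuples,
           pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
    FROM pg_class
    WHERE relname = 'ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746';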

But doing a FULL vacuum fixes the performance. The vacuum log shows:

INFO:  vacuuming "public.ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746"
INFO:  "ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746": found 0 removable, 3622 nonremovable row versions in 126719 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.56s/0.70u sec elapsed 19.45 sec.
INFO:  analyzing "public.ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746"
INFO:  "ds_ae6ce9c1_eff3_4f63_b082_66f99c30b746": scanned 848 of 848 pages, containing 3622 live rows and 0 dead rows; 3622 rows in sample, 3622 estimated total rows

So even if there are no dead rows, Postgres still needs to scan 126719 pages. After the vacuuming, Postgres just needs to scan 848 pages. That is roughly a 150x difference.

Doing a regular vacuum on another table with the same performance issue shows that a regular vacuum does not "compact" the table. Two executions of vacuuming show:

 VACUUM (VERBOSE, ANALYZE) ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291;
 ....
 INFO:  "ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291": found 0 removable, 2 nonremovable row versions in 1 out of 123101 pages
 .....

Querying this table before a full vacuum:

tenant_mdi=# explain analyze select * from ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291;

 Seq Scan on ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291  (cost=0.00..123137.75 rows=3675 width=2054) (actual time=338.537..1849.011 rows=3622 loops=1)
 Planning time: 1.260 ms
 Execution time: 1862.721 ms

And after:

tenant_mdi=# explain analyze select * from ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291;

 Seq Scan on ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291  (cost=0.00..870.22 rows=3622 width=2052) (actual time=0.007..1.040 rows=3622 loops=1)
 Planning time: 1.104 ms
 Execution time: 1.274 ms

To summarise, a dataset with a few (~3k) rows and a lot of (~170) transformations is slow and needs a full vacuum.
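
One possible mitigation to consider (my assumption, not something agreed in this thread) is making autovacuum more aggressive on the dataset tables so that dead rows from repeated transformations are reclaimed sooner, keeping in mind that only VACUUM FULL actually shrinks a file that is already bloated:

    -- Possible per-table tuning (example values, not tested in production).
    ALTER TABLE ds_d460b80b_62cf_4d09_a0b8_7027ac6fd291
      SET (autovacuum_vacuum_scale_factor = 0.01,
           autovacuum_vacuum_threshold = 100);
    -- Note: plain (auto)vacuum only marks space as reusable; reclaiming the
    -- 120k+ pages already allocated still requires a VACUUM FULL.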

janagombitova commented 4 years ago

@tangrammer can we close this issue? Is it done?

janagombitova commented 4 years ago

@tangrammer I assume we can close this issue now. If not, please comment and reopen. Good job on handling the issue 👍