Closed timrobertson100 closed 6 years ago
A poolsize of 12 is not really excessive, so decreasing these will not help: https://github.com/gbif/gbif-configuration/blob/master/checklistbank-ws/prod/application.properties#L6
I think the leak detection is simply triggered because the sitemap query takes a long time and we have configured the ws (hikari) to assume its a leak after 60s https://github.com/gbif/gbif-configuration/blob/master/checklistbank-ws/prod/application.properties#L10
From Hikari:
This property controls the amount of time that a connection can be out of the pool before a message is logged indicating a possible connection leak. A value of 0 means leak detection is disabled. Lowest acceptable value for enabling leak detection is 2000 (2 seconds). Default: 0
We could increase the limit or deactivate it alltogether (which is the default we use probably everywhere else)
There are as many INFOs reporting that the connection in fact did not leak as there are WARNINGs about a potential leak, so the theory above sounds likely:
Previously reported leaked connection org.postgresql.jdbc.PgConnection@67837d28 was returned to the pool (unleaked)
So it seems we just have some slow postgres queries. Maybe some indices are bad. It is caused by various methods, not just the sitemap index.
NameUsageServiceMyBatis.listByCanonicalName(NameUsageServiceMyBatis.java:139)
SpeciesSitemapResource.lambda$sitemap$0(SpeciesSitemapResource.java:104)
NameUsageServiceMyBatis.listRoot(NameUsageServiceMyBatis.java:196)
SpeciesSitemapResource.sitemapIndex(SpeciesSitemapResource.java:75)
postgres logs are full with slow queries selecting by the canonical name (see 60% cause above). For some reason postgres uses an uneffective query plan when the canonical name exists many times in the dataset. E.g. when querying for "Abies" it takes 12ms to yield no records:
prod_checklistbank=> EXPLAIN analyze SELECT * FROM name_usage u join name n on u.name_fk=n.id WHERE u.deleted IS NULL AND lower(n.canonical_name) = lower('Abies') and u.dataset_key in ('4cec8fef-f129-4966-89b7-4f8439aba058') ORDER BY u.id LIMIT 20 OFFSET 0;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=4053.08..4053.08 rows=3 width=508) (actual time=11.098..11.098 rows=0 loops=1)
-> Sort (cost=4053.08..4053.08 rows=3 width=508) (actual time=11.097..11.097 rows=0 loops=1)
Sort Key: u.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.12..4053.05 rows=3 width=508) (actual time=11.085..11.085 rows=0 loops=1)
-> Index Scan using name_lower_idx on name n (cost=0.56..157.10 rows=80 width=205) (actual time=10.230..10.337 rows=19 loops=1)
Index Cond: (lower(canonical_name) = 'abies'::text)
-> Index Scan using name_usage_name_fk_idx on name_usage u (cost=0.56..48.69 rows=1 width=303) (actual time=0.038..0.038 rows=0 loops=19)
Index Cond: (name_fk = n.id)
Filter: ((deleted IS NULL) AND (dataset_key = '4cec8fef-f129-4966-89b7-4f8439aba058'::uuid))
Rows Removed by Filter: 2
Total runtime: 11.293 ms
(12 rows)
But when looking for "spec." the plan does not use the lower canonical name filter and it takes over 20s:
prod_checklistbank=> EXPLAIN analyze SELECT * FROM name_usage u join name n on u.name_fk=n.id WHERE u.deleted IS NULL AND lower(n.canonical_name) = lower('spec.') and u.dataset_key in ('4cec8fef-f129-4966-89b7-4f8439aba058') ORDER BY u.id LIMIT 20 OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.00..8204.53 rows=20 width=508) (actual time=13639.520..13639.693 rows=20 loops=1)
-> Nested Loop (cost=1.00..6169880.07 rows=15042 width=508) (actual time=13639.518..13639.684 rows=20 loops=1)
-> Index Scan using name_usage_pkey on name_usage u (cost=0.56..4597352.52 rows=587565 width=303) (actual time=13639.432..13639.456 rows=20 loops=1)
Filter: ((deleted IS NULL) AND (dataset_key = '4cec8fef-f129-4966-89b7-4f8439aba058'::uuid))
Rows Removed by Filter: 19104809
-> Index Scan using name_pkey on name n (cost=0.43..2.67 rows=1 width=205) (actual time=0.009..0.009 rows=1 loops=20)
Index Cond: (id = u.name_fk)
Filter: (lower(canonical_name) = 'spec.'::text)
Total runtime: 13639.887 ms
Pff, we have 693.344 name records with a canonical name = "spec.". That explains the problem, but makes it hard to find a simple solution
Many of these names are badly parsed and the canonical name is very wrong:
prod_checklistbank=> SELECT * FROM name n WHERE lower(n.canonical_name) = lower('spec.') limit 50;
id | scientific_name | canonical_name | type | genus_or_above | infra_generic | specific_epithet | infra_specific_epithet | cultivar_epithet | notho_type | authors_parsed | autho
---------+---------------------------------------------------+----------------+----------+----------------+---------------+------------------+------------------------+------------------+------------+----------------+------
4912135 | Gyrocarpus sp. Chase 317 | spec. | INFORMAL | Gyrocarpus | Chase | | | | | f |
4912136 | Gyrocarpus sp. DES-2011 | spec. | INFORMAL | Gyrocarpus | Des- | | | | | t |
4912338 | Gyrodactylus sp. AGV-2009a | spec. | INFORMAL | Gyrodactylus | Agv- | | | | | t |
4912339 | Gyrodactylus sp. AGV-2009b | spec. | INFORMAL | Gyrodactylus | Agv- | | | | | t |
4912343 | Gyrodactylus sp. Chile | spec. | INFORMAL | Gyrodactylus | Chile | | | | | t |
4912363 | Gyrodactylus sp. HSS-2009 | spec. | INFORMAL | Gyrodactylus | Hss- | | | | | t |
4912385 | Gyrodactylus sp. Ladoga | spec. | INFORMAL | Gyrodactylus | Ladoga | | | | | t |
4912386 | Gyrodactylus sp. Ladoga x Gyrodactylus pannonicus | spec. | HYBRID | | | | | | | f |
4912388 | Gyrodactylus sp. MBS-2014 | spec. | INFORMAL | Gyrodactylus | Mbs- | | | | | t |
4912390 | Gyrodactylus sp. MPV-2015 | spec. | INFORMAL | Gyrodactylus | Mpv- | | | | | t |
4912394 | Gyrodactylus sp. NKA-2015 | spec. | INFORMAL | Gyrodactylus | Nka- | | | | | t |
4912396 | Gyrodactylus sp. North Sea | spec. | INFORMAL | Gyrodactylus | North | | | | | t | Sea
4912398 | Gyrodactylus sp. Norway-HH-2003 | spec. | INFORMAL | Gyrodactylus | Norway- | | | | | f |
4912407 | Gyrodactylus sp. Poland-MZ-2003 | spec. | INFORMAL | Gyrodactylus | Poland- | | | | | f |
4912413 | Gyrodactylus sp. Zimbabwe | spec. | INFORMAL | Gyrodactylus | Zimbabwe | | | | | t |
4912428 | Gyrodactylus spec. Nordmann, 1832 | spec. | INFORMAL | Gyrodactylus | Nordmann | | | | | t |
4912565 | Gyrocotyle sp. Tasmania | spec. | INFORMAL | Gyrocotyle | Tasmania | | | | | t |
4913863 | Gyrodinium sp. GeoB 231 | spec. | INFORMAL | Gyrodinium | Geo | | | | | f |
4913996 | Gyromitra sp. Gyr3 | spec. | INFORMAL | Gyromitra | Gyr | | | | | f |
4914171 | Gyrodactylus pannonicus X Gyrodactylus sp. Ladoga | spec. | HYBRID | | | | | | | f |
4914296 | Gyrodactylus pomeraniae x Gyrodactylus lavareti | spec. | HYBRID | | | | | | | f |
4914360 | Gyroneuron sp. BOLD:AAI1989 | spec. | INFORMAL | Gyroneuron | Bold | | | | | t | Aai
4914380 | Gyroneuronella sp. AZR-2008 | spec. | INFORMAL | Gyroneuronella | Azr- | | | | | t |
4914558 | Gyrodontium sp. BAB-5180 | spec. | INFORMAL | Gyrodontium | Bab- | | | | | f |
4915887 | Gyrostemon sp. Cranfield 02068672 | spec. | INFORMAL | Gyrostemon | Cranfield | | | | | f |
4916067 | Gyroporus sp. AWW-2009a | spec. | INFORMAL | Gyroporus | Aww- | | | | | t |
4916070 | Gyroporus sp. Arora 00-429 | spec. | INFORMAL | Gyroporus | Arora | | | | | f |
4916072 | Gyroporus sp. Arora00-429 | spec. | INFORMAL | Gyroporus | Arora | | | | | f |
4916945 | Gyrovirus 4 | spec. | VIRUS | | | | | | | f |
4916951 | Gyrovirus GyV3 | spec. | VIRUS | | | | | | | f |
4916954 | Gyrovirus GyV7-SF | spec. | VIRUS | | | | | | | f |
4916956 | Gyrovirus GyV8 | spec. | VIRUS | | | | | | | f |
4916959 | Gyrovirus GyV9 | spec. | VIRUS | | | | | | | f |
4916962 | Gyrovirus Tu243 | spec. | VIRUS | | | | | | | f |
4916965 | Gyrovirus Tu789 | spec. | VIRUS | | | | | | | f |
4916977 | Gyrovirus: Chicken anemia virus ICTV | spec. | VIRUS | | | | | | | f |
4916980 | Gyrovirus: chicken anemia virus Ictv | spec. | VIRUS | | | | | | | f |
4917282 | Gyrophyllum sp. NTM-C014392 | spec. | INFORMAL | Gyrophyllum | Ntm- | | | | | f |
4917368 | Gyrtona sp. BOLD:AAI6410 | spec. | INFORMAL | Gyrtona | Bold | | | | | f |
4917371 | Gyrtona sp. Gyrt | spec. | INFORMAL | Gyrtona | Gyrt | | | | | t |
4917446 | Gyrtothripa sp. BOLD:AAH4619 | spec. | INFORMAL | Gyrtothripa | Bold | | | | | f |
4917650 | H-1 parvovirus | spec. | VIRUS | | | | | | | f |
4917660 | H-Pelican lacZ transformation vector | spec. | VIRUS | | | | | | | f |
4917662 | H-Stinger GFP transformation vector | spec. | VIRUS | | | | | | | f |
4918192 | HCBI8.215 virus | spec. | VIRUS | | | | | | | f |
I have increased caching of sitemaps from 30 minutes to 1 week to avoid postgres load (The backbone rarely changes): https://github.com/gbif/c-deploy/commit/c2dc6f3081e9d509bd1c16fa045449bf9b8b3c87
Nothing else to be done for now, closing. Improved name parsing is dealt with in the new above issue
I found a lot of these in ELK
This seemingly resulted in a lot of PG connections acquired by CLB and brought down the registry services on the same DB which all reported timeouts getting connections.
I think we need to find the cause of this leak, and also consider dropping pool sizes to not starve other apps.