Open Quuxplusone opened 9 years ago
Bugzilla Link | PR23474 |
Status | NEW |
Importance | P normal |
Reported by | Renato Golin (rengolin@gmail.com) |
Reported on | 2015-05-11 05:05:29 -0700 |
Last modified on | 2016-07-11 00:12:49 -0700 |
Version | unspecified |
Hardware | PC Linux |
CC | chris.matthews@apple.com, daniel@zuster.org, danielgb@au1.ibm.com, kristof.beyls@arm.com, kristof.beyls@gmail.com, rengolin@gmail.com, tobias@grosser.es, tonic@nondot.org |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
Daniel,
When database insert/update queries take too long, it's generally because there are too many indexes. With MySQL the rule is generally "less is more", but Postgres tends to be heavy-handed with foreign keys and unique columns.
Another thing that makes write queries slow is the number of columns / accessory tables linked to the table you're updating. If we could simplify the tables and relax the indexes a bit, we might get back to instantaneous updates.
I don't know much about Postgres, but MySQL has a query analyser that shows you how a query is executed and what's slowing it down, making it very easy to spot the slow key. If Postgres has an equivalent, we should use it and change the query / table accordingly.
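For reference, the Postgres counterpart to MySQL's analyser is EXPLAIN / EXPLAIN ANALYZE. A minimal sketch of capturing a plan from Python via sqlalchemy; the connection URL and the tiny IN list are illustrative, only the NT_Sample query shape is taken from the log below:

import sqlalchemy

# Hypothetical connection URL for a local test instance.
engine = sqlalchemy.create_engine("postgresql://lnt:password@127.0.0.1/lnt4")
query = 'SELECT * FROM "NT_Sample" WHERE "NT_Sample"."RunID" IN (1, 2, 3)'
with engine.connect() as conn:
    # EXPLAIN ANALYZE executes the query and reports the actual plan and
    # timings, one text row per plan line.
    for row in conn.execute("EXPLAIN ANALYZE " + query):
        print(row[0])

The same statement can of course be run directly in psql.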
I've set up a local postgres server to try and replicate the issues seen on
llvm.org/perf
So far, I'm inserting semi-artificial run reports into the database.
Importing the first run's data took about a second.
After importing about 1500 run reports, the import takes close to 15 seconds.
After configuring postgres to log all queries running for more than 250ms,
the following query shows up for each import of a run.
The IN-clause contains many runids. Maybe all runids in the database; or maybe
all runids for runs coming from the same machine?
After inserting data reflecting roughly 1500 LNT benchmarking runs, the query
returns 965700 rows. That does seem like a massive bandwidth bottleneck between
the postgres server and the LNT application.
2015-05-11 17:07:11 BST 127.0.0.1 5550d3ac.28df LOG: duration: 1511.224 ms
statement: SELECT "NT_Sample"."RunID" AS "NT_Sample_RunID",
"NT_Sample"."TestID" AS "NT_Sample_TestID", "NT_Sample".compile_status AS
"NT_Sample_compile_status", "NT_Sample".execution_status AS
"NT_Sample_execution_status", "NT_Sample".compile_time AS
"NT_Sample_compile_time", "NT_Sample".execution_time AS
"NT_Sample_execution_time", "NT_Sample".score AS "NT_Sample_score"
FROM "NT_Sample"
WHERE "NT_Sample"."RunID" IN (1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 416, 418, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 
740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 947, 948, 949, 950, 951, 952, 953, 954, 955, 957, 958, 959, 960, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092)
If my guess is correct that the IN list grows as the number of runs grows, then
we have a design problem.
What is the purpose of this select? We may not even need it, as it seems really
(*really*) generic.
The last few frames of the trace leading to the code that triggers the
long-running query are:
in lnt/server/db/testsuitedb.py:
...
    def _importSampleValues(self, tests_data, run, tag):
        ### adding samples
        lnt.server.db.fieldchange.regenerate_fieldchanges_for_run(self, run)
in lnt/server/db/fieldchange.py:
...
def regenerate_fieldchanges_for_run(ts, run):
    runs = ts.query(ts.Run).filter(ts.Run.order_id == run.order_id).all()
    previous_runs = ts.get_previous_runs_on_machine(run, 1)
    next_runs = ts.get_next_runs_on_machine(run, 1)
    ...
    # Load our run data for the creation of the new fieldchanges.
    runs_to_load = [r.id for r in (runs + previous_runs + next_runs)]
    runinfo = lnt.server.reporting.analysis.RunInfo(ts, runs_to_load)
    # At this point runs_to_load is seemingly filled with the runids
    # for all runs on that particular machine, leading to the big query
    # in the comment above.
in lnt/server/reporting/analysis.py:
class RunInfo(object):
    def __init__(self, testsuite, runs_to_load,
                 aggregation_fn=stats.safe_min, confidence_lv=.05):
        self.testsuite = testsuite
        self.aggregation_fn = aggregation_fn
        self.confidence_lv = confidence_lv
        self.sample_map = util.multidict()
        self.loaded_run_ids = set()
        # Here, it loads all samples from all runs_to_load...
        self._load_samples_for_runs(runs_to_load)
I'm not sure what the intended functionality of
regenerate_fieldchanges_for_run is.
I think field change objects are stored in the database, and are used to optimize the reporting processes. There should be one field change per detected regression.
I don't think we need all samples to generate them, just recent samples.
It's possible that the cost of adding new entries to the IN list is polynomial or even exponential, so keeping it small would be a good thing.
If the python module could list them and "tail 20" them, I think that would move times back into the microseconds again.
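To make the "tail 20" idea concrete, here is a hypothetical sketch (not LNT code) of how the run ids passed to RunInfo could be capped before the big IN query is built; the constant and the helper name are made up, and it assumes a higher run id means a more recent run:

MAX_RUNS_TO_LOAD = 20  # made-up cap for illustration

def tail_runs_to_load(runs, previous_runs, next_runs):
    # Collect the candidate run ids the same way the current code does...
    run_ids = [r.id for r in (runs + previous_runs + next_runs)]
    # ...but keep only the most recent few (higher id assumed newer).
    return sorted(set(run_ids))[-MAX_RUNS_TO_LOAD:]

# runinfo = lnt.server.reporting.analysis.RunInfo(
#     ts, tail_runs_to_load(runs, previous_runs, next_runs))

Whether 20 is the right number, and whether recency by id is the right criterion, would need checking against what the field change detection actually needs.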
I wonder if the N parameters are not working correctly? The number of runs it should find is 1 + 1 + the runs on the current rev. That should not be very many.
And what should be "runs on current rev"?
From Kristof's dump, it's about 1000 entries!
I think that should be all the runs with the same rev as the run just submitted. That should not be that many; in most cases under 10, I'd guess. It would depend on how many machines are submitting at that rev.
Ah, Kristof mentioned that he was submitting the same json over and over again, so it would have the same rev every time. So the slowdown is expected in that case. LNT will use all the data at the current rev in the field change calculation.
Now that I think of it, we had a bug in the rev naming which was causing all of our runs to get the same rev, and that is when our LNT server was hanging.
I have added a warning to the code in r237057 since I think this is almost always an error.
So, back to the original exception... why is the public server running out of DB connections?
"Why is the server running out of DB connections?"
-> I don't know.
I couldn't reproduce the "TimeoutError: QueuePool limit of size
5 overflow 10 reached, connection timed out, timeout 30"
error on my local setup.
However, I've instrumented sqlalchemy with a few print-debug
statements to see if sqlalchemy sessions did get closed or not
when using the LNT website.
It's clear that the sessions are not closed. Therefore, my best
guess at the moment is that there may be a rarely used code path,
triggered by some web views, that results in the sessions retaining
a connection from the connection pool.
It does seem that on the few webpages I've tried, the connection
used by a session got returned to the pool, even if the session
wasn't closed. ...or maybe this happens when a request times out...
Adding the following code does result in sessions being closed
when a web request is completed. I'm not at all sure, however,
whether doing this in a __del__ method is the right way to do it.
diff --git a/lnt/server/db/v4db.py b/lnt/server/db/v4db.py
index 9e61b49..b4796c9 100644
--- a/lnt/server/db/v4db.py
+++ b/lnt/server/db/v4db.py
@@ -151,6 +151,10 @@ class V4DB(object):
         return sum([ts.query(ts.Test).count()
                     for ts in self.testsuite.values()])
 
+    def close(self):
+        if self.session is not None:
+            self.session.close()
+
     def importDataFromDict(self, data, config=None):
         # Select the database to import into.
         #
diff --git a/lnt/server/ui/app.py b/lnt/server/ui/app.py
index 1660100..e86886d 100644
--- a/lnt/server/ui/app.py
+++ b/lnt/server/ui/app.py
@@ -81,6 +83,11 @@ class Request(flask.Request):
         return self.testsuite
 
+    def __del__(self):
+        print "Request.__del__ called()"
+        if self.db is not None:
+            self.db.close()
+
 class App(flask.Flask):
     @staticmethod
     def create_with_instance(instance):
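For comparison, the usual Flask way to release per-request resources is a teardown hook rather than relying on __del__ being invoked by the garbage collector. A sketch, assuming the request object exposes the db attribute and the close() method added in the diff above:

import flask

def register_db_cleanup(app):
    @app.teardown_request
    def close_db_session(exception=None):
        # The request context is still available while teardown handlers run.
        db = getattr(flask.request, 'db', None)
        if db is not None:
            db.close()

This is only an alternative shape for the same idea; the diff above is what was actually tested.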
(In reply to comment #11)
> "Why is the server running out of DB connections?"
> -> I don't know.
>
> I couldn't reproduce the "TimeoutError: QueuePool limit of size
> 5 overflow 10 reached, connection timed out, timeout 30"
> error on my local setup.
>
> However, I've instrumented sqlalchemy with a few print-debug
> statements to see if sqlalchemy sessions did get closed or not
> when using the LNT website.
> It's clear that the sessions are not closed. Therefore, my best
> guess at the moment is that there may be a rarely used code path,
> triggered by some web views, that results in the sessions retaining
> a connection from the connection pool.
> It does seem that on the few webpages I've tried, the connection
> used by a session got returned to the pool, even if the session
> wasn't closed. ...or maybe this happens when a request times out...
>
> Adding the following code does result in sessions being closed
> when a web request is completed. I'm not at all sure, however,
> whether doing this in a __del__ method is the right way to do it.
It does not seem like a big hack and probably also won't hurt. I would almost
propose to just commit it and see if it helps.
On Monday morning I also moved my own LNT testers to a private server, and
llvm.org/perf is now back online (correlated?). I want to note that I have been
running these testers for a very long time (and they report a run every 5
commits). At the beginning llvm.org/perf was reasonably stable, and only over
time did it seem to get slower. Maybe the problematic behaviour only shows up
on really large databases?
Renato, any news from the error log? Have there been any errors since the
buildbot restart last night (and my LNT buildslaves being moved away)?
Interestingly, even though llvm.org/perf was offline for a long time, it seems
we have many (all?) of the performance results in the db.
Pretty consistent, ~6 errors / hour until 3 hours ago. Since then, nothing.
Did you add the close and restart?
Here's an example:
2015-05-12 06:30:16,721 ERROR: Exception on /db_default/v4/nts/20207 [GET] [in
/opt/venv/perf/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py:1423]
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed
out, timeout 30
I finally managed to reproduce the issue on a local setup using a postgres database as the backend. I haven't tried reproducing this with a sqlite backend.
One of the keys to reproducing this is to run the server multi-threaded, so multiple requests can be processed concurrently:
lnt runserver --threaded ./lnt_server_dirs/
Then issuing concurrent web requests to the running server quickly results in the error message being produced, and the server starting to return error 500:
$ for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9; do echo $i; wget http://127.0.0.1:8000/db_default/v4/nts/machine/1 & done; done
This was on an instance with about 10K runs submitted, all from a single machine.
After running the above command line once or twice, I started seeing the TimeoutError, even with the __del__ method that closes the session (see my comment above) in place :(
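For anyone who prefers a script over the shell loop, here is a rough Python equivalent of the reproduction, firing 200 concurrent GETs at a local instance; the URL matches the wget loop above, everything else is illustrative:

import threading
import urllib2

URL = "http://127.0.0.1:8000/db_default/v4/nts/machine/1"

def fetch():
    try:
        urllib2.urlopen(URL, timeout=60).read()
    except Exception as e:
        print("request failed: %s" % e)

threads = [threading.Thread(target=fetch) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()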
(In reply to comment #14)
> $ for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9; do echo $i;
> wget http://127.0.0.1:8000/db_default/v4/nts/machine/1 & done; done
Try 'ab' for that...
http://httpd.apache.org/docs/2.2/programs/ab.html
It's *really* good. :)
cheers,
--renato
(In reply to comment #13)
> Pretty consistent, ~6 errors / hour until 3 hours ago. Since then, nothing.
Interesting.
> Did you add the close and restarted?
The LNT server is restarted/reloaded on each commit to LNT. Kristof's patch was
not yet committed.
I only asked for a buildbot restart to stop my buildslaves from submitting to
llvm.org/perf.
> Here's an example:
>
> 2015-05-12 06:30:16,721 ERROR: Exception on /db_default/v4/nts/20207 [GET]
> [in
> /opt/venv/perf/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.
> py:1423]
> TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
> timed out, timeout 30
I did not cross-check the exact timing, but there have indeed been submissions
that came in close enough together to probably be processed at the same time:
236969  2015-05-11 05:57:58  5:18:27  x86_64-penryn-O3-polly-detect-and-dependences-only__clang_DEV__x86_64:38  View Results
236969  2015-05-11 05:57:57  5:18:18  x86_64-penryn-O3-polly-detect-only__clang_DEV__x86_64:37  View Results
@Kristof: I have the feeling you are on the right track, and it's great we can
reproduce this outside of our production environment.
With a freshly created postgres database containing 1 json file for an LNT
test-suite run, I couldn't reproduce the error with the wget line in the
comments above. However, by requesting a web page that presumably takes a bit
more time to construct, I could reproduce the error:
for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do echo $i; wget -q -O /dev/null http:///db_default/v4/nts/global_status & done; done
This was run on a database created by running the following 3 commands first:
lnt create --db-dir=postgresql://lnt:password@127.0.0.1/ --default-db=lnt4 --show-sql ./lnt_server_dirs4/
lnt import --commit=1 ./lnt_server_dirs4/ 28.json
lnt runserver --threaded --show-db-pool --hostname=127.0.0.1 ./lnt_server_dirs4/
See http://docs.sqlalchemy.org/en/latest/core/pooling.html for background info.
The bottom line of why I was able to reproduce the error is that the for loops
with the wget command line in the comments above create 200 concurrent web
requests. The default connection pool implementation in sqlalchemy will use at
most 15 different connections (so at most 15 web requests can be processed at
once). The default time a request waits when no more connections are available
in the pool is 30 seconds. So, as long as not all 200 concurrent requests
complete within 30 seconds, some of them will time out with the error message:
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed
out
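Those numbers are simply sqlalchemy's QueuePool defaults. Spelling them out explicitly (a sketch; LNT itself just relies on the defaults, and the URL is illustrative):

import sqlalchemy

# 5 pooled connections + 10 overflow = at most 15 connections per process;
# a 16th request waits up to 30 seconds and then raises the TimeoutError
# quoted above.
engine = sqlalchemy.create_engine(
    "postgresql://lnt:password@127.0.0.1/lnt4",
    pool_size=5,
    max_overflow=10,
    pool_timeout=30)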
I'm wondering how we sometimes end up with more than 15 concurrent requests on
llvm.org/perf. At the moment, not that many users submit runs - which is the
one long-ish running request type?
BTW, invoking create_engine to allow an unlimited number of connections to the
database also doesn't work, as then you can easily run into the maximum number
of database connections the postgres server allows:
V4DB._engine[path] = sqlalchemy.create_engine(path, echo=echo, max_overflow=-1)
lets me trigger the following error:
OperationalError: (psycopg2.OperationalError) FATAL: remaining connection
slots are reserved for non-replication superuser connections
So, I guess the best setting for the number of concurrent database connections
depends on how many concurrent connections the underlying postgres server can
handle? Furthermore, it seems a mechanism should be used to avoid starting more
than that number of concurrent web requests. With the builtin
"lnt runserver --threaded" command, the number of concurrent Request objects
that are created does not seem to be limited in any way. Would limiting that on
llvm.org/perf be a solution?
We did have an issue a while back where there was a deadlock in the submission code. I wonder if we are slowly deadlocking threads when submissions come in close together.
I assume that llvm.org/perf is run with mod_wsgi, not the builtin web server. In that case it will be dependent on how the server is configured:
https://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
Also, I benchmarked our server yesterday. We do have some long requests. Submissions take 10s-15s. If there are enough happening all at once it could lock out the threads.
Daniel has kindly provided us with the mod_wsgi config:
# WSGI configuration for LLVM nightly test.
<IfModule mod_wsgi.c>
    # The application entry point.
    WSGIScriptAlias /perf "/opt/perf/lnt.wsgi"
    # Run in daemon mode.
    WSGIDaemonProcess llvm.org processes=2 threads=15 maximum-requests=10000
    WSGIProcessGroup llvm.org
</IfModule>
There is also the following documentation on http://docs.sqlalchemy.org/en/latest/core/pooling.html#using-connection-pools-with-multiprocessing
"It’s critical that when using a connection pool, and by extension when using an Engine created via create_engine(), that the pooled connections are not shared to a forked process."
Not sure if this applies.
In the case of gunicorn, the connections are set up after the fork. I don't know how mod_wsgi does it; I assume it is the same.
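If it does apply, the recipe from those sqlalchemy docs is to discard inherited pooled connections in each forked worker. A hypothetical sketch using gunicorn's post_fork server hook (mod_wsgi would need its own equivalent); V4DB._engine is the dict referenced in the create_engine line quoted earlier in this thread:

# In a gunicorn config file, e.g. gunicorn_conf.py.
def post_fork(server, worker):
    from lnt.server.db import v4db
    for engine in v4db.V4DB._engine.values():
        # Drop connections inherited from the parent process; new ones
        # will be opened lazily in this worker.
        engine.dispose()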
I recommend against increasing the timeout or the number of concurrent connections. That will only make the problem worse, with each process hanging the rest of the machine for minutes or hours.
There are two problems:
http://www.heatware.net/databases/how-to-find-log-slow-queries-postgresql/
Once we have that log, let it roll for a while, then pick up the slow queries and run them with EXPLAIN. That should give us the reasons why they're slow, and we will be able to fix them.
Bots take at least 30 minutes to come back around and submit again, so we'd have to have many more bots than we have now to hit a timeout problem.
I vote for logging the slow queries first.
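As a concrete starting point, and matching the 250ms threshold Kristof already used locally, slow-query logging can be switched on from a superuser session. A sketch; the connection URL is illustrative, and ALTER SYSTEM needs Postgres 9.4+ (older servers need the setting in postgresql.conf instead):

import sqlalchemy

# ALTER SYSTEM must run outside a transaction, hence AUTOCOMMIT.
engine = sqlalchemy.create_engine(
    "postgresql://postgres:password@127.0.0.1/lnt4",
    isolation_level="AUTOCOMMIT")
with engine.connect() as conn:
    conn.execute("ALTER SYSTEM SET log_min_duration_statement = 250")
    conn.execute("SELECT pg_reload_conf()")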
As an aside, and just to complicate things, in r237188 I added the ability to pass more than one --submit flag. This allows results to be sent to more than one server, so now you don't have to choose. If one server fails, LNT still submits to the others, and it only dies if all submissions fail.
I am going to allow the green dragon perf bots to submit to llvm.org/perf, and everyone is welcome to submit to http://llvm-lnt.herokuapp.com/db_default/submitRun.
http://llvm.org/viewvc/llvm-project?rev=248789&view=rev makes sure that
database connections do get closed explicitly when they are no longer
going to be used.
The ability to run the LNT unit tests on Postgres, introduced in r24878,
demonstrated that the V4Pages unit test, when run on Postgres, consistently
triggered the following error message:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30
r248789 eliminates this error message by closing database connections at the
end of their intended life, which for webui pages is the end of the web
request.
My imperfect understanding is that the above error message should now only
occur on a web server when there are very many concurrent web requests going on.
Maybe this doesn't make a difference at all on an actual web server - depending
on what the web server-specific middleware does with database connections.
Anyway, closing the database connections was necessary when running the unit
tests on Postgres - and it may or may not ameliorate the situation on
production web servers.
Hi Kristof,
I'm still seeing a lot of timeout messages. Maybe the patch needs to be deployed with an LNT restart? I'm not sure how that works; if someone knows how, please do so.