Open Quuxplusone opened 9 years ago
Bugzilla Link | PR23474 |
Status | NEW |
Importance | P normal |
Reported by | Renato Golin (rengolin@gmail.com) |
Reported on | 2015-05-11 05:05:29 -0700 |
Last modified on | 2016-07-11 00:12:49 -0700 |
Version | unspecified |
Hardware | PC Linux |
CC | chris.matthews@apple.com, daniel@zuster.org, danielgb@au1.ibm.com, kristof.beyls@arm.com, kristof.beyls@gmail.com, rengolin@gmail.com, tobias@grosser.es, tonic@nondot.org |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
Daniel,
When database insert/update queries take too long, it's generally because there are too many indexes. With MySQL the rule is generally "less is more", but Postgres tends to be heavy-handed with foreign keys and unique columns.
Another thing that makes write queries slow is the number of columns / accessory tables linked to the table you're updating. If we could simplify the tables and relax the indexes a bit, we might get back to instantaneous updates.
I don't know much about Postgres, but MySQL has a query analyser that shows you how a query is executed and what's slowing it down, making it very easy to spot the slow key. If Postgres has an equivalent, we should use it and change the query / table accordingly.
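For reference, the Postgres counterpart to MySQL's analyser is EXPLAIN / EXPLAIN ANALYZE. A minimal sketch of capturing a plan from Python via sqlalchemy; the connection URL and the tiny IN list are illustrative, only the NT_Sample query shape is taken from the log below:

import sqlalchemy

# Hypothetical connection URL for a local test instance.
engine = sqlalchemy.create_engine("postgresql://lnt:password@127.0.0.1/lnt4")
query = 'SELECT * FROM "NT_Sample" WHERE "NT_Sample"."RunID" IN (1, 2, 3)'
with engine.connect() as conn:
    # EXPLAIN ANALYZE executes the query and reports the actual plan and
    # timings, one text row per plan line.
    for row in conn.execute("EXPLAIN ANALYZE " + query):
        print(row[0])

The same statement can of course be run directly in psql.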
I've set up a local postgres server to try and replicate the issues seen on
llvm.org/perf
So far, I'm inserting semi-artificial run reports into the database.
Importing the first run's data took about a second.
After importing about 1500 run reports, the import takes close to 15 seconds.
After configuring postgres to log all queries running for more than 250ms,
the following query shows up for each import of a run.
The IN-clause contains many runids. Maybe all runids in the database; or maybe
all runids for runs coming from the same machine?
After inserting data reflecting roughly 1500 LNT benchmarking runs, the query
returns 965700 rows. That does seem like a massive bandwidth bottleneck between
the postgres server and the LNT application.
2015-05-11 17:07:11 BST 127.0.0.1 5550d3ac.28df LOG: duration: 1511.224 ms
statement: SELECT "NT_Sample"."RunID" AS "NT_Sample_RunID",
"NT_Sample"."TestID" AS "NT_Sample_TestID", "NT_Sample".compile_status AS
"NT_Sample_compile_status", "NT_Sample".execution_status AS
"NT_Sample_execution_status", "NT_Sample".compile_time AS
"NT_Sample_compile_time", "NT_Sample".execution_time AS
"NT_Sample_execution_time", "NT_Sample".score AS "NT_Sample_score"
FROM "NT_Sample"
WHERE "NT_Sample"."RunID" IN (1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 416, 418, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 
740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 947, 948, 949, 950, 951, 952, 953, 954, 955, 957, 958, 959, 960, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092)
If my guess is correct that the IN list grows as the number of runs grows, then
we have a design problem.
What is the purpose of this select? We may not even need it, as it seems really
(*really*) generic.
The last few frames of the trace leading to the code that triggers the
long-running query are:
in lnt/server/db/testsuitedb.py:
...
    def _importSampleValues(self, tests_data, run, tag):
        ### adding samples
        lnt.server.db.fieldchange.regenerate_fieldchanges_for_run(self, run)
in lnt/server/db/fieldchange.py:
...
def regenerate_fieldchanges_for_run(ts, run):
    runs = ts.query(ts.Run).filter(ts.Run.order_id == run.order_id).all()
    previous_runs = ts.get_previous_runs_on_machine(run, 1)
    next_runs = ts.get_next_runs_on_machine(run, 1)
    ...
    # Load our run data for the creation of the new fieldchanges.
    runs_to_load = [r.id for r in (runs + previous_runs + next_runs)]
    runinfo = lnt.server.reporting.analysis.RunInfo(ts, runs_to_load)
    # At this point runs_to_load is seemingly filled with the runids
    # for all runs on that particular machine, leading to the big query
    # in the comment above.
in lnt/server/reporting/analysis.py:
class RunInfo(object):
    def __init__(self, testsuite, runs_to_load,
                 aggregation_fn=stats.safe_min, confidence_lv=.05):
        self.testsuite = testsuite
        self.aggregation_fn = aggregation_fn
        self.confidence_lv = confidence_lv
        self.sample_map = util.multidict()
        self.loaded_run_ids = set()
        # Here, it loads all samples from all runs_to_load...
        self._load_samples_for_runs(runs_to_load)
I'm not sure what the intended functionality of
regenerate_fieldchanges_for_run is.
I think field change objects are stored in the database, and are used to optimize the reporting processes. There should be one field change per detected regression.
I don't think we need all samples to generate them, just recent samples.
It's possible that the cost of adding new entries to the IN list is polynomial or even exponential, so keeping it small would be a good thing.
If the python module could list them and "tail 20" them, I think that would move times back into the microseconds again.
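To make the "tail 20" idea concrete, here is a hypothetical sketch (not LNT code) of how the run ids passed to RunInfo could be capped before the big IN query is built; the constant and the helper name are made up, and it assumes a higher run id means a more recent run:

MAX_RUNS_TO_LOAD = 20  # made-up cap for illustration

def tail_runs_to_load(runs, previous_runs, next_runs):
    # Collect the candidate run ids the same way the current code does...
    run_ids = [r.id for r in (runs + previous_runs + next_runs)]
    # ...but keep only the most recent few (higher id assumed newer).
    return sorted(set(run_ids))[-MAX_RUNS_TO_LOAD:]

# runinfo = lnt.server.reporting.analysis.RunInfo(
#     ts, tail_runs_to_load(runs, previous_runs, next_runs))

Whether 20 is the right number, and whether recency by id is the right criterion, would need checking against what the field change detection actually needs.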
I wonder if the N parameters are not working correctly? The number of runs it should find is 1 + 1 + the runs on the current rev. That should not be very many.
And what should be "runs on current rev"?
From Kristof's dump, it's about 1000 entries!
I think that should be all the runs with the same rev as the run just submitted. That should not be that many; in most cases under 10, I'd guess. It would depend on how many machines are submitting at that rev.
Ah, Kristof mentioned that he was submitting the same json over and over again, so it would have the same rev every time. So the slowdown is expected in that case. LNT will use all the data at the current rev in the field change calculation.
Now that I think of it, we had a bug in the rev naming which was causing all of our runs to get the same rev, and that is when our LNT server was hanging.
I have added a warning to the code in r237057 since I think this is almost always an error.
So, back to the original exception... why is the public server running out of DB connections?
"Why is the server running out of DB connections?"
-> I don't know.
I couldn't reproduce the "TimeoutError: QueuePool limit of size
5 overflow 10 reached, connection timed out, timeout 30"
error on my local setup.
However, I've instrumented sqlalchemy with a few print-debug
statements to see if sqlalchemy sessions did get closed or not
when using the LNT website.
It's clear that the sessions are not closed. Therefore, my best
guess at the moment is that there may be a rarely used code path,
triggered by some web views, that results in the sessions retaining
a connection from the connection pool.
It does seem that on the few webpages I've tried, the connection
used by a session got returned to the pool, even if the session
wasn't closed. ...or maybe this happens when a request times out...
Adding the following code does result in sessions being closed
when a web request is completed. I'm not at all sure, however,
whether doing this in a __del__ method is the right way to do it.
diff --git a/lnt/server/db/v4db.py b/lnt/server/db/v4db.py
index 9e61b49..b4796c9 100644
--- a/lnt/server/db/v4db.py
+++ b/lnt/server/db/v4db.py
@@ -151,6 +151,10 @@ class V4DB(object):
         return sum([ts.query(ts.Test).count()
                     for ts in self.testsuite.values()])
 
+    def close(self):
+        if self.session is not None:
+            self.session.close()
+
     def importDataFromDict(self, data, config=None):
         # Select the database to import into.
         #
diff --git a/lnt/server/ui/app.py b/lnt/server/ui/app.py
index 1660100..e86886d 100644
--- a/lnt/server/ui/app.py
+++ b/lnt/server/ui/app.py
@@ -81,6 +83,11 @@ class Request(flask.Request):
         return self.testsuite
 
+    def __del__(self):
+        print "Request.__del__ called()"
+        if self.db is not None:
+            self.db.close()
+
 class App(flask.Flask):
     @staticmethod
     def create_with_instance(instance):
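For comparison, the usual Flask way to release per-request resources is a teardown hook rather than relying on __del__ being invoked by the garbage collector. A sketch, assuming the request object exposes the db attribute and the close() method added in the diff above:

import flask

def register_db_cleanup(app):
    @app.teardown_request
    def close_db_session(exception=None):
        # The request context is still available while teardown handlers run.
        db = getattr(flask.request, 'db', None)
        if db is not None:
            db.close()

This is only an alternative shape for the same idea; the diff above is what was actually tested.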
(In reply to comment #11)
> "Why is the server running out of DB connections?"
> -> I don't know.
>
> I couldn't reproduce the "TimeoutError: QueuePool limit of size
> 5 overflow 10 reached, connection timed out, timeout 30"
> error on my local setup.
>
> However, I've instrumented sqlalchemy with a few print-debug
> statements to see if sqlalchemy sessions did get closed or not
> when using the LNT website.
> It's clear that the sessions are not closed. Therefore, my best
> guess at the moment is that there may be a rarely used code path,
> triggered by some web views, that results in the sessions retaining
> a connection from the connection pool.
> It does seem that on the few webpages I've tried, the connection
> used by a session got returned to the pool, even if the session
> wasn't closed. ...or maybe this happens when a request times out...
>
> Adding the following code does result in sessions being closed
> when a web request is completed. I'm not at all sure, however,
> whether doing this in a __del__ method is the right way to do it.
It does not seem like a big hack and probably also won't hurt. I would almost
propose to just commit it and see if it helps.
On Monday morning I also moved my own LNT testers to a private server, and
llvm.org/perf is now back online (correlated?). I want to note that I have been
running these testers for a very long time (and they report a run every 5
commits). At the beginning llvm.org/perf was reasonably stable, and only over
time did it seem to get slower. Maybe the problematic behaviour only shows up
on really large databases?
Renato, any news from the error log? Have there been any errors since the
buildbot restart last night (and my LNT buildslaves being moved away)?
Interestingly, even though llvm.org/perf was offline for a long time, it seems
we have many (all?) of the performance results in the db.
Pretty consistent, ~6 errors / hour until 3 hours ago. Since then, nothing.
Did you add the close and restart?
Here's an example:
2015-05-12 06:30:16,721 ERROR: Exception on /db_default/v4/nts/20207 [GET] [in
/opt/venv/perf/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py:1423]
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed
out, timeout 30
I finally managed to reproduce the issue on a local setup using a postgres database as the backend. I haven't tried reproducing this with a sqlite backend.
One of the keys to reproducing this is to run the server multi-threaded, so multiple requests can be processed concurrently:
lnt runserver --threaded ./lnt_server_dirs/
Then issuing concurrent web requests to the running server quickly results in the error message being produced, and the server starting to return error 500:
$ for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9; do echo $i; wget http://127.0.0.1:8000/db_default/v4/nts/machine/1 & done; done
This was on an instance with about 10K runs submitted, all from a single machine.
After running the above command line once or twice, I started seeing the TimeoutError, even with the __del__ method that closes the session (see my comment above) in place :(
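For anyone who prefers a script over the shell loop, here is a rough Python equivalent of the reproduction, firing 200 concurrent GETs at a local instance; the URL matches the wget loop above, everything else is illustrative:

import threading
import urllib2

URL = "http://127.0.0.1:8000/db_default/v4/nts/machine/1"

def fetch():
    try:
        urllib2.urlopen(URL, timeout=60).read()
    except Exception as e:
        print("request failed: %s" % e)

threads = [threading.Thread(target=fetch) for _ in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()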
(In reply to comment #14)
> $ for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9; do echo $i;
> wget http://127.0.0.1:8000/db_default/v4/nts/machine/1 & done; done
Try 'ab' for that...
http://httpd.apache.org/docs/2.2/programs/ab.html
It's *really* good. :)
cheers,
--renato
(In reply to comment #13)
> Pretty consistent, ~6 errors / hour until 3 hours ago. Since then, nothing.
Interesting.
> Did you add the close and restarted?
The LNT server is restarted/reloaded on each commit to LNT. Kristof's patch was
not yet committed.
I only asked for a buildbot restart to stop my buildslaves from submitting to
llvm.org/perf.
> Here's an example:
>
> 2015-05-12 06:30:16,721 ERROR: Exception on /db_default/v4/nts/20207 [GET]
> [in
> /opt/venv/perf/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.
> py:1423]
> TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
> timed out, timeout 30
I did not cross-check the exact timing, but there have indeed been submissions
that came in close enough together to probably be processed at the same time:
236969  2015-05-11 05:57:58  5:18:27  x86_64-penryn-O3-polly-detect-and-dependences-only__clang_DEV__x86_64:38  View Results
236969  2015-05-11 05:57:57  5:18:18  x86_64-penryn-O3-polly-detect-only__clang_DEV__x86_64:37  View Results
@Kristof: I have the feeling you are on the right track, and it's great we can
reproduce this outside of our production environment.
With a freshly created postgres database containing 1 json file for an LNT
test-suite run, I couldn't reproduce the error with the wget line in the
comments above. However, by requesting a web page that presumably takes a bit
more time to construct, I could reproduce the error:
for j in 1 2 3 4 5 6 7 8 9 10; do for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do echo $i; wget -q -O /dev/null http:///db_default/v4/nts/global_status & done; done
This was run on a database created by running the following 3 commands first:
lnt create --db-dir=postgresql://lnt:password@127.0.0.1/ --default-db=lnt4 --show-sql ./lnt_server_dirs4/
lnt import --commit=1 ./lnt_server_dirs4/ 28.json
lnt runserver --threaded --show-db-pool --hostname=127.0.0.1 ./lnt_server_dirs4/
See http://docs.sqlalchemy.org/en/latest/core/pooling.html for background info.
The bottom line of why I was able to reproduce the error is that the for loops
with the wget command line in the comments above create 200 concurrent web
requests. The default connection pool implementation in sqlalchemy will use at
most 15 different connections (so at most 15 web requests can be processed at
once). The default time a request waits when no more connections are available
in the pool is 30 seconds. So, as long as not all 200 concurrent requests
complete within 30 seconds, some of them will time out with the error message:
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed
out
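Those numbers are simply sqlalchemy's QueuePool defaults. Spelling them out explicitly (a sketch; LNT itself just relies on the defaults, and the URL is illustrative):

import sqlalchemy

# 5 pooled connections + 10 overflow = at most 15 connections per process;
# a 16th request waits up to 30 seconds and then raises the TimeoutError
# quoted above.
engine = sqlalchemy.create_engine(
    "postgresql://lnt:password@127.0.0.1/lnt4",
    pool_size=5,
    max_overflow=10,
    pool_timeout=30)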
I'm wondering how we sometimes end up with more than 15 concurrent requests on
llvm.org/perf. At the moment, not that many users submit runs - which is the
one long-ish running request type?
BTW, invoking create_engine to allow an unlimited number of connections to the
database also doesn't work, as then you can easily run into the maximum number
of database connections the postgres server allows:
V4DB._engine[path] = sqlalchemy.create_engine(path, echo=echo, max_overflow=-1)
lets me trigger the following error:
OperationalError: (psycopg2.OperationalError) FATAL: remaining connection
slots are reserved for non-replication superuser connections
So, I guess the best setting for the number of concurrent database connections
depends on how many concurrent connections the underlying postgres server can
handle? Furthermore, it seems a mechanism should be used to avoid starting more
than that number of concurrent web requests. With the builtin
"lnt runserver --threaded" command, the number of concurrent Request objects
that are created does not seem to be limited in any way. Would limiting that on
llvm.org/perf be a solution?
We did have an issue a while back where there was a deadlock in the submission code. I wonder if we are slowly deadlocking threads when submissions come in close together.
I assume that llvm.org/perf is run with mod_wsgi, not the builtin web server. In that case it will be dependent on how the server is configured:
https://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
Also, I benchmarked our server yesterday. We do have some long requests. Submissions take 10s-15s. If there are enough happening all at once it could lock out the threads.
Daniel has kindly provided us with the mod_wsgi config:
# WSGI configuration for LLVM nightly test.
<IfModule mod_wsgi.c>
    # The application entry point.
    WSGIScriptAlias /perf "/opt/perf/lnt.wsgi"
    # Run in daemon mode.
    WSGIDaemonProcess llvm.org processes=2 threads=15 maximum-requests=10000
    WSGIProcessGroup llvm.org
</IfModule>
There is also the following documentation on http://docs.sqlalchemy.org/en/latest/core/pooling.html#using-connection-pools-with-multiprocessing
"It’s critical that when using a connection pool, and by extension when using an Engine created via create_engine(), that the pooled connections are not shared to a forked process."
Not sure if this applies.
In the case of gunicorn, the connections are set up after the fork. I don't know how mod_wsgi does it; I assume it is the same.
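If it does apply, the recipe from those sqlalchemy docs is to discard inherited pooled connections in each forked worker. A hypothetical sketch using gunicorn's post_fork server hook (mod_wsgi would need its own equivalent); V4DB._engine is the dict referenced in the create_engine line quoted earlier in this thread:

# In a gunicorn config file, e.g. gunicorn_conf.py.
def post_fork(server, worker):
    from lnt.server.db import v4db
    for engine in v4db.V4DB._engine.values():
        # Drop connections inherited from the parent process; new ones
        # will be opened lazily in this worker.
        engine.dispose()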
I recommend against increasing the timeout or the number of concurrent connections. That will only make the problem worse, with each process hanging the rest of the machine for minutes or hours.
There are two problems:
http://www.heatware.net/databases/how-to-find-log-slow-queries-postgresql/
Once we have that log, let it roll for a while, then pick up the slow queries and run them with EXPLAIN. That should give us the reasons why they're slow, and we will be able to fix them.
Bots take at least 30 minutes to come back around and submit again, so we'd have to have many more bots than we have now to hit a timeout problem.
I vote for logging the slow queries first.
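As a concrete starting point, and matching the 250ms threshold Kristof already used locally, slow-query logging can be switched on from a superuser session. A sketch; the connection URL is illustrative, and ALTER SYSTEM needs Postgres 9.4+ (older servers need the setting in postgresql.conf instead):

import sqlalchemy

# ALTER SYSTEM must run outside a transaction, hence AUTOCOMMIT.
engine = sqlalchemy.create_engine(
    "postgresql://postgres:password@127.0.0.1/lnt4",
    isolation_level="AUTOCOMMIT")
with engine.connect() as conn:
    conn.execute("ALTER SYSTEM SET log_min_duration_statement = 250")
    conn.execute("SELECT pg_reload_conf()")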
As an aside, and just to complicate things, in r237188 I added the ability to pass more than one --submit flag. This allows results to be sent to more than one server, so now you don't have to choose. If one server fails, LNT still submits to the others, and it only dies if all submissions fail.
I am going to allow the green dragon perf bots to submit to llvm.org/perf, and everyone is welcome to submit to http://llvm-lnt.herokuapp.com/db_default/submitRun.
http://llvm.org/viewvc/llvm-project?rev=248789&view=rev makes sure that
database connections do get closed explicitly when they are no longer
going to be used.
The ability to run the LNT unit tests on Postgres, introduced in r24878,
demonstrated that the V4Pages unit test, when run on Postgres, consistently
triggered the following error message:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30
r248789 eliminates this error message by closing database connections at the
end of their intended life, which for webui pages is the end of the web
request.
My imperfect understanding is that the above error message should now only
occur on a web server when there are very many concurrent web requests going on.
Maybe this doesn't make a difference at all on an actual web server - depending
on what the web server-specific middleware does with database connections.
Anyway, closing the database connections was necessary when running the unit
tests on Postgres - and it may or may not ameliorate the situation on
production web servers.
Hi Kristof,
I'm still seeing a lot of timeout messages. Maybe the patch needs to be deployed with an LNT restart? I'm not sure how that works; if someone knows how, please do so.