Open Quuxplusone opened 7 years ago
Attached traceback.txt
(4048 bytes, text/plain): full traceback
With only a very rudimentary understanding of the LNT codebase, I have a few
observations.
This issue (crash caused by competition for the database lock) shows up when
the processing time for an import is increased.
Bisecting for the sources of increased import time, and therefore increased
chance of collision, I found that the most expensive change was r299139:
Don’t use flask to fetch DB info
When running from the command line, flask is not always active. To run
background workers, take the DBConfig object directly, and use that
to setup the DB connection in the workers.
I found that after this patch, some imports into a large SQLite database were
taking more than 30 seconds (up from 0.5 seconds with this patch reverted).
What I've observed is that after this patch the "RIs" line is printed, while it
wasn't before. Calculating the RIs seems to be the thing which takes the most
time. Walking back up the call chain, before this patch we never executed any
of identify_related_changes, regenerate_fieldchanges_for_run or
post_submit_tasks. That is to say, the call to async_fieldchange_calc was
always failing before doing any work.
Digging further into that, the reason for the change in behaviour after your
patch is obvious: we were previously crashing fast with the call into flask:
Subprocess failed with:Traceback (most recent call last):
File ".../lnt/util/async_ops.py", line 136, in async_wrapper
_v4db = current_app.old_config.get_database(ts_args['db'])
File ".../lib/python2.7/site-packages/werkzeug/local.py", line 343, in __getattr__
return getattr(self._get_current_object(), name)
File ".../lib/python2.7/site-packages/werkzeug/local.py", line 302, in _get_current_object
return self.__local()
File ".../lib/python2.7/site-packages/flask/globals.py", line 34, in _find_app
raise RuntimeError('working outside of application context')
RuntimeError: working outside of application context
That gives us a good explanation for the slowdown! Previously we were simply
not executing a huge chunk of code; now we are!
However, while obviously incorrect, the behaviour we had before your patch
worked just fine for our workflow, and allowed us to quickly import huge
numbers of runs (>3000). After your change, with some runs taking upwards of 30
seconds, that's no longer feasible. I appreciate that we are using LNT wrong,
but what are the extra benefits we get from the calculations in
identify_related_changes? Would it be feasible to skip calculating them if we
are in a bulk import mode where we are less interested in run-to-run
information, and expect to pick that up from the web interface in future?
Chris, do you have any feedback or ideas?
It is intended that there is a lot of work done there. That is the reason it is done in an async background worker. I assume the issue you currently get is that too many workers are running? The submission itself is not gated on the workers finishing, but the LNT server will continue to churn for a while after the run submission.
I'd recommend you try again. I have made a number of major changes to that code path in the last two weeks. We were having server scale issues on our larger LNT servers, so I spent some time to optimize those code paths. They make significantly fewer queries now, and process much less data, and make fewer db commits.
The RIs that are being calculated are intended to be cached versions of the changes LNT usually calculates when you look at the full run comparison. The problem they solve is that when you want to list many runs on a page, you cannot be fetching hundreds of runs with thousands of samples on page load just to calculate the deltas. Going forward I am hoping to push change data into more of the pages, for instance the order listing page and the recents page, and maybe even rebuild the run page to use them so it loads faster. The nice thing about RIs is that we can pull really big working sets out of the database while doing the delta calculations. Hopefully that makes them more accurate.
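To make the caching idea concrete, here is a toy sketch (not LNT's actual code; the function name, data layout, and numbers are all invented for illustration) of computing a run-over-run delta once at submission time and caching it so that later page loads need only a cheap lookup:

```python
# Toy illustration of the RI/FieldChange idea: compute the delta between
# two orders once, cache it, and let listing pages read the cached value
# instead of re-fetching all the underlying samples.
samples = {
    # (machine, test, order) -> measured values (made-up numbers)
    ("m1", "t1", 10): [1.00, 1.02],
    ("m1", "t1", 11): [1.50, 1.52],
}

ri_cache = {}

def record_field_change(machine, test, prev_order, cur_order):
    # Use the best (minimum) sample from each run, as a simple aggregate.
    prev = min(samples[(machine, test, prev_order)])
    cur = min(samples[(machine, test, cur_order)])
    ri_cache[(machine, test, prev_order, cur_order)] = (cur - prev) / prev

record_field_change("m1", "t1", 10, 11)

# A page listing many runs now does one lookup per run, not a full
# sample fetch and comparison.
delta = ri_cache[("m1", "t1", 10, 11)]
print(round(delta, 2))  # → 0.5
```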
I have a number of ideas to further speed up processing. I have made a bunch of schema changes locally that work great, but I have been putting off setting up a migration for them.
Could you let me know the breakdown of work in your system? The way I have been doing this is wrapping the async worker in a cProfiler, and having it print a profile file per worker.
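As a concrete illustration of that profiling approach, here is a minimal sketch that wraps a worker call in cProfile and dumps one profile file per process. The worker body `do_work` and the output path are invented stand-ins, not LNT's actual worker code:

```python
# Sketch: wrap a background worker in cProfile and write one .prof file
# per worker process, keyed by PID so concurrent workers don't collide.
import cProfile
import os
import pstats
import tempfile

def do_work(n):
    # Stand-in for the real worker body (e.g. post_submit_tasks).
    return sum(i * i for i in range(n))

def profiled_worker(n):
    profiler = cProfile.Profile()
    result = profiler.runcall(do_work, n)
    out = os.path.join(tempfile.gettempdir(), "worker-%d.prof" % os.getpid())
    profiler.dump_stats(out)
    return result, out

result, path = profiled_worker(10000)

# The dumped file can be inspected later with pstats.
stats = pstats.Stats(path)
```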
There are two big time sinks for me: the RI calculation, which on my servers is now 50% bound on the Mann-Whitney calculations and 50% on DB operations, and the regression_evolution, which tends to be DB bound and, for me, is the slower of the two. Both of those are very different from a few weeks ago, when they were very DB bound.
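For readers unfamiliar with the test mentioned above: LNT ships its own Mann-Whitney implementation, and the following pure-Python sketch (with made-up sample values) only illustrates the U statistic that underlies that kind of change detection:

```python
# Simplified illustration of the Mann-Whitney U statistic (no tie
# correction, no p-value) used to decide whether two sample sets differ.
def mann_whitney_u(xs, ys):
    """Return U: over all pairs, how often an x beats a y (ties count half)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

before = [10.1, 10.3, 10.2, 10.4, 10.2]  # baseline samples (made up)
after = [11.0, 11.2, 11.1, 10.9, 11.3]   # current samples (made up)

u = mann_whitney_u(before, after)
# Fully separated samples give an extreme U (0 or len(xs)*len(ys)),
# which is strong evidence of a genuine change rather than noise.
print(u)  # → 0.0
```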
I was also thinking about bringing the worker Pool back. I switched it to background processes; however, that means you can have a lot of them running at once when a lot of submissions come in.
Thanks for the detailed reply, I'm still seeing the slow imports with latest
trunk (r306363).
It might be best if I first describe our workflow. We're an abnormal user of
LNT: we're using it for analysis of GCC against a large set of benchmarks
which run outside the LNT framework, and for which we have a large amount of
historical data. However, we've had good success in the past using the
command-line import.
We prepare JSON files (at last count, more than 3000, each containing around
2400 tests) each night that represent all our historic benchmarking data. As we
might need to invalidate this data, and as it is currently completely decoupled
from LNT, we first delete our LNT database entirely, then import the full set
of 3000 runs. In other words:
rm -rf /path/to/lnt
lnt create /path/to/lnt
lnt import --commit=1 /path/to/lnt /path/to/3000/runs/*
((As an aside, we're well aware that this is a huge and horrifying hack! What
we're really looking for is a way to handle the infrequent case that we need to
remove a run that we've declared "invalid". Something like
lnt remove machine-name.$RUN
that purges the run with run_order $RUN for the appropriate machine would avoid
the hammer we're currently using))
That might give you some idea as to why a 10-60x more expensive import becomes
such a large problem for us: we feel that effect multiplied over 3000 runs!
I've tried gathering some more data for you on where we're spending time. I've
disabled the async workers by changing the call in ImportData to a direct call
to post_submit_tasks. Looking at an example of a typical slow run where the
analysis cost is low, here is the "standard" output:
--- Tested: 2260 tests --
Processing Times
----------------
Load : 0.00s
Import : 0.13s
Report : 11.43s
Total : 11.44s
Imported Data
-------------
Added Machines: 1
Added Runs : 1
Added Tests : 325
Results
----------------
PASS : 2260
Digging in a little with cProfile, query calls are taking up a huge portion of
the run time:
1 0.000 0.000 11.003 11.003 <string>:1(<module>)
1 0.000 0.000 11.003 11.003 commands.py:42(timed)
1 0.096 0.096 11.002 11.002 fieldchange.py:58(regenerate_fieldchanges_for_run)
2274 0.031 0.000 7.881 0.003 query.py:2339(one)
2279 0.012 0.000 7.704 0.003 query.py:2399(__iter__)
2279 0.014 0.000 7.130 0.003 query.py:2413(_execute_and_instances)
2279 0.007 0.000 7.069 0.003 base.py:652(execute)
More precisely, these are the query calls in the inner loop of
regenerate_fieldchanges_for_run. That is to say, this query:
try:
f = ts.query(ts.FieldChange) \
.filter(ts.FieldChange.start_order == start_order) \
.filter(ts.FieldChange.end_order == end_order) \
.filter(ts.FieldChange.test_id == test_id) \
.filter(ts.FieldChange.machine == run.machine) \
.filter(ts.FieldChange.field == field) \
.one()
except sqlalchemy.orm.exc.NoResultFound:
f = None
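One possible way to avoid issuing one such point query per (test, field) pair would be to fetch all existing FieldChanges for the machine and order window in a single query up front, then probe an in-memory set in the inner loop. The following is only a sketch of that idea using a toy sqlite3 schema, not LNT's actual models or a proposed patch:

```python
# Sketch of batching the FieldChange lookups: one SELECT for the whole
# run instead of one SELECT per (test, field) pair.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE field_change
                (machine_id INT, test_id INT, field_id INT,
                 start_order INT, end_order INT)""")
conn.execute("INSERT INTO field_change VALUES (1, 7, 2, 10, 11)")

machine_id, start_order, end_order = 1, 10, 11

# One round trip instead of len(test_ids) * len(fields) round trips.
rows = conn.execute(
    """SELECT test_id, field_id FROM field_change
       WHERE machine_id = ? AND start_order = ? AND end_order = ?""",
    (machine_id, start_order, end_order)).fetchall()
existing = {(t, f) for t, f in rows}

# The inner loop now does a set probe, not a query.
found = (7, 2) in existing       # a FieldChange already exists
missing = (8, 2) in existing     # would need to be created
```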
Removing the cProfile wrapper to avoid the overhead it introduces, and adding
some timing calls around this query to track the average time taken per query
and the number of queries, it is clear these are expensive. In the dump below,
the outer loop is "for field in list(ts.Sample.get_metric_fields())" and the
inner loop is "for test_id in runinfo.test_ids". I count a success as reaching
the end of the try block; a failure enters the except block.
Outer loop: 5 iterations
Inner loop: 2260 total iterations, 452 iterations per outer loop iteration
Inner loop: 100% Failed queries (except block executed)
Fail: Total time 3.1712474823 Average time 0.00140320685058
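The instrumentation described above can be sketched like this, with `run_query` as an invented stand-in for the real `ts.query(...).one()` call, here always raising to mimic the 100% failure rate we observed:

```python
# Sketch of the timing instrumentation: count successes vs. failures and
# accumulate the total time spent in the failing path.
import time

class NoResultFound(Exception):
    pass

def run_query(test_id):
    # Stand-in: pretend no FieldChange exists yet, as in a bulk import.
    raise NoResultFound()

fail_count, fail_time = 0, 0.0
for test_id in range(2260):  # stands in for runinfo.test_ids
    start = time.time()
    try:
        f = run_query(test_id)
    except NoResultFound:
        f = None
        fail_count += 1
        fail_time += time.time() - start

print(fail_count)  # → 2260
```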
I would guess that for a well-formed bulk import like ours (i.e. we are
importing all data for any given run in one go) we should never see those
queries for FieldChanges succeed.
Repeating the cProfile experiment with the even more expensive runs shows a
cost which is roughly half spent on the queries above and half in
identify_related_changes, as you predicted.
If you have command line access to the server, you can delete anything using
the updatedb command:
A run:
lnt updatedb --database={} --commit=1 --testsuite={} --delete-run={} /etc/lnt/lnt.cfg
A machine and all runs on it:
lnt updatedb --database={} --commit=1 --testsuite={} --delete-machine={} /etc/lnt/lnt.cfg
All runs with an order, plus the order itself:
lnt updatedb --database={} --commit=1 --testsuite={} --delete-order={} /etc/lnt/lnt.cfg
Matthias and I are also working on a new REST API, from which you will also be
able to delete anything (with the right credentials).
In my local instances, I added an index on FieldChange.start_order and
FieldChange.end_order, and that made that query much, much faster. I think it
was defaulting to a full table scan. I deferred writing a migration for that;
I was hoping to batch a bunch of my optimizations into one migration.
Migrations are a bother to write. I guess now is a good time to do that, though.
Let me make a migration that adds that index for everyone, and we can profile
again.
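For illustration, the composite index being described might look like the following plain-SQL sketch against a toy table. The real change would land as an LNT schema migration, and the table and column names here only mirror FieldChange rather than reproduce LNT's schema:

```python
# Sketch: a composite index on (start_order, end_order) so the filter in
# regenerate_fieldchanges_for_run can seek instead of scanning the table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE field_change
                (start_order INT, end_order INT, test_id INT)""")
conn.execute("""CREATE INDEX ix_field_change_orders
                ON field_change (start_order, end_order)""")

# SQLite's query planner confirms the index is used for this filter shape.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM field_change "
    "WHERE start_order = 1 AND end_order = 2").fetchall()
print(plan[0][-1])
```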