
LNT - 'lnt import' sometimes fails when importing data to SQLite #32566

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR33594
Status CONFIRMED
Importance P enhancement
Reported by Leandro Nunes (leandro.nunes@arm.com)
Reported on 2017-06-26 09:34:52 -0700
Last modified on 2017-06-28 10:28:36 -0700
Version unspecified
Hardware PC Linux
CC chris.matthews@apple.com, daniel@zuster.org, james.greenhalgh@arm.com
Fixed by commit(s)
Attachments traceback.txt (4048 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 18714
full traceback

We are observing a database access problem when importing data using "lnt
import" on a SQLite database.

The error message is (see the full traceback attached to this bug report):
sqlalchemy.exc.OperationalError: (OperationalError) database is locked

In a preliminary analysis, it looks like the reason is the asynchronous access
done by lnt.util.async_ops, which creates separate processes that will at some
point access the database.

This is ImportData:121 (rev. 306301):
async_ops.async_fieldchange_calc(db_name, ts, run, config)

So it looks like the profile-import process is clashing with the regression
analysis in some sort of database race condition.
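
For context, the locking behaviour itself is easy to reproduce outside LNT.
The snippet below is a self-contained illustration (not LNT code; the file
name and table are made up): a second connection that tries to write while
another connection holds SQLite's write lock gets exactly this
OperationalError.

  # Standalone illustration (not LNT code): two connections to the same
  # SQLite file; the second writer fails immediately with "database is
  # locked" because the first holds the write lock and the busy timeout is
  # zero. Running this creates demo.db in the current directory.
  import sqlite3

  conn_a = sqlite3.connect("demo.db", timeout=0, isolation_level=None)
  conn_b = sqlite3.connect("demo.db", timeout=0)

  conn_a.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
  conn_a.execute("BEGIN IMMEDIATE")            # take and hold the write lock
  conn_a.execute("INSERT INTO t VALUES (1)")

  try:
      conn_b.execute("INSERT INTO t VALUES (2)")   # the "other process"
  except sqlite3.OperationalError as exc:
      print(exc)                               # prints: database is locked
  finally:
      conn_a.execute("ROLLBACK")
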
Quuxplusone commented 7 years ago

Attached traceback.txt (4048 bytes, text/plain): full traceback

Quuxplusone commented 7 years ago
With only a very rudimentary understanding of the LNT codebase, I have a few
observations.

This issue (crash caused by competition for the database lock) shows up when
the processing time for an import is increased.

Bisecting for the sources of increased import time, and therefore increased
chance of collision, I found that the most expensive change was r299139:

    Don’t use flask to fetch DB info

    When running from the command line, flask is not always active. To run
    background workers, take the DBConfig object directly, and use that
    to setup the DB connection in the workers.

I found that after this patch, some imports into a large SQLite database were
taking more than 30 seconds (up from 0.5 seconds before this patch / with this
patch reverted).

What I've observed is that after this patch the "RIs" line is printed (i.e.
that print statement is now reached), while it wasn't before. Calculating the
RIs seems to be the thing which takes the most time. Walking back up the call
chain, before this patch we never executed any of identify_related_changes,
regenerate_fieldchanges_for_run or post_submit_tasks. That is to say, the call
to async_fieldchange_calc was always failing before doing any work.

Digging further into that, the reason for the change in behaviour after your
patch is obvious. We were previously crashing fast on the call into flask:

  Subprocess failed with:Traceback (most recent call last):
  File ".../lnt/util/async_ops.py", line 136, in async_wrapper
    _v4db = current_app.old_config.get_database(ts_args['db'])
  File ".../lib/python2.7/site-packages/werkzeug/local.py", line 343, in __getattr__
    return getattr(self._get_current_object(), name)
  File ".../lib/python2.7/site-packages/werkzeug/local.py", line 302, in _get_current_object
    return self.__local()
  File ".../lib/python2.7/site-packages/flask/globals.py", line 34, in _find_app
    raise RuntimeError('working outside of application context')
  RuntimeError: working outside of application context

That gives us a good explanation for the slowdown! Previously we were just not
executing a huge chunk of code; now we are!

However, while obviously incorrect, the behaviour we had before your patch
worked just fine for our workflow, and allowed us to quickly import huge
numbers of runs (>3000). After your change, with some runs taking upwards of
30 seconds, that's no longer feasible. I appreciate that we are using LNT
wrong, but what extra benefit do we get from the calculations in
identify_related_changes? Would it be feasible to skip them when we are in a
bulk import mode, where we are less interested in run-to-run information and
expect to pick that up from the web interface in future?

Chris, do you have any feedback or ideas?
Quuxplusone commented 7 years ago

It is intended that there is a lot of work done there; that is the reason it is done in an async background worker. I assume the issue you are currently hitting is that too many workers are running? The submission itself is not gated on the workers finishing, but the LNT server will continue to churn for a while after the run submission.

I'd recommend you try again: I have made a number of major changes to that code path in the last two weeks. We were having scale issues on our larger LNT servers, so I spent some time optimizing those code paths. They now make significantly fewer queries, process much less data, and make fewer DB commits.

The RIs that are being calculated are intended to be cached versions of the changes LNT usually calculates when you look at the full run comparison. The problem they solve is that when you want to list many runs on a page, you cannot fetch hundreds of runs with thousands of samples on page load just to calculate the deltas. Going forward I am hoping to push change data into more of the pages, for instance the order listing page and the recents page, and maybe even rebuild the run page to use them so it loads faster. The nice thing about RIs is that we can pull really big working sets out of the database while doing the delta calculations, which hopefully makes them more accurate.

I have a number of ideas to further speed up processing: I have made a bunch of schema changes locally that work great, but I have been putting off setting up a migration for them.

Could you let me know the breakdown of work in your system? The way I have been doing this is to wrap the async worker in cProfile and have it write a profile file per worker.
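
A rough sketch of that kind of wrapper (not the actual LNT code, just the
general idea of running a worker callable under cProfile and writing one
profile file per worker process) would be:

  # Illustrative only: run a worker callable under cProfile and dump one
  # .prof file per worker process, named after the process id.
  import cProfile
  import functools
  import os

  def profiled(func):
      @functools.wraps(func)
      def wrapper(*args, **kwargs):
          profiler = cProfile.Profile()
          try:
              return profiler.runcall(func, *args, **kwargs)
          finally:
              profiler.dump_stats("worker-%d.prof" % os.getpid())
      return wrapper

The resulting files can then be browsed with the standard pstats module, e.g.
"python -m pstats worker-1234.prof".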

There are two big time sinks for me: the RI calculation, which on my servers is now 50% bound on the Mann-Whitney calculations and 50% on DB operations, and regression_evolution, which tends to be DB bound and is, for me, the slower of the two. Both of those are very different from a few weeks ago, when they were both very DB bound.
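
(For readers unfamiliar with the term: the Mann-Whitney calculation referred
to here is a rank-based test of whether two sets of samples look like they
come from the same distribution. The toy example below uses SciPy purely for
illustration, with invented timings; LNT's own implementation may differ.)

  # Toy Mann-Whitney U test between the samples of one test in two runs;
  # the numbers are invented, and SciPy is used only for illustration.
  from scipy.stats import mannwhitneyu

  previous_run = [0.512, 0.509, 0.511, 0.510]
  current_run = [0.541, 0.544, 0.540, 0.543]

  stat, p_value = mannwhitneyu(previous_run, current_run,
                               alternative="two-sided")
  print("likely a real change" if p_value < 0.05 else "probably just noise")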

I was also thinking about bringing the worker Pool back. I switched it to background processes; however, that means you can have a lot of them running at once when a lot of submissions come in.

Quuxplusone commented 7 years ago
Thanks for the detailed reply. I'm still seeing the slow imports with latest
trunk (r306363).

It might be best if I first describe our workflow. We're an abnormal user of
LNT: we're using it for analysis of GCC against a large set of benchmarks
which run outside the LNT framework, and for which we have a large amount of
historical data. However, we've had good success in the past using the
command-line import.

Each night we prepare JSON files (at last count, more than 3000, each
containing around 2400 tests) that represent all our historic benchmarking
data. As we might need to invalidate this data, and as it is currently
completely decoupled from LNT, we first delete our LNT database entirely and
then import the full set of 3000 runs. In other words:

  rm -rf /path/to/lnt
  lnt create /path/to/lnt
  lnt import --commit=1 /path/to/lnt /path/to/3000/runs/*

((As an aside, we're well aware that this is a huge and horrifying hack! What
we're really looking for is a way to handle the infrequent case that we need to
remove a run that we've declared "invalid". Something like

  lnt remove machine-name.$RUN

that purges the run with run_order $RUN for the appropriate machine would avoid
the hammer we're currently using))

That might give you some idea as to why a 10-60x more expensive import becomes
such a large problem for us: we feel that effect multiplied over 3000 runs!

I've tried gathering some more data for you on where we're spending time. I've
disabled the async workers by changing the call in ImportData to a direct call
to post_submit_tasks. Looking at an example of a typical slow run where the
analysis cost is low, here is the "standard" output:

  --- Tested: 2260 tests --
  Processing Times
  ----------------
  Load   : 0.00s
  Import : 0.13s
  Report : 11.43s
  Total  : 11.44s

  Imported Data
  -------------
  Added Machines: 1
  Added Runs    : 1
  Added Tests   : 325

  Results
  ----------------
  PASS : 2260

Digging in a little with cProfile, I can see that query calls take up a huge
portion of the run time:

        1    0.000    0.000   11.003   11.003 <string>:1(<module>)
        1    0.000    0.000   11.003   11.003 commands.py:42(timed)
        1    0.096    0.096   11.002   11.002 fieldchange.py:58(regenerate_fieldchanges_for_run)
     2274    0.031    0.000    7.881    0.003 query.py:2339(one)
     2279    0.012    0.000    7.704    0.003 query.py:2399(__iter__)
     2279    0.014    0.000    7.130    0.003 query.py:2413(_execute_and_instances)
     2279    0.007    0.000    7.069    0.003 base.py:652(execute)

More precisely, these are the query calls in the inner loop of
regenerate_fieldchanges_for_run. That is to say, this query:

            try:
                f = ts.query(ts.FieldChange) \
                    .filter(ts.FieldChange.start_order == start_order) \
                    .filter(ts.FieldChange.end_order == end_order) \
                    .filter(ts.FieldChange.test_id == test_id) \
                    .filter(ts.FieldChange.machine == run.machine) \
                    .filter(ts.FieldChange.field == field) \
                    .one()
            except sqlalchemy.orm.exc.NoResultFound:
                f = None

Removing the cProfile wrapper to avoid the overhead it introduces, and adding
some timing calls around this query to track the number of queries and the
average time taken per query, it is clear that these queries are expensive (a
rough sketch of the instrumentation follows the dump below). In the dump, the
outer loop is "for field in list(ts.Sample.get_metric_fields())" and the inner
loop is "for test_id in runinfo.test_ids". I count a success as reaching the
end of the try block; a failure enters the except block.

  Outer loop: 5 iterations
  Inner loop: 2260 total iterations, 452 iterations per outer loop iteration
  Inner loop: 100% Failed queries (except block executed)
  Fail: Total time 3.1712474823 Average time 0.00140320685058
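
The instrumentation itself was nothing fancier than the following kind of
helper (a sketch with made-up names, not the exact code used):

  # Rough sketch of the timing instrumentation: time each .one() lookup and
  # record whether it found a FieldChange or raised NoResultFound. The names
  # are illustrative, not the exact code used.
  import time
  import sqlalchemy.orm.exc

  class QueryTimer(object):
      def __init__(self):
          self.hits = self.misses = 0
          self.hit_time = self.miss_time = 0.0

      def one_or_none(self, query):
          start = time.time()
          try:
              result = query.one()
              self.hits += 1
              self.hit_time += time.time() - start
              return result
          except sqlalchemy.orm.exc.NoResultFound:
              self.misses += 1
              self.miss_time += time.time() - start
              return None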

I would guess that for a well-formed bulk import like ours (i.e. we are
importing all data for any given run in one go) we should never see those
queries for FieldChanges succeed.

Repeating the cProfile experiment with the even more expensive runs shows a
cost which is roughly half spent on the queries above and half in
identify_related_changes, as you predicted.
Quuxplusone commented 7 years ago
If you have command line access to the server, you can delete anything using
the updatedb command:

A run:
  lnt updatedb --database={} --commit=1 --testsuite={} --delete-run={} /etc/lnt/lnt.cfg

A machine + all runs on it:
  lnt updatedb --database={} --commit=1 --testsuite={} --delete-machine={} /etc/lnt/lnt.cfg

All runs with an order + the order:
  lnt updatedb --database={} --commit=1 --testsuite={} --delete-order={} /etc/lnt/lnt.cfg

Matthias and I are also working on a new REST API, from which you will also be
able to delete anything (with the right credentials).

In my local instances, I added an index on FieldChange.start_order and
FieldChange.end_order, and that made that query much, much faster. I think it
was defaulting to a full table scan. I had deferred writing a migration for
that, hoping to batch a bunch of my optimizations into one migration
(migrations are a bother to write), but I guess now is a good time to do it.

Let me make a migration that adds that index for everyone, and we can profile
again.
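
In the meantime, anyone wanting to try the index by hand could do something
like the sketch below with plain SQLAlchemy. The table and column names are
placeholders that need to be adjusted to the actual field-change table in
your test-suite schema; this is not the migration itself.

  # Minimal sketch, not LNT's migration code: add a composite index over the
  # start/end order columns of the field-change table. Table and column
  # names are placeholders and must match the real schema.
  import sqlalchemy as sa

  def add_fieldchange_order_index(db_url,
                                  table_name="FieldChange",      # placeholder
                                  start_col="StartOrderID",      # placeholder
                                  end_col="EndOrderID"):         # placeholder
      engine = sa.create_engine(db_url)
      ddl = ('CREATE INDEX IF NOT EXISTS ix_fieldchange_start_end_order '
             'ON "%s" (%s, %s)' % (table_name, start_col, end_col))
      with engine.begin() as conn:
          conn.execute(sa.text(ddl))

  # Example (path is a placeholder for wherever the SQLite file lives):
  # add_fieldchange_order_index("sqlite:////path/to/lnt/data/lnt.db")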