getslash / backslash

Centralized test reporting and monitoring service

Configuring data replication #580

Open bpowell65536 opened 5 years ago

bpowell65536 commented 5 years ago

I have been trying to configure replication of a Backslash instance to an Elasticsearch instance, and the process seems straightforward, but I am having trouble getting it to work.

Backslash version: 2.17.2 (UI 2.17.2), running on Ubuntu 18.04.3, installed using the instructions at https://backslash.readthedocs.io/en/latest/installation.html
Elasticsearch version: 5.4.3, run in a Docker container on Ubuntu 18.04.3 using docker run -p 9200:9200 -p 9300:9300 -e discovery.type=single-node docker.elastic.co/elasticsearch/elasticsearch:5.4.3

I have seen that replication doesn't support the latest version of Elasticsearch (#570), but I can't find any information about which versions of Elasticsearch should be supported.

Steps I took to reproduce this issue:

The only log messages I can find related to this are:

Oct 24 22:01:58 bpowell-nz-vl2 7042cafe1e8a[1140]: [2019-10-24 22:01:58,633: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[b0408a37-afd0-415e-89f4-bfb67b6603d2]
Oct 24 22:01:58 bpowell-nz-vl2 7042cafe1e8a[1140]: [2019-10-24 22:01:58,640: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[b0408a37-afd0-415e-89f4-bfb67b6603d2] succeeded in 0.005087274999823421s: None

Are there some more detailed instructions for configuring Backslash replication that I can follow besides https://backslash.readthedocs.io/en/latest/server_configuration.html#data-replication?

vmalloc commented 5 years ago

Nothing else in the worker log?

In general, during the initial resync the percentages aren't always accurate at the moment. The important thing is that the replication advances. Unfortunately, the only way to truly know that right now is through the database itself (or by viewing the REST API calls the browser makes and the responses they return). Can you confirm that the state remains exactly the same and is not advancing at all?

bpowell65536 commented 5 years ago

I couldn't see anything else that looked relevant in the worker log, but for completeness here are the log messages that appear while the replication is in progress:

[2019-10-28 19:02:43,787: INFO/ForkPoolWorker-3] Raven is not configured (logging is disabled). Please see the documentation for more information.
[2019-10-28 19:02:43,797: INFO/ForkPoolWorker-3] Started
[2019-10-28 19:02:44,349: INFO/ForkPoolWorker-3] Task flask_app.tasks.migrations.start_live_migrations[2d278ed3-e3fe-4aec-925a-023fe4cbf903] succeeded in 0.5734206900233403s: None
[2019-10-28 19:03:04,282: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:03:04,286: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9]
[2019-10-28 19:03:04,288: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9] succeeded in 0.0010479899938218296s: None
[2019-10-28 19:04:04,313: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:04:04,319: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700]
[2019-10-28 19:04:04,322: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700] succeeded in 0.0008305439841933548s: None
[2019-10-28 19:04:32,948: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4]
[2019-10-28 19:04:32,957: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4] succeeded in 0.007307631021831185s: None
[2019-10-28 19:05:04,322: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:05:04,328: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e]
[2019-10-28 19:05:04,330: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e] succeeded in 0.0005721849738620222s: None
[2019-10-28 19:06:04,374: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:06:04,379: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31]
[2019-10-28 19:06:04,383: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31] succeeded in 0.0008473509806208313s: None
[2019-10-28 19:07:04,421: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:07:04,426: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2]
[2019-10-28 19:07:04,432: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2] succeeded in 0.0008454519556835294s: None
[2019-10-28 19:08:04,472: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:08:04,477: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f]
[2019-10-28 19:08:04,481: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f] succeeded in 0.001215384982060641s: None
[2019-10-28 19:09:04,523: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:09:04,532: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd]
[2019-10-28 19:09:04,534: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd] succeeded in 0.0006121189799159765s: None

(the replication task starts at 2019-10-28 19:04:32,948)

Yes, the replication state remains the same for the entire replication attempt.

I have looked at the Elasticsearch REST API, and the "backslash" index that should have been created for the test records does not exist. I tried creating that index manually, but no records are added to it.
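
For reference, here is the equivalent of the check I did against the ES REST API, written as a small Python snippet for clarity (a minimal sketch assuming the requests package; not the exact commands I ran):

import requests

ES_URL = "http://bpowell-cd-2536:9200"  # the Elasticsearch endpoint configured for the replication

# List every index the cluster knows about, as a plain-text table.
print(requests.get(ES_URL + "/_cat/indices?v").text)

# Ask for the "backslash" index directly; a 404 means it was never created.
response = requests.get(ES_URL + "/backslash")
print(response.status_code, response.reason)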

I also tried adding the default Elasticsearch REST API credentials to the replication setup, but the behaviour does not change.

ayalash commented 5 years ago

@vmalloc Please note that the Elasticsearch version @bpowell65536 is using is quite old; the current (latest) version is 7.4.1. Could the problem he is seeing be caused by a backwards incompatibility?

We run replication successfully with version 6.2.2.

vmalloc commented 5 years ago

Definitely - an old ES can absolutely cause issues with replication

bpowell65536 commented 5 years ago

Thanks for that info, @ayalash. I had seen in #570 that replication doesn't work with some versions of ES, but that issue didn't say which ones. I will try 6.2.2 and see how it goes.

bpowell65536 commented 5 years ago

I tried ES 6.2.2, but the behaviour is still the same. Which versions are supported? Is it just 6.2.2?

vmalloc commented 5 years ago

There is no clear compatibility statement, as we didn't perform that many conformance tests, but I think anything >6 should be fine.

I think the only option is to look into the specific installation you have to see how (if at all) it progresses. Are you familiar with how to open a connection to Backslash's internal DB?

bpowell65536 commented 5 years ago

I can connect to the database, but I don't know much about Postgres, so I will have to figure it out as I go 😄. Are there any specific queries you want me to run?

jenkins@bpowell-nz-vl2:~$ psql -U backslash -h 127.0.0.1 -p 5432
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1), server 9.6.15)
Type "help" for help.

backslash=# \dt
                 List of relations
 Schema |         Name         | Type  |   Owner
--------+----------------------+-------+-----------
 public | alembic_version      | table | backslash
 public | app_config           | table | backslash
 public | background_migration | table | backslash
 public | comment              | table | backslash
 public | entity               | table | backslash
 public | error                | table | backslash
 public | label                | table | backslash
 public | product              | table | backslash
 public | product_revision     | table | backslash
 public | product_version      | table | backslash
 public | replication          | table | backslash
 public | role                 | table | backslash
 public | roles_users          | table | backslash
 public | run_token            | table | backslash
 public | session              | table | backslash
 public | session_entity       | table | backslash
 public | session_label        | table | backslash
 public | session_metadata     | table | backslash
 public | session_subject      | table | backslash
 public | subject              | table | backslash
 public | subject_instance     | table | backslash
 public | test                 | table | backslash
 public | test_entity          | table | backslash
 public | test_information     | table | backslash
 public | test_metadata        | table | backslash
 public | test_variation       | table | backslash
 public | timing               | table | backslash
 public | user                 | table | backslash
 public | user_preference      | table | backslash
 public | user_starred_tests   | table | backslash
 public | warning              | table | backslash
(31 rows)

backslash=#

vmalloc commented 5 years ago

Let's start with several runs of SELECT * FROM replication, with about 30 minutes between each execution -- this will clarify whether there is any progress at all...
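
For example (reusing the connection parameters from the psql session above):

psql -U backslash -h 127.0.0.1 -p 5432
backslash=# SELECT * FROM replication;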

vmalloc commented 5 years ago

PLEASE NOTE that you will probably need to sanitize the output before pasting it here, as it could contain passwords.

bpowell65536 commented 5 years ago

Here are the contents of the replication table, dumped every 30 minutes (the username and password fields did have the correct values for my ES instance in them, but I removed those).

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

Is the index_name field supposed to be empty? I didn't provide that value when configuring the replication because the UI says it defaults to "backslash".

There doesn't seem to be any progress, but no records have been replicated either. There are definitely sessions in the database:

backslash=# select id,logical_id,start_time from session;
 id |               logical_id               |    start_time
----+----------------------------------------+------------------
  1 | f65aa066-f6a2-11e9-a46a-782bcb9c4f51_0 | 1571951539.27069
  2 | b07ef4ac-f6a7-11e9-a46a-782bcb9c4f51_0 | 1571953569.48793
  3 | c27f7f2a-f9c0-11e9-a46a-782bcb9c4f51_0 |  1572294191.5634
(3 rows)
backslash=#

I ran another test which pushed results to Backslash to see if that would change the replication status, and while the new test session appeared in the session table, the replication status did not change.

vmalloc commented 5 years ago

It looks like the replication hasn't even started... Is it possible your URL isn't reachable from within the worker container?

bpowell65536 commented 5 years ago

The ES URL is reachable from the machine that the worker container is running on, using ping and also using curl to access the ES REST interface. I started an interactive shell on the backslash_worker_1 container using docker exec -it backslash_worker_1 bash and the ES URL was reachable from there too (using ping and curl again). Is there anything more specific I should try to see if the URL is reachable?

I tried running an ES Docker container on the same machine as the Backslash containers and replicating data to that, but the replication still doesn't start.

vmalloc commented 5 years ago

This is really strange. May I suggest adding some debug information inside the container around https://github.com/getslash/backslash/blob/af49b00328a18f34d8b4e642a2fc2f4879d25414/flask_app/tasks/replications.py#L335? We will need to pin down exactly what the replication process is doing and where it is stuck... There is also a slight chance that there is a flaw in the replication chunking logic (fetching the next bulk to be replicated) and that this causes the issue...

bpowell65536 commented 5 years ago

I added some debug logging to the start of the do_elasticsearch_replication function:

@reliable_task
@needs_app_context
def do_elasticsearch_replication(replica_id=None, reconfigure=True):
    _logger.debug("replica_id is %r" % replica_id)
    if replica_id is None:
        for replica in models.Replication.query.all():
            _logger.debug("do_elasticsearch_replication.apply_async for replica %r" % replica)
            do_elasticsearch_replication.apply_async((replica.id,))
        return

    replica = models.Replication.query.get(replica_id)
    _logger.debug("got replica %r for replica_id %r" % (replica, replica_id))
    if replica is None:
        _logger.debug(f'Replica {replica_id} already deleted. Not doing anything')
        return
    if replica.paused:
        _logger.debug(f'Replica {replica_id} is paused')
        return
    sleep_time = 60
    results = None
    try:
        es_index = index_from_replica(replica)
        _logger.debug("es_index for replica %r is %r" % (replica, es_index))
        if not es_index:
            _logger.debug("No es_index, returning")
            return
        if reconfigure:
            _logger.debug("Reconfiguring es_index")
            es_index.reconfigure()
        _logger.debug("estimating time left")
        es_index.estimate_time_left()
        start_time = replica.last_chunk_finished or flux.current_timeline.time()

        results = [es_index.handle_result(result) \
                   for result in models.db.session.execute(get_next_bulk_query(es_index.get_fields_query(), replica, es_index))]
        num_replicated, _ = es_helpers.bulk(es_index.get_client(), results)
        _logger.debug("num_replicated = %s" % num_replicated)
        end_time = flux.current_timeline.time()

Creating and starting a new replication task now produces this log output from the worker container:

[2019-11-05 19:18:48,002: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd]
[2019-11-05 19:18:48.024929] DEBUG: flask_app.tasks.replications: replica_id is 24
[2019-11-05 19:18:48.031890] DEBUG: flask_app.tasks.replications: got replica <Replica to http://bpowell-cd-2536:9200> for replica_id 24
[2019-11-05 19:18:48.032145] DEBUG: flask_app.tasks.replications: es_index for replica <Replica to http://bpowell-cd-2536:9200> is None
[2019-11-05 19:18:48.032241] DEBUG: flask_app.tasks.replications: No es_index, returning
[2019-11-05 19:18:48,032: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd] succeeded in 0.024834798998199403s: None

This explains why the task completes so quickly, and possibly why it looks like the replication never starts, but I don't understand why es_index is None. Should es_index being None really be a silent failure? I'll keep digging and see what I can find.
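
Something like the following (purely hypothetical, just to illustrate the point; index_name is the column shown in the replication table above) would at least make the problem visible in the worker log instead of letting the task succeed silently:

        if not es_index:
            # Hypothetical: log a warning instead of returning silently, so a
            # misconfigured or empty index name shows up in the worker log.
            _logger.warning("No Elasticsearch index handler for replica %r (index_name=%r); skipping replication"
                            % (replica, replica.index_name))
            return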

bpowell65536 commented 5 years ago

The es_index value comes from the index_from_replica function at https://github.com/getslash/backslash/blob/develop/flask_app/tasks/replications.py#L330. This function expects the index name (the value taken from the "Index name" field on the replication configuration page) to be either "test" or "session", and it returns None in all other cases. The replication configuration page says the index name defaults to "backslash", and based on that I left the field blank, which appears to have caused all the problems here. Creating one replication with the index name set to "test" and one with the index name set to "session" lets Backslash successfully replicate all of its data to Elasticsearch, but it seems a bit unintuitive. Perhaps the documentation around replication needs to be updated?
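
In other words, the logic boils down to something like this (my paraphrase of the behaviour described above, not the actual source; the Replica stand-in and the returned strings are placeholders for illustration):

from collections import namedtuple

Replica = namedtuple('Replica', 'index_name')  # stand-in for models.Replication

def index_from_replica(replica):
    # Paraphrase: only the literal index names "test" and "session" are
    # recognized; anything else (including the empty string stored when the
    # field is left blank) yields None, which makes
    # do_elasticsearch_replication return without doing anything.
    if replica.index_name == 'test':
        return 'test index handler'      # placeholder for the real index object
    if replica.index_name == 'session':
        return 'session index handler'   # placeholder for the real index object
    return None

print(index_from_replica(Replica('test')))       # -> test index handler
print(index_from_replica(Replica('')))           # -> None (what happened in my case)
print(index_from_replica(Replica('backslash')))  # -> None, despite what the UI implies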

vmalloc commented 5 years ago

Good catch! That doesn't seem like the behavior we want - it sounds like a bug...

bpowell65536 commented 5 years ago

Thanks! Yeah, I was expecting that all the data would be replicated to a single "backslash" index if I left the index name field blank - is that the intended behaviour?

vmalloc commented 5 years ago

This sounds like a bug in the frontend/UI. I am guessing the "new" route sends an empty string, and since the API now accepts the index name as given, it ends up stored as empty/null. @maormarcus can you confirm?