Nothing else in the worker log?
In general, during initial resync the percentages aren't always accurate at the moment. The important thing is that the replication advances. Unfortunately, the only way to truly know that right now is through the database itself (or by viewing the REST API calls the browser makes and the responses returned). Can you confirm the state remains exactly the same and is not advancing at all?
I couldn't see anything else that looked relevant in the worker log, but for completeness here are the log messages that appear while the replication is in progress:
[2019-10-28 19:02:43,787: INFO/ForkPoolWorker-3] Raven is not configured (logging is disabled). Please see the documentation for more information.
[2019-10-28 19:02:43,797: INFO/ForkPoolWorker-3] Started
[2019-10-28 19:02:44,349: INFO/ForkPoolWorker-3] Task flask_app.tasks.migrations.start_live_migrations[2d278ed3-e3fe-4aec-925a-023fe4cbf903] succeeded in 0.5734206900233403s: None
[2019-10-28 19:03:04,282: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:03:04,286: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9]
[2019-10-28 19:03:04,288: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9] succeeded in 0.0010479899938218296s: None
[2019-10-28 19:04:04,313: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:04:04,319: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700]
[2019-10-28 19:04:04,322: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700] succeeded in 0.0008305439841933548s: None
[2019-10-28 19:04:32,948: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4]
[2019-10-28 19:04:32,957: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4] succeeded in 0.007307631021831185s: None
[2019-10-28 19:05:04,322: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:05:04,328: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e]
[2019-10-28 19:05:04,330: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e] succeeded in 0.0005721849738620222s: None
[2019-10-28 19:06:04,374: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:06:04,379: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31]
[2019-10-28 19:06:04,383: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31] succeeded in 0.0008473509806208313s: None
[2019-10-28 19:07:04,421: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:07:04,426: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2]
[2019-10-28 19:07:04,432: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2] succeeded in 0.0008454519556835294s: None
[2019-10-28 19:08:04,472: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:08:04,477: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f]
[2019-10-28 19:08:04,481: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f] succeeded in 0.001215384982060641s: None
[2019-10-28 19:09:04,523: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:09:04,532: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd]
[2019-10-28 19:09:04,534: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd] succeeded in 0.0006121189799159765s: None
(the replication task starts at 2019-10-28 19:04:32,948)
Yes, the replication state remains the same for the entire replication attempt.
I have looked at the Elasticsearch REST API, and the "backslash" index that should have been created for the test records does not exist. I tried creating that index manually, but no records are added to it.
I also tried adding the default Elasticsearch REST API credentials to the replication setup, but the behaviour does not change.
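For what it's worth, this is roughly how I checked on the Elasticsearch side (a minimal sketch using the Python requests library; the URL and index name are simply the ones from my setup, so treat them as assumptions):
# Minimal sketch: ask the Elasticsearch REST API which indices exist and how many
# documents the expected index holds. URL and index name are from my setup.
import requests

ES_URL = "http://bpowell-cd-2536:9200"  # my ES endpoint
INDEX = "backslash"                     # the index I expected to be created

# _cat/indices lists every index, which shows whether anything was created at all.
print(requests.get(f"{ES_URL}/_cat/indices?v").text)

# /<index>/_count returns the document count, or 404 if the index does not exist.
resp = requests.get(f"{ES_URL}/{INDEX}/_count")
print(resp.status_code, resp.text)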
@vmalloc Please notice that the Elasticsearch version @bpowell65536 is using is pretty old (the current latest release is 7.4.1). Can it be that the problem he is having is due to a backwards incompatibility?
We run replication successfully with version 6.2.2.
Definitely - an old ES can absolutely cause issues with replication
Thanks for that info, @ayalash. I had seen in #570 that replication doesn't work with some versions of ES, but that issue didn't say which ones. I will try 6.2.2 and see how it goes.
I tried ES 6.2.2, but the behaviour is still the same. Which versions are supported? Is it just 6.2.2?
There is no clear statement, since we haven't run that many conformance tests, but I think anything >6 should be fine.
I think the only option is to look into the specific installation you have to see how (if at all) it progresses. Are you familiar with how to open a connection to Backslash's internal DB?
I can connect to the database, but I don't know much about Postgres so will have to figure it out as I go 😄. Are there any specific queries you want me to run?
jenkins@bpowell-nz-vl2:~$ psql -U backslash -h 127.0.0.1 -p 5432
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1), server 9.6.15)
Type "help" for help.
backslash=# \dt
List of relations
Schema | Name | Type | Owner
--------+----------------------+-------+-----------
public | alembic_version | table | backslash
public | app_config | table | backslash
public | background_migration | table | backslash
public | comment | table | backslash
public | entity | table | backslash
public | error | table | backslash
public | label | table | backslash
public | product | table | backslash
public | product_revision | table | backslash
public | product_version | table | backslash
public | replication | table | backslash
public | role | table | backslash
public | roles_users | table | backslash
public | run_token | table | backslash
public | session | table | backslash
public | session_entity | table | backslash
public | session_label | table | backslash
public | session_metadata | table | backslash
public | session_subject | table | backslash
public | subject | table | backslash
public | subject_instance | table | backslash
public | test | table | backslash
public | test_entity | table | backslash
public | test_information | table | backslash
public | test_metadata | table | backslash
public | test_variation | table | backslash
public | timing | table | backslash
public | user | table | backslash
public | user_preference | table | backslash
public | user_starred_tests | table | backslash
public | warning | table | backslash
(31 rows)
backslash=#
Let's start with several SELECT * FROM replication queries, with about 30 minutes between each execution -- this will clarify whether or not there is any progress at all...
PLEASE NOTE that the output will probably need to be sanitized by you prior to pasting, as it could contain passwords.
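If it helps, something along these lines could automate the dumps and blank out the credential columns before printing (a rough sketch with psycopg2; the connection parameters are assumptions based on the psql command you showed, and you may need to supply a password):
# Rough sketch: poll the replication table roughly every 30 minutes and redact
# the credential columns before printing. Connection parameters are assumptions.
import time
import psycopg2

REDACT = {"username", "password"}

conn = psycopg2.connect(host="127.0.0.1", port=5432, dbname="backslash", user="backslash")

while True:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM replication")
        columns = [desc[0] for desc in cur.description]
        for row in cur.fetchall():
            print({col: ("<redacted>" if col in REDACT and val else val)
                   for col, val in zip(columns, row)})
    conn.commit()        # end the transaction so the next poll sees fresh data
    time.sleep(30 * 60)  # about 30 minutes between dumps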
Here are the contents of the replication table, dumped every 30 minutes (the username and password fields did have the correct values for my ES instance in them, but I removed them).
id | service_type | url | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
13 | elastic-search | http://bpowell-cd-2536:9200 | | | | f | f | | | | 1572375905.53291 | | 0
(1 row)
id | service_type | url | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
13 | elastic-search | http://bpowell-cd-2536:9200 | | | | f | f | | | | 1572375905.53291 | | 0
(1 row)
id | service_type | url | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
13 | elastic-search | http://bpowell-cd-2536:9200 | | | | f | f | | | | 1572375905.53291 | | 0
(1 row)
id | service_type | url | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
13 | elastic-search | http://bpowell-cd-2536:9200 | | | | f | f | | | | 1572375905.53291 | | 0
(1 row)
id | service_type | url | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
13 | elastic-search | http://bpowell-cd-2536:9200 | | | | f | f | | | | 1572375905.53291 | | 0
(1 row)
Is the index_name field supposed to be empty? I didn't provide that value when configuring the replication because the UI says it defaults to "backslash".
There doesn't seem to be any progress, but no records have been replicated either. There are definitely sessions in the database:
backslash=# select id,logical_id,start_time from session;
id | logical_id | start_time
----+----------------------------------------+------------------
1 | f65aa066-f6a2-11e9-a46a-782bcb9c4f51_0 | 1571951539.27069
2 | b07ef4ac-f6a7-11e9-a46a-782bcb9c4f51_0 | 1571953569.48793
3 | c27f7f2a-f9c0-11e9-a46a-782bcb9c4f51_0 | 1572294191.5634
(3 rows)
backslash=#
I ran another test which pushed results to Backslash to see if that would change the replication status, and while the new test session appeared in the session table, the replication status did not change.
It looks like the replication hasn't even started... Is it possible your URL isn't reachable from within the worker container?
The ES URL is reachable from the machine that the worker container is running on, using ping and also using curl to access the ES REST interface. I started an interactive shell on the backslash_worker_1 container using docker exec -it backslash_worker_1 bash, and the ES URL was reachable from there too (using ping and curl again). Is there anything more specific I should try to see if the URL is reachable?
I tried running an ES Docker container on the same machine as the Backslash containers and replicating data to that, but the replication still doesn't start.
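For completeness, this is the kind of check I ran from inside the worker container (a minimal sketch using only the standard library; the URL is just my ES endpoint):
# Minimal sketch: hit the ES root endpoint and the cluster health API from inside
# the worker container to confirm HTTP-level reachability, not just ping.
import json
from urllib.request import urlopen

ES_URL = "http://bpowell-cd-2536:9200"  # my ES endpoint

with urlopen(ES_URL, timeout=5) as resp:
    info = json.load(resp)
print("ES version:", info["version"]["number"])

with urlopen(f"{ES_URL}/_cluster/health", timeout=5) as resp:
    health = json.load(resp)
print("Cluster status:", health["status"])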
This is really strange. May I suggest adding some debug information inside the container around https://github.com/getslash/backslash/blob/af49b00328a18f34d8b4e642a2fc2f4879d25414/flask_app/tasks/replications.py#L335 ? We will need to pin down what the replication process is doing exactly and where it is stuck... There is also the slight chance that there is a flaw in the replication chunking logic (fetching the next bulk to be replicated) and that causes the issue...
I added some debug logging to the start of the do_elasticsearch_replication function:
@reliable_task
@needs_app_context
def do_elasticsearch_replication(replica_id=None, reconfigure=True):
    _logger.debug("replica_id is %r" % replica_id)
    if replica_id is None:
        for replica in models.Replication.query.all():
            _logger.debug("do_elasticsearch_replication.apply_async for replica %r" % replica)
            do_elasticsearch_replication.apply_async((replica.id,))
        return
    replica = models.Replication.query.get(replica_id)
    _logger.debug("got replica %r for replica_id %r" % (replica, replica_id))
    if replica is None:
        _logger.debug(f'Replica {replica_id} already deleted. Not doing anything')
        return
    if replica.paused:
        _logger.debug(f'Replica {replica_id} is paused')
        return
    sleep_time = 60
    results = None
    try:
        es_index = index_from_replica(replica)
        _logger.debug("es_index for replica %r is %r" % (replica, es_index))
        if not es_index:
            _logger.debug("No es_index, returning")
            return
        if reconfigure:
            _logger.debug("Reconfiguring es_index")
            es_index.reconfigure()
        _logger.debug("estimating time left")
        es_index.estimate_time_left()
        start_time = replica.last_chunk_finished or flux.current_timeline.time()
        results = [es_index.handle_result(result) \
                   for result in models.db.session.execute(get_next_bulk_query(es_index.get_fields_query(), replica, es_index))]
        num_replicated, _ = es_helpers.bulk(es_index.get_client(), results)
        _logger.debug("num_replicated = %s" % num_replicated)
        end_time = flux.current_timeline.time()
Creating and starting a new replication task now produces this log output from the worker container:
[2019-11-05 19:18:48,002: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd]
[2019-11-05 19:18:48.024929] DEBUG: flask_app.tasks.replications: replica_id is 24
[2019-11-05 19:18:48.031890] DEBUG: flask_app.tasks.replications: got replica <Replica to http://bpowell-cd-2536:9200> for replica_id 24
[2019-11-05 19:18:48.032145] DEBUG: flask_app.tasks.replications: es_index for replica <Replica to http://bpowell-cd-2536:9200> is None
[2019-11-05 19:18:48.032241] DEBUG: flask_app.tasks.replications: No es_index, returning
[2019-11-05 19:18:48,032: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd] succeeded in 0.024834798998199403s: None
This explains why the task gets done so fast, and possibly explains why it looks like it hasn't started, but I don't understand why es_index is None. Should an es_index of None really be a silent failure? I'll keep digging and see what I can find.
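Just to illustrate what I mean by a noisier failure (this is only a sketch reusing the names from the snippet above, not a proposed patch):
# Sketch only: instead of silently returning, log a warning and record the problem
# on the replica row so it shows up in the replication table's last_error column.
if not es_index:
    _logger.warning("No ES index could be resolved for replica %r (index_name=%r)",
                    replica, replica.index_name)
    replica.last_error = "unsupported or missing index name"
    models.db.session.commit()
    return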
The es_index value comes from the index_from_replica function at https://github.com/getslash/backslash/blob/develop/flask_app/tasks/replications.py#L330. This function expects the index name (the value taken from the "Index name" field on the replication configuration page) to be either "test" or "session", and it returns None in all other cases. The replication configuration page says the index name defaults to "backslash", and based on that I left it blank, which appears to have caused all the problems here. Creating one replication with the index name set to "test" and one with the index name set to "session" allows Backslash to successfully replicate all of its data to Elasticsearch, but that seems a bit non-intuitive. Perhaps the documentation around replication needs to be updated?
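For reference, the behaviour I am describing is roughly this (my paraphrase, not the actual source; the class names are placeholders):
# Paraphrase of the behaviour described above -- only "test" and "session" are
# recognized, and everything else (including a blank index name) yields None.
def index_from_replica(replica):
    if replica.index_name == 'test':
        return TestIndex(replica)      # placeholder class name
    if replica.index_name == 'session':
        return SessionIndex(replica)   # placeholder class name
    return None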
Good catch! That doesn't seem like the behavior we want - it sounds like a bug...
Thanks! Yeah, I was expecting that all the data would be replicated to a single "backslash" index if I left the index name field blank - is that the intended behaviour?
This sounds like a bug in the frontend/UI. I am guessing the "new" route sends an empty string, and since the API now accepts the index name this ends up as empty/null. @maormarcus can you confirm?
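Something like the following guard on the API side would at least make the misconfiguration visible (a hypothetical sketch, not the actual Backslash code):
# Hypothetical sketch: reject blank or unrecognized index names when a replication
# is created, instead of silently storing an empty string.
VALID_INDEX_NAMES = {"test", "session"}

def validate_index_name(index_name):
    if not index_name or index_name not in VALID_INDEX_NAMES:
        raise ValueError(f"Unsupported index name {index_name!r}; "
                         f"expected one of {sorted(VALID_INDEX_NAMES)}")
    return index_name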
I have been trying to configure replication of a Backslash instance to an Elasticsearch instance, and the process seems straightforward, but I am having trouble getting it to work.
Backslash version: 2.17.2 (UI 2.17.2), running on Ubuntu 18.04.3, installed using the instructions at https://backslash.readthedocs.io/en/latest/installation.html
Elasticsearch version: 5.4.3, run in a Docker container on Ubuntu 18.04.3 using
docker run -p 9200:9200 -p 9300:9300 -e discovery.type=single-node docker.elastic.co/elasticsearch/elasticsearch:5.4.3
I have seen that replication doesn't support the latest version of Elasticsearch (#570), but I can't find any information about which versions of Elasticsearch should be supported.
Steps I took to reproduce this issue:
The only log messages I can find related to this are:
Are there some more detailed instructions for configuring Backslash replication that I can follow besides https://backslash.readthedocs.io/en/latest/server_configuration.html#data-replication?