epermana / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Replicator with parallel apply is kept ONLINE after reaching offline deferred request #598


GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Set up a master and restore a backup into it; note the last seqno of the backup restore operation (X).
2. Set up a slave with 30 channels.
3. Put the slave online.
4. Immediately issue an offline-deferred request at seqno X (see the command transcript below).
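For reference, steps 3 and 4 correspond roughly to the following commands (288000 stands in for X here; service names and install paths vary per installation):

trepctl online
trepctl offline-deferred -at-seqno 288000
trepctl status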

What is the expected output?

Replicator goes OFFLINE at seqno X.

What do you see instead?

The replicator stops applying events at X-1, but is left ONLINE. Snippet from trepctl status:

appliedLastSeqno       : 287999
...
channels               : 30
...
offlineRequests        : Offline at sequence number: 288000
...
state                  : ONLINE

What is the possible cause?

Some seqno comparison logic mismatch that happens when parallel apply and an offline-deferred request are used together.
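As a purely illustrative aid (hypothetical names, not actual Tungsten Replicator code such as THLParallelQueue), the Java sketch below shows how a deferred-offline check keyed on the minimum applied seqno across parallel channels can stall: with one channel at X and the remaining channels parked at X-1, as the status above and the follow-up analysis in this thread indicate, a min-based check never fires and the state stays ONLINE.

import java.util.Arrays;

public class DeferredOfflineSketch
{
    // Seqno at which offline was requested (X in the steps above).
    static final long OFFLINE_AT_SEQNO = 288000L;

    public static void main(String[] args)
    {
        // 30 parallel channels, as in the report: one channel has applied the
        // target seqno X, the others have stopped at X-1 and will receive no
        // further events.
        long[] appliedSeqnoPerChannel = new long[30];
        Arrays.fill(appliedSeqnoPerChannel, 287999L);
        appliedSeqnoPerChannel[0] = 288000L;

        long minApplied = Arrays.stream(appliedSeqnoPerChannel).min().getAsLong();
        long maxApplied = Arrays.stream(appliedSeqnoPerChannel).max().getAsLong();

        // A check driven by the minimum never becomes true, so the replicator
        // keeps reporting ONLINE even though it has stopped applying events.
        System.out.println("offline (min-based): " + (minApplied >= OFFLINE_AT_SEQNO)); // false
        // A check driven by the maximum becomes true at this point.
        System.out.println("offline (max-based): " + (maxApplied >= OFFLINE_AT_SEQNO)); // true
    }
}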

What is the proposed solution?

1. Find a simple reproducible case.
2. Find and fix the cause, and add tests.

Additional information

All transactions are big in this case; the last one has SQL(371) parts. In this case it is MySQL to Oracle replication, but that does not appear to be the root cause.

Original issue reported on code.google.com by linas.vi...@continuent.com on 20 May 2013 at 1:51

GoogleCodeExporter commented 9 years ago
Position of the stores:

serialized              : false
stopRequested           : false
store.0                 : THLParallelReadTask task_id=0 thread_name=store-thl-0 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=240 discarded=6785 events=0
store.1                 : THLParallelReadTask task_id=1 thread_name=store-thl-1 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.10                : THLParallelReadTask task_id=10 thread_name=store-thl-10 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
store.11                : THLParallelReadTask task_id=11 thread_name=store-thl-11 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.12                : THLParallelReadTask task_id=12 thread_name=store-thl-12 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.13                : THLParallelReadTask task_id=13 thread_name=store-thl-13 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.14                : THLParallelReadTask task_id=14 thread_name=store-thl-14 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.15                : THLParallelReadTask task_id=15 thread_name=store-thl-15 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.16                : THLParallelReadTask task_id=16 thread_name=store-thl-16 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.17                : THLParallelReadTask task_id=17 thread_name=store-thl-17 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.18                : THLParallelReadTask task_id=18 thread_name=store-thl-18 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.19                : THLParallelReadTask task_id=19 thread_name=store-thl-19 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.2                 : THLParallelReadTask task_id=2 thread_name=store-thl-2 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=239 discarded=6786 events=0
store.20                : THLParallelReadTask task_id=20 thread_name=store-thl-20 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.21                : THLParallelReadTask task_id=21 thread_name=store-thl-21 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.22                : THLParallelReadTask task_id=22 thread_name=store-thl-22 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.23                : THLParallelReadTask task_id=23 thread_name=store-thl-23 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.24                : THLParallelReadTask task_id=24 thread_name=store-thl-24 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.25                : THLParallelReadTask task_id=25 thread_name=store-thl-25 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.26                : THLParallelReadTask task_id=26 thread_name=store-thl-26 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.27                : THLParallelReadTask task_id=27 thread_name=store-thl-27 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.28                : THLParallelReadTask task_id=28 thread_name=store-thl-28 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.29                : THLParallelReadTask task_id=29 thread_name=store-thl-29 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=232 discarded=6793 events=0
store.3                 : THLParallelReadTask task_id=3 thread_name=store-thl-3 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=239 discarded=6786 events=0
store.4                 : THLParallelReadTask task_id=4 thread_name=store-thl-4 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=239 discarded=6786 events=0
store.5                 : THLParallelReadTask task_id=5 thread_name=store-thl-5 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
store.6                 : THLParallelReadTask task_id=6 thread_name=store-thl-6 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
store.7                 : THLParallelReadTask task_id=7 thread_name=store-thl-7 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
store.8                 : THLParallelReadTask task_id=8 thread_name=store-thl-8 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
store.9                 : THLParallelReadTask task_id=9 thread_name=store-thl-9 hi_seqno=288000 lo_seqno=280976 read=7026 accepted=238 discarded=6787 events=0
storeClass              : com.continuent.tungsten.replicator.thl.THLParallelQueue
syncInterval            : 10

Original comment by linas.vi...@continuent.com on 20 May 2013 at 1:55

GoogleCodeExporter commented 9 years ago
It seems that the replicator reached the right seqno, but it does not display it.

./cookbook/trepctl status
Processing status command...
NAME                     VALUE
----                     -----
appliedLastEventId     : mysql-bin.000002:0000000076111508;0
appliedLastSeqno       : 299999
appliedLatency         : 0.159
channels               : 10
clusterName            : default
currentEventId         : NONE
currentTimeMillis      : 1369059117597
dataServerHost         : host4
extensions             :
latestEpochNumber      : 0
masterConnectUri       : thl://host1:2112/
masterListenUri        : thl://host4:2112/
maximumStoredSeqNo     : 300000
minimumStoredSeqNo     : 0
offlineRequests        : Offline at sequence number: 300000
pendingError           : NONE
pendingErrorCode       : NONE
pendingErrorEventId    : NONE
pendingErrorSeqno      : -1
pendingExceptionMessage: NONE
pipelineSource         : thl://host1:2112/
relativeLatency        : 399.597
resourcePrecedence     : 99
rmiPort                : 10000
role                   : slave
seqnoType              : java.lang.Long
serviceName            : cookbook
serviceType            : local
simpleServiceName      : cookbook
siteName               : default
sourceId               : host4
state                  : ONLINE
timeInStateSeconds     : 816.999
transitioningTo        :
uptimeSeconds          : 817.76
version                : Tungsten Replicator 2.1.0 build 245
Finished status command...

# node: host4 - service: cookbook
+--------+-----------+-----------------+----------+---------------------+---------------------+
| seqno  | source_id | applied_latency | shard_id | update_timestamp    | extract_timestamp   |
+--------+-----------+-----------------+----------+---------------------+---------------------+
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 300000 | host1     |               0 | load4    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
| 299999 | host1     |               0 | load7    | 2013-05-20 16:05:18 | 2013-05-20 16:05:18 |
+--------+-----------+-----------------+----------+---------------------+---------------------+

Original comment by g.maxia on 20 May 2013 at 2:31

GoogleCodeExporter commented 9 years ago

Original comment by g.maxia on 19 Jun 2013 at 5:27

GoogleCodeExporter commented 9 years ago
I have tested the issue again. What is going on is the following:

1) One of the threads reaches the wanted seqno.
2) The replicator stops replicating (and this is good).
3) All the other threads display a seqno that is one less than the one we indicated for going offline (this is not good).
4) When we run trepctl status, it displays the minimum seqno, not the maximum that was reached (issue #513).
5) Because of the previous item, 'trepctl status' reports that it will stop at seqno X, but since it displays seqno X-1, it still shows the state as ONLINE (and this is really bad).

Summing up, the replicator has gone offline, but it does not know it. So the fix for this issue is to make sure that the replicator looks at the maximum seqno currently in the channels and uses that to determine the status.

Original comment by g.maxia on 20 Jun 2013 at 10:42
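To make the proposed direction concrete, here is a minimal, hypothetical Java sketch (not the replicator's actual code) that decides the deferred-offline point from the maximum seqno reached by any channel, using the per-channel positions from the trep_commit_seqno output above (nine channels at 299999, one at 300000). Whether that is safe with respect to earlier unapplied seqnos is the question raised in the next comment.

import java.util.Arrays;

public class OfflineDecisionSketch
{
    // Deferred-offline target from the cookbook test above.
    static final long OFFLINE_AT_SEQNO = 300000L;

    // The deferred-offline point is considered reached once any channel has
    // applied the target seqno, since no channel will be handed events beyond it.
    static boolean offlinePointReached(long[] appliedSeqnoPerChannel)
    {
        return Arrays.stream(appliedSeqnoPerChannel).max().getAsLong() >= OFFLINE_AT_SEQNO;
    }

    public static void main(String[] args)
    {
        // 10 channels as in the test: nine at 299999, one at 300000.
        long[] channels = {299999, 299999, 299999, 299999, 300000,
                           299999, 299999, 299999, 299999, 299999};
        System.out.println("offline point reached: " + offlinePointReached(channels)); // true
        // Note: this only says the pipeline has stopped; it does not by itself
        // prove that every channel has applied all seqnos below the target.
    }
}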

GoogleCodeExporter commented 9 years ago
But is it safe to go offline when the maximum seqno is reached? Don't we risk other channels not having applied some of the earlier seqnos?

Original comment by linas.vi...@continuent.com on 20 Jun 2013 at 11:37

GoogleCodeExporter commented 9 years ago

Something must be happening inside the replicator that makes it stop pulling transactions from the THL. If you look at "trepctl status -name shards", you will see that all channels are in stage q-to-dbms, but that is what we see when the replicator is no longer pulling data.

If we run 'trepctl offline' on a busy server, the replicator consolidates all 
channels to the latest applied seqno and goes offline.  It seems that in this 
case the replicator is doing the same work, but for some reason it does not 
complete it.

Original comment by g.maxia on 20 Jun 2013 at 12:12

GoogleCodeExporter commented 9 years ago
Moving to 2.1.2

Original comment by linas.vi...@continuent.com on 21 Aug 2013 at 12:59

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 26 Aug 2013 at 1:54

GoogleCodeExporter commented 9 years ago
There won't be a 2.1.3.

Original comment by linas.vi...@continuent.com on 17 Sep 2013 at 10:13

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 18 Dec 2013 at 1:38

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 2 May 2014 at 9:14

GoogleCodeExporter commented 9 years ago
I'm unscheduling this issue because it's not painful and we don't have an urgent need. I'm also assigning it to Robert, as the author of parallel apply.

Original comment by linas.vi...@continuent.com on 10 Sep 2014 at 1:00