Working on reproducing this issue. Sounds like a race in the one-shot flow where the backfill trigger is missed.
@Crapulax I could use some clarification on the sequence of replications you're running. From the output above, it sounds like a one-shot pull replication. From the discussions on the linked ticket (#1720), you're talking about one-shot push replications. Can you outline the set of replications you're running here?
As you mentioned, this may not be strictly related to the one-shot replication, but it would be helpful to know exactly what replications you're running.
Also, what build of Sync Gateway are you using?
Hello Adam,
Thanks for taking care of this issue.
Note that since we put this in place, and also when we ran into the above race issue, channel access is correctly granted, as all documents sent are accepted by the sync function. I have dug further into the issue and you are fully right: the backfill trigger is missed. I am not sure how the different SG routines interact, but it looks like the channel update with the new information gets lost.
Also, if just after the (Auth: Saved _sync:user:xxx trace) the channels list is updated with the current value, then there is another issue: at that point we store in the Couchbase DB that channel XX sequence = current sequence while, if it is a new channel, old docs still need to be sent. If for whatever reason the SG is stopped at that point, documents are lost. I do not want to sound arrogant, as I have just started digging into the SG source code, but it might be preferable not to store the current sequence for new channels here, but rather the original start sequence, or not to store the channel at all.
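To make the alternative concrete, here is a minimal Go sketch of the bookkeeping being suggested (hypothetical names and types, not Sync Gateway's actual code): recording the feed's current sequence for a brand-new channel claims its older documents were already sent, while recording nothing yet keeps the backfill owed across a restart.

```go
package main

import "fmt"

// channelSince maps channel name -> the sequence up to which the channel's
// documents have already been sent to the user. Hypothetical structure,
// for illustration only.
type channelSince map[string]uint64

// onGrant records a newly granted channel. Recording the feed's current
// sequence (the behavior questioned above) asserts that everything up to
// currentSeq was sent, which is false for a brand-new channel whose older
// documents still need backfilling; recording 0 keeps the backfill owed
// even if SG restarts before it completes.
func onGrant(cs channelSince, channel string, currentSeq uint64, safe bool) {
	if _, exists := cs[channel]; exists {
		return
	}
	if safe {
		cs[channel] = 0 // backfill still owed from the start of the channel
	} else {
		cs[channel] = currentSeq // older docs silently skipped after a restart
	}
}

func main() {
	cs := channelSince{}
	onGrant(cs, "BB=example-channel", 27008, true)
	fmt.Println(cs) // map[BB=example-channel:0] - older docs will still be sent
}
```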
As additional information, we can easily reproduce the issue when the user accepting the invitation has just been created and when there are attachments involved. For attachments, I guess that they are slowing down some functions for the new user. But I have no evidence at all that they are required to reproduce the issue.
If you need anything else or more details, feel free to ask; I will be happy to help you help us :)
Brgds
With the steps above, how are you ensuring that the updated invite document isn't getting sent by the continuous replication before you stop that replication in step 3? Are you actually stopping the replication before issuing the local save?
On your other question (about documents being lost if SG goes down) - if you think that's a legitimate issue, can you provide the detailed steps in a separate ticket? I don't follow all of the permutations, particularly around 'storing things in couchbase DB'. Replication is designed to handle and recover from failed SG nodes, so I don't think it's actually an issue, but can work through the details on another ticket if you like.
Hi Adam,
No, indeed I am not ensuring that the document was not already replicated by the continuous replication. In fact I would have loved to do it this way, but I have not figured out how to do so... I do not recall all the details, but the replication feedback (on _active_tasks) was not letting me know whether or not there were pending documents to replicate. In fact, while writing this, I am a bit embarrassed not to remember why... Oh, I think I get it: there might be, in some circumstances, documents in the queue that are not, or cannot yet be, replicated (a sync function failure for good reasons, for instance, or just pending network access). I cannot then determine whether my invite is part of them or not.
When I run the one-shot replication, whatever the scenario, I know for sure that the doc(s) that were passed to it are replicated.
For the other potential issue I will check whether what I suspect is correct or not. My only difficulty is that the channel updates in the source code seem to be done through many layers of indirection. But looking at the DCP/Bucket logs might help here.
Even without the plumbing around the one-shot replication we ran into the missing channel issue...
We ended up with a new workflow where one server checks whether the granting document has been received by the Couchbase cluster (not Couchbase Lite). Once the client app receives the go-ahead from the server, documents requiring the channel access are updated by the client app to an 'online' status or freely created.
It is a pity to have to put such a complex workflow in place for what looks like the basics of replication.
Brgds, Steve.
The above workaround, with the server checking in Couchbase, does not work either. We still run into the issue. By the way, the issue is still there in SG 1.3.
Attempting to identify whether this is the result of a race in a scenario like the following:
If that's the issue, one-shot replication needs to be changed to not send any sequence values later than the value at the time the channel set was calculated. In the above example, this would result in sequence n+1 not being sent until the second changes request, at which point it would trigger the backfill as expected.
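As a rough illustration of that change (a sketch with hypothetical types, not the actual one-shot code path), the idea is to hold back any change whose sequence is later than the snapshot taken when the channel set was calculated:

```go
package main

import "fmt"

// ChangeEntry is a simplified stand-in for a changes-feed entry; only the
// sequence number matters for this sketch.
type ChangeEntry struct {
	Seq   uint64
	DocID string
}

// capAtSnapshot splits changes into those safe to send now (at or below the
// sequence current when the user's channel set was calculated) and those
// deferred to the next changes request, where the backfill check runs again.
func capAtSnapshot(changes []ChangeEntry, snapshotSeq uint64) (send, deferred []ChangeEntry) {
	for _, c := range changes {
		if c.Seq <= snapshotSeq {
			send = append(send, c)
		} else {
			deferred = append(deferred, c)
		}
	}
	return send, deferred
}

func main() {
	// Channel set calculated at sequence n=10; n+1 arrived afterwards.
	changes := []ChangeEntry{{Seq: 10, DocID: "doc-a"}, {Seq: 11, DocID: "doc-b"}}
	send, deferred := capAtSnapshot(changes, 10)
	fmt.Println("send now:", send, "defer to next request:", deferred)
}
```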
Happy to see the attention you are putting on this issue. I can send you the full logs below if you give me a private address to use.
Just to avoid confusion and make sure we are on the same page: I was able to reproduce the issue, including in 1.3, AND without the one-shot replication. One-shot replication was used as a way to try to work around the issue; it is not used anymore, as we were still running into the issue with it.
Our current implementation, which fails from time to time, is the following:
1) The server creates document DOC1 in SG, granting channel X access to user A. This document is not added to channel X. Channel X already contains many documents.
2) The server creates document DOC2, added to channel X.
3) Client A receives document DOC2 from the server and all documents on channel X (when it works; not all of them when it does not). Client A is using a continuous replication.
4) Client A updates DOC2.
When it works I can see the following in the logs (note the reloading user AND the New channels found after user reload):
1) Server
Line 346488: **** begin DOC1
Line 346493: doc invitedUser_id:f915490c-5ac8-4127-afd8-db99590a2641
Line 346494: doc inviteStatus:created
Line 346496: **** end DOC 1
2) Server
Line 346508: **** begin DOC2
Line 346512: doc inviteStatus:transient
Line 346514: doc inviteCode:
Line 346522: **** end DOC 2 ***
Line 346530: 2016-07-26T10:01:48.986Z Changes+: MultiChangesFeed reloading user &{roleImpl:{Name:f915490c-5ac8-4127-afd8-db99590a2641 ExplicitChannels: Channels_:!:1,TB=--JjbuTII_R53B23wjx2ONp:10865,TB=-U9Of4WH0JBngZ7tQZU_ZmB:10871,TB=-rEJQWDPZW05-_Xzr1Rt9N:10858,TU=28f4c7eb-72f5-472c-b0cf-e2880b72fc05:10871,TU=e86c6e00-59b3-4630-a511-1c969a9bffbf:10858,TU=f2e19d0d-0c07-4889-a664-3cf03d47cc9f:10865,U=139:10854 Sequence:10853 PreviousChannels:} userImplBody:{Email: Disabled:false PasswordHash:[36 50 97 36 49 48 36 65 52 111 79 73 75 108 105 48 51 89 65 85 85 82 50 57 102 112 86 111 101 121 109 86 72 74 115 104 99 117 105 81 108 75 48 89 97 47 100 105 47 113 56 85 103 97 85 87 97 50 56 71] OldPasswordHash:
3) 4) Client
Line 347135: **** begin update DOC2
Line 347139: doc inviteStatus:accepted
Line 347141: doc inviteCode:
Line 347150: **** end update DOC2 ***
When it does not work I can see the following in the logs (note the reloading user is missing after DOC 1&2 get created):
1) Server
Line 341580: **** begin DOC1
Line 341585: doc invitedUser_id:f915490c-5ac8-4127-afd8-db99590a2641
Line 341586: doc inviteStatus:created
Line 341588: **** end DOC1
2) Server
Line 341600: **** begin DOC2
Line 341604: doc inviteStatus:transient
Line 341606: doc inviteCode:
Line 341614: **** end DOC2 ***
3) 4) Client
Line 342143: **** begin update DOC2
Line 342147: doc inviteStatus:accepted
Line 342149: doc inviteCode:
Line 342156: **** end update DOC 2
Line 342206: 2016-07-26T10:00:56.910Z Changes+: MultiChangesFeed reloading user &{roleImpl:{Name:f915490c-5ac8-4127-afd8-db99590a2641 ExplicitChannels: Channels_:!:1,TB=-rEJQWDPZW05-_Xzr1Rt9N:10858,TU=e86c6e00-59b3-4630-a511-1c969a9bffbf:10858,U=139:10854 Sequence:10853 PreviousChannels:} userImplBody:{Email: Disabled:false PasswordHash:[36 50 97 36 49 48 36 65 52 111 79 73 75 108 105 48 51 89 65 85 85 82 50 57 102 112 86 111 101 121 109 86 72 74 115 104 99 117 105 81 108 75 48 89 97 47 100 105 47 113 56 85 103 97 85 87 97 50 56 71] OldPasswordHash:
Brgds, Steve.
@Crapulax That's useful information - thanks for sharing. What version of Couchbase Server are you using? In the scenario as described above, the problem seems to be that the view query that identifies the new channel assignment isn't picking up Doc1 (the document doing the grant).
When running against Couchbase Server 3.0 and later, this shouldn't be possible - Sync Gateway issues a stale=false view query, which forces reindexing and should always see Doc1.
There may be an issue when running against earlier server versions, however - previously stale=false queries had additional requirements that Sync Gateway didn't guarantee.
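To illustrate the semantics, here is a toy model (not Couchbase Server's actual indexing code) of why a true stale=false query can't miss the grant:

```go
package main

import "fmt"

// view is a toy model of a Couchbase view: rows already indexed, plus
// mutations persisted but not yet indexed.
type view struct {
	indexed []string
	pending []string
}

// query with staleFalse set indexes all pending mutations before returning
// rows, so a just-written document (like Doc1, the grant) is always visible;
// a stale read may return before indexing catches up and miss it.
func (v *view) query(staleFalse bool) []string {
	if staleFalse {
		v.indexed = append(v.indexed, v.pending...)
		v.pending = nil
	}
	return v.indexed
}

func main() {
	v := view{indexed: []string{"olderDoc"}, pending: []string{"Doc1 (the grant)"}}
	fmt.Println("stale read  :", v.query(false)) // may miss Doc1
	fmt.Println("stale=false :", v.query(true))  // always sees Doc1
}
```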
Hello Adam,
I do not have the exact version in mind, but it is for sure 4+. If you recall my original email above, the issue was that the triggeredBy flag was not set when it does not work:
Here it works:
Line 13745: 2016-06-11T15:54:16.200Z Cache: getCachedChanges("BB=-AHVbFZo38w8oZpLwHBsi_5", {0 27015 0 0
So, and this is just an intuition, the issue looks to be linked to the query parameters more than the query result.
I am going to spend some time today reproducing the issue and getting some more logs. I will try to produce a curl-based scenario, which will ease your analysis.
Thanks, Steve.
Hi Again,
I am able to reproduce the issue with curl only and basic documents! Note that I had to keep my servers running to ease the reproduction. In this context they were just doing a _changes since on each doc received. I guess that their only impact is to make the SG a bit busier.
Here are all the steps to reproduce it in any environment:
1) Script testSG (you start it with a number greater than the last step executed on the previous run). This script, in a loop executed 30 times:
a) creates 10 users
b) creates 1 user doc per user, granting channel access to user x (U=userx)
c) starts 1 listener on _changes (continuous, timeout 18s) for each user, with the expected number of docs to receive, using the script listenSG
d) creates 10 docs on channel xxx as user1
e) creates 1 granting document as user1 for each user (this doc gives the users created in a) access to channel xxx) and creates 1 document added to channels U=userx and TM=userx
f) listeners started in c) output an error message if the number of documents received is not the expected one
Scripts output: with the above I am having errors all the time during one or more of the 30 iterations. For instance:

waiting for end of users creation
listen for users in step 408
waiting for listen starts
create user docs for step 408
create docs in channel mychannel408
create grant doc for users step 408
Wait for ends of listens
create user step 409
waiting for end of users creation
*** error on user user4069 got 3 docs!!! ***
listen for users in step 409
waiting for listen starts
create user docs for step 409
create docs in channel mychannel409
create grant doc for users step 409
Wait for ends of listens
create user step 410
waiting for end of users creation
listen for users in step 410
waiting for listen starts
create user docs for step 410
create docs in channel mychannel410
create grant doc for users step 410
Wait for ends of listens
create user step 411
*** error on user user4083 got 3 docs!!! ***
*** error on user user4089 got 3 docs!!! ***
*** error on user user4084 got 3 docs!!! ***
*** error on user user4087 got 3 docs!!! ***
*** error on user user4085 got 3 docs!!! ***
*** error on user user4088 got 12 docs!!! ***
*** error on user user4086 got 3 docs!!! ***
*** error on user user4082 got 3 docs!!! ***
waiting for end of users creation
listen for users in step 411
waiting for listen starts
create user docs for step 411
create docs in channel mychannel411
create grant doc for users step 411
Wait for ends of listens
create user step 412
waiting for end of users creation
*** error on user user4097 got 3 docs!!! ***
*** error on user user4092 got 12 docs!!! ***
listen for users in step 412
waiting for listen starts
create user docs for step 412
create docs in channel mychannel412
create grant doc for users step 412
SCRIPTS
TESTSG:
$ more testSG
DB="mydb"
k=$1
end=$(($k + 30))

while test $k -ne $end
do

i=0
echo "create user step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" -X PUT http://localhost:4985/$DB/_user/user$k$i -d "{\"password\":\"user\"}" > /dev/null
i=$(($i + 1))
done

echo 'waiting for end of users creation'
sleep 2

i=0
echo "listen for users in step $k"
while test $i != 10
do
./listenSG user$k$i 13 &
i=$(($i + 1))
done
echo 'waiting for listen starts'

i=0
echo "create user docs for step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docuser$k$i -d "{\"type\":\"testuser\",\"toto\":\"tata\",\"userid\":\"user$k$i\"}" > /dev/null
i=$(($i + 1))
done

i=0
echo "create docs in channel mychannel$k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docstep$k-$i -d "{\"type\":\"test\",\"toto\":\"tata\",\"channel\":\"mychannel$k\"}" > /dev/null
i=$(($i + 1))
done

i=0
echo "create grant doc for users step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docgrntuser$k$i -d "{\"type\":\"testgrant\",\"toto\":\"tata\",\"channel\":\"mychannel$k\",\"userid\":\"user$k$i\"}" > /dev/null
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docinvuser$k$i -d "{\"type\":\"testinvite\",\"toto\":\"tata\",\"channel\":\"mychannel$k\",\"userid\":\"user$k$i\"}" > /dev/null
i=$(($i + 1))
done

echo "Wait for ends of listens"

k=$(($k + 1))
done
listenSG:
DB="mydb" curl -H "Content-type: application/json" -sS -o /tmp/sgres/res$1 --user $1:user -X GET http://localhost:4984/$DB/_changes?feed=continuous&timeout=18000
nbres=grep changes /tmp/sgres/res$1 | wc -l
if test $nbres -ne $2
then
echo "*** error on user $1 got $nbres docs!!! ***"
fi
SG sync function:
function toto(doc, oldDoc) {
  if (doc.type === 'test' || doc.type === 'testinvite' || doc.type === 'testgrant' || doc.type === 'testuser') {
    if (doc.type === 'test') {
      channel('TB=' + doc.channel);
    }
    if (doc.type === 'testgrant') {
      access(doc.userid, 'TB=' + doc.channel);
      access(doc.userid, 'TU=' + doc.userid);
    }
    if (doc.type === 'testuser') {
      channel('U=' + doc.userid);
      access(doc.userid, 'U=' + doc.userid);
    }
    if (doc.type === 'testinvite') {
      channel('TM=' + doc.channel);
      channel('U=' + doc.userid);
    }
    return;
  }
}
@Crapulax I've tried running this curl test against SG 1.3.0/CBS 4.1.1 and haven't been able to reproduce the failure - I've done multiple runs of 30 (including several concurrent runs of 30) and haven't been able to generate the `*** error on user... ***` message. It's possible that my Sync Gateway isn't under the same load as yours - I'm not sure what your comment "Note that i had to keep my servers running to ease the reproduction." implies from a load perspective.
Based on your scripts, I will add a unit test that attempts the same scenario, and run that with golang's race detector enabled to try to shake out any issues. However, if you've got any additional details on your Sync Gateway/CBS setup that I might need in order to reproduce, that would help.
Hello Adam,
I have 5 servers written in Go, connected as ADMIN and listening to _changes. When doing the above tests there was no other activity; the documents with type === test* are discarded by them. It is true that when my servers were stopped I did not reproduce the issue, but I only did 2 or 3 runs (of 30) without them. So concluding whether they are required or not would be risky, but indeed SG load seems to help trigger the issue.
I am currently running all this on an AWS C3 instance, i.e. a machine with several cores, under Ubuntu 14.
I can grant you access to my SG if you give me a private IP to authorize, if that helps. Alternatively, I have kept the full SG logs from when the issue occurred as described in the post above.
Let me know what I can do to help. I really want to get rid of this issue, which will probably be easy to fix once understood.
Brgds, Steve.
PS: note that we can exchange files or whatever through private email; it might ease communication. I will not bother you, and I will forget your email address once the issue is understood.
I'd like to be able to reproduce the issue internally - it's the best way to ensure it's part of our test suite and doesn't get broken in future.
Here's the information that would be helpful from your run, for a failing user[nnn]: the full log from the line
HTTP: #492: GET /mydb/_changes?feed=continuous&timeout=18000 (as user[nnn])
to the line
Changes: MultiChangesFeed done (to user[nnn])
I'm going to be out of the office for a few days, so might not get time to review until early next week. Apologies for the delay.
Hi Adam,
I am sending you grep results on 'user4069|docstep406-' and 'user4088|docstep408-'. I can send the full content, but there are about 7K lines for each...
Here it is, with two failing cases:

CASE 1: for user4069, 3 docs instead of 13
egrep -n 'user4069|docstep406-' /tmp/sgissues
191410:2016-08-23T10:36:47.278Z HTTP: #40880: PUT /tili/_user/user4069 (ADMIN)
191523:2016-08-23T10:36:50.774Z HTTP: #40886: GET /tili/_changes?feed=continuous&timeout=18000 (as user4069)
191526:2016-08-23T10:36:50.775Z Changes: MultiChangesFeed({}, {Since:0 Limit:0 Conflicts:false IncludeDocs:false Wait:true Continuous:true Terminator:0xc822106060 HeartbeatMs:0 TimeoutMs:18000 ActiveOnly:false}) ... (to user4069)
191527:2016-08-23T10:36:50.775Z Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"!":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x1}} ... (to user4069)
191528:2016-08-23T10:36:50.778Z Changes+: MultiChangesFeed sending &{Seq:35855 ID:_user/user4069 Deleted:false Removed:{} Doc:map[] Changes:[] Err:
CASE 2: for user4088, 12 docs instead of 13
egrep -n 'user4088|docstep408-' /tmp/sgissues
195544:2016-08-23T10:37:03.825Z HTTP: #41465: PUT /tili/_user/user4088 (ADMIN)
195561:2016-08-23T10:37:07.048Z Changes+: Notifying that "tili" changed (keys="{_sync:user:user4088}") count=298
195834:2016-08-23T10:37:07.422Z Changes+: MultiChangesFeed sending &{Seq:35955 ID:_user/user4088 Deleted:false Removed:{} Doc:map[] Changes:[map[rev:]] Err:
And the listen results for user4069 and user4088 are:
more /tmp/sgres/resuser4069
{"seq":35855,"id":"_user/user4069","changes":[]}
{"seq":35865,"id":"docuser4069","changes":[{"rev":"1-42c1fd1f6e28f431ccc31c4466d5caea"}]}
{"seq":35895,"id":"docinvuser4069","changes":[{"rev":"1-a81327d77c8153d6fa6de9be0ecc2fa1"}]}
more /tmp/sgres/resuser4088
{"seq":35955,"id":"_user/user4088","changes":[]}
{"seq":"35993:35967","id":"docstep408-0","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35968","id":"docstep408-1","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35969","id":"docstep408-2","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35970","id":"docstep408-3","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35971","id":"docstep408-4","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35972","id":"docstep408-5","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35973","id":"docstep408-6","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35974","id":"docstep408-7","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35975","id":"docstep408-8","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":"35993:35976","id":"docstep408-9","changes":[{"rev":"1-11bf8824d4d3f95114166716a392cbfb"}]}
{"seq":35994,"id":"docinvuser4088","changes":[{"rev":"1-899638aca91b611bf896f3b54f2a7e10"}]}
Enjoy your break!
Hi Again,
I am not able to reproduce the issue if my servers are not running, BUT with the script below executed in parallel with the above test, I can! This script simulates what my servers do (listen and get docs from SG). The issue is then reproducible in a totally isolated environment.
The script takes two parameters ($1: since, $2: number of servers), e.g.:

./serverListen 45000 10
more ./serversListen
DB="xxxx"
if [ -z $3 ]; then
trap "exit" INT TERM
trap 'kill 0' EXIT
echo "starting servers listeners"
echo '$1: since $2 nb servers'
echo "Searching last doc since $1"
realSince=$(curl -H "Content-type: application/json" -sS -X GET http://localhost:4985/$DB/_changes?feed=longpoll\&since=$1 | grep last_seq | cut -d"\"" -f4)
nbServers=1
while test $nbServers != $2
do
./$0 $1 $2 $nbServers $realSince &
nbServers=$(($nbServers + 1))
done
./$0 $1 $2 $nbServers $realSince &
echo "ALl listeners started, CTRL-C to kill all"
sleep 10000
else echo "Server $3 listen started since $4" fi
realSince=$4 echo "Starting servers from $realSince"
i=0 while test $i != 1 do
doc=curl -H "Content-type: application/json" -sS -X GET http://localhost:4985/$DB/_changes?feed=longpoll\&since=$realSince\&limit=3
realSince=echo $doc| grep last_seq| sed 's/.*\"last_seq/\"last_seq/' | cut -d"\"" -f4
docid=echo $doc| cut -d"\"" -f8
if [ -z echo $docid | grep _user
]; then
curl -H "Content-type: application/json" -sS -X GET http://localhost:4985/$DB/$docid > /dev/null
fi
done
The scripts below increase the occurrence of the issue. What is added compared to the above is a GET, by the users, on the documents just created.
You start the scripts:
./serverListen
New script: SGget
$ more SGget
DB="xxxx"
res=""
while test -z "$res"
do
res=$(curl -sS -H "Content-type: application/json" --user $2:user -X GET http://localhost:4984/$DB/$1 | grep -v error)
sleep 0.2
done
Modified script testSG:
$ more testSG
DB="xxxx"
trap "exit" INT TERM
trap 'kill 0' EXIT

k=$1
end=$(($k + 30))

while test $k -ne $end
do

i=0
echo "create user step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" -X PUT http://localhost:4985/$DB/_user/user$k$i -d "{\"password\":\"user\"}" > /dev/null
i=$(($i + 1))
done

echo 'waiting for end of users creation'
sleep 2

i=0
echo "listen for users in step $k"
while test $i != 10
do
./listenSG user$k$i 13 $k-$i &
i=$(($i + 1))
done

i=0
echo "create user docs for step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docuser$k$i -d "{\"type\":\"testuser\",\"toto\":\"tata\",\"userid\":\"user$k$i\"}" > /dev/null
i=$(($i + 1))
done

i=0
echo "create docs in channel mychannel$k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docstep$k-$i -d "{\"type\":\"test\",\"toto\":\"tata\",\"channel\":\"mychannel$k\"}" > /dev/null
./SGget docstep$k-$i user$k$i &
./SGget docstep$k-$i user$k$i &
./SGget docstep$k-$i user$k$i &
./SGget docstep$k-$i user$k$i &
./SGget docstep$k-$i user$k$i &
i=$(($i + 1))
done

i=0
echo "create grant doc for users step $k"
while test $i != 10
do
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docgrntuser$k$i -d "{\"type\":\"testgrant\",\"toto\":\"tata\",\"channel\":\"mychannel$k\",\"userid\":\"user$k$i\"}" > /dev/null
curl -sS -H "Content-type: application/json" --user user1:user1 -X PUT http://localhost:4984/$DB/docinvuser$k$i -d "{\"type\":\"testinvite\",\"toto\":\"tata\",\"channel\":\"mychannel$k\",\"userid\":\"user$k$i\"}" > /dev/null
i=$(($i + 1))
done

echo "Wait for ends of listens"

k=$(($k + 1))
done
@Crapulax I've been working on this, and have been able to reproduce what looks like the same issue using a unit test based on your original script logic. I'm still working on the fix - will keep you posted.
Excellent!
Once you have a possible fix (whatever its form: source code, commit, ...), do not hesitate to let me know. I will be happy to test it with my different scenarios, including in our alpha production environment. I would like to check that what we observed during normal usage is indeed what I was able to reproduce...
Thanks!
@Crapulax I've found one scenario that reproduces your symptoms, but I'm not yet 100% convinced that this is your exact issue.
I've pushed the fix to the feature/issue_1865 branch. If you're able to build from there and let me know how it works for you, I'd appreciate it. If you need any additional assistance getting a build to test with, let me know.
Hello Adam,
I will add the few lines of your fix to my 1.3 version tomorrow and let you know. It will take me one minute, as I can already compile SG. I assume that no other changes are required compared to plain 1.3; if not, let me know and I will get the full package instead.
Crossing fingers!
Hello Adam,
Unfortunately I still run into the issue with my tests and your fix. The issue occurrence rate is identical with or without the fix. I am going to dig into the logs and let you know my findings.
Brgds, Steve.
Hi,
I think I have made progress on the issue. It looks like there is a race condition which can occur between the cache query and the update of the last since value.
When it does not work we can see that the cache is fed JUST AFTER the cache query took place:
Query the cache:
Line 26673: 2016-09-02T12:50:23.739Z Changes+: MultiChangesFeed ,is U=user19014 new channel :%!d(bool=true) for ... (to user19014)
Line 26674: 2016-09-02T12:50:23.739Z Cache: Initialized cache for channel "U=user19014" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
Line 26675: 2016-09-02T12:50:23.739Z Cache: getCachedChanges("U=user19014", {0 82132 0 0
Data added to the cache in the meantime:
Line 26735: 2016-09-02T12:50:23.748Z Cache: #82132 ==> channel "U=user19014"
Line 26738: 2016-09-02T12:50:23.750Z Changes+: Notifying that "xxxx" changed (keys="{*, U=user19013, U=user19014}") count=76
Line 26750: 2016-09-02T12:50:23.751Z Cache: Received #82157 after 3293ms ("docinvuser19014" / "1-291bcbdb8fd551cea61e344bbc8f73d0")
Line 26908: 2016-09-02T12:50:23.759Z Cache: #82157 ==> channel "U=user19014"
Line 26911: 2016-09-02T12:50:23.759Z Changes+: Notifying that "xxxx" changed (keys="{*, TB=mychannel1901, TM=mychannel1901, U=user19010, U=user19011, U=user19012, U=user19013, U=user19014, U=user19019}") count=82
No rows found...
Line 27205: 2016-09-02T12:50:23.936Z Cache: Got no rows from view for "U=user19014"
The next query starts from 82133, i.e. 82132 is lost:
Line 27590: 2016-09-02T12:50:24.027Z Cache: getCachedChanges("U=user19014", {0 82156 0 82147
@Crapulax Is the previous description for a continuous replication? Because a continuous replication handles new channel backfill regardless of sequence number, based on the 'addedChannels' processing: https://github.com/couchbase/sync_gateway/blob/master/db/changes.go#L535
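A minimal sketch of that idea (hypothetical names, not the code at the link above): channels found to be newly added after a user reload are read from sequence 0, backfilling the whole channel, while channels the user already had continue from the current since value.

```go
package main

import "fmt"

// backfillPlan decides, per channel, the starting sequence for the next
// changes iteration: newly granted channels are read from 0 (full
// backfill), existing channels continue from the current "since".
func backfillPlan(userChannels, addedChannels map[string]bool, since uint64) map[string]uint64 {
	plan := make(map[string]uint64)
	for ch := range userChannels {
		if addedChannels[ch] {
			plan[ch] = 0 // backfill the whole channel, regardless of since
		} else {
			plan[ch] = since
		}
	}
	return plan
}

func main() {
	user := map[string]bool{"U=alice": true, "TB=project": true}
	added := map[string]bool{"TB=project": true} // granted mid-replication
	fmt.Println(backfillPlan(user, added, 27007))
}
```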
Also, it's not clear from your log excerpts which docs 82157 and 82132 are. Posting the full logs from line 26673 through 27590 to a gist may help.
I don't think there's a case where newly added channels aren't caught by a continuous replication. I've been trying to find a scenario where notification might be getting missed (so that the changes feed never does the check for added channels) but haven't reproduced your scenario yet.
I would really like to see full logs for a failing changes request from start to finish (non-pruned) - if you can share those in a gist, I think it would help speed this up.
Hello Adam,
1) Yes, it is a continuous replication. I guess I used the word 'since' badly; I should have said last_seq.
Line 24544: 2016-09-02T12:50:17.739Z HTTP: #807: GET /tili/_changes?feed=continuous&timeout=18000 (as user19014)
We are exactly in the block you linked in changes.go.
**break loop waitForChanges on default case (added log):**
Line 26463: 2016-09-02T12:50:23.730Z Changes+: waiter default (to user19014)
**reload user**
Line 26464: 2016-09-02T12:50:23.730Z Changes+: MultiChangesFeed reloading user &{roleImpl:{Name_:user19014 ExplicitChannels_: Channels_:!:1 Sequence_:82122 PreviousChannels_:} userImplBody:{Email_: Disabled_:false PasswordHash_:[36 50 97 36 49 48 36 86 78 117 98 48 104 75 54 112 87 121 89 86 54 50 65 90 107 72 76 117 79 98 73 109 89 89 49 115 76 120 118 98 122 66 88 116 114 57 111 100 90 75 76 67 119 78 72 97 66 57 113 46] OldPasswordHash_:<nil> ExplicitRoles_: RolesSince_: OldExplicitRoles_:[]} auth:0xc82066f720 roles:[]}
Line 26523: 2016-09-02T12:50:23.732Z Cache: Received #82122 ("_user/user19014")
Line 26669: 2016-09-02T12:50:23.739Z Bucket: Update("_sync:user:user19014", 0, ...) --> CAS quit [9.617891ms]
**new channels found**
Line 26670: 2016-09-02T12:50:23.739Z Changes+: New channels found after user reload: {TB=mychannel1901, TU=user19014, U=user19014} for user user19014
Line 26671: 2016-09-02T12:50:23.739Z Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"U=user19014":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x140d4}, "!":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x1}, "TB=mychannel1901":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x140ec}, "TU=user19014":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x140ec}} ... (to user19014)
Line 26672: 2016-09-02T12:50:23.739Z Changes+: MultiChangesFeed lowseq:0 for ... (to user19014)
Line 26673: 2016-09-02T12:50:23.739Z Changes+: MultiChangesFeed ,is U=user19014 new channel :%!d(bool=true) for ... (to user19014)
**get changes from cache and boom**
Line 26674: 2016-09-02T12:50:23.739Z Cache: Initialized cache for channel "U=user19014" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
2) doc #82132 is docuser19014
Line 25070: 2016-09-02T12:50:18.157Z CRUD: Stored doc "docuser19014" / "1-906365ef544d2ab2904177da877464c4"
Line 26430: 2016-09-02T12:50:23.728Z Cache: Received #82132 after 5583ms ("docuser19014" / "1-906365ef544d2ab2904177da877464c4")
Line 26432: 2016-09-02T12:50:23.728Z Cache: Received #82156 after 3545ms ("docgrntuser19014" / "1-ef0eba56d43c9fbcbffb280947466a62")
3) I am going to create a gist with all data
4) I have added some logs to the SG which are usually prefixed with a tab: Changes+:<\t>MultiChangesFeed
5) I am going to verify my theory further, but looking at the code and the logs, what I see is this (look at these lines):
SG searches for docs in channel "U=user19014".
The cache gets the doc added to the channel while the search has already started (Cache: #82132 ==> channel "U=user19014").
SG finds no doc in the channel (Cache: GetChangesInChannel("U=user19014") --> 0 rows).
Line 26674: 2016-09-02T12:50:23.739Z Cache: Initialized cache for channel "U=user19014" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
Line 26675: 2016-09-02T12:50:23.739Z Cache: getCachedChanges("U=user19014", {0 82132 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 0 changes valid from #82065
Line 26676: 2016-09-02T12:50:23.742Z Cache: Querying 'channels' view for "U=user19014" (start=#1, end=#82065, limit=0)
Line 26735: 2016-09-02T12:50:23.748Z Cache: #82132 ==> channel "U=user19014"
Line 26738: 2016-09-02T12:50:23.750Z Changes+: Notifying that "tili" changed (keys="{*, U=user19013, U=user19014}") count=76
Line 26750: 2016-09-02T12:50:23.751Z Cache: Received #82157 after 3293ms ("docinvuser19014" / "1-291bcbdb8fd551cea61e344bbc8f73d0")
Line 26908: 2016-09-02T12:50:23.759Z Cache: #82157 ==> channel "U=user19014"
Line 27205: 2016-09-02T12:50:23.936Z Cache: Got no rows from view for "U=user19014"
Line 27206: 2016-09-02T12:50:23.936Z Cache: GetChangesInChannel("U=user19014") --> **0 rows**
For user19013, who got all its documents, I have:
The cache gets the doc added to the channel BEFORE the search has started (Cache: #82131 ==> channel "U=user19013").
SG searches for docs in channel "U=user19013".
SG finds docs in the channel (Cache: GetChangesInChannel("U=user19013") --> 2 rows).
Cache is filled
Line 26725: 2016-09-02T12:50:23.748Z Cache: Initialized cache for channel "U=user19013" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
Line 26726: 2016-09-02T12:50:23.748Z Cache: #82131 ==> channel "U=user19013"
Line 26902: 2016-09-02T12:50:23.759Z Cache: #82155 ==> channel "U=user19013"
Then the getCache which returns the 2 rows:
Line 27868: 2016-09-02T12:50:24.037Z Cache: getCachedChanges("U=user19013", {0 82131 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 2 changes valid from #82065
Line 27869: 2016-09-02T12:50:24.037Z Cache: Querying 'channels' view for "U=user19013" (start=#1, end=#82065, limit=0)
Line 28284: 2016-09-02T12:50:24.069Z Cache: Got no rows from view for "U=user19013"
Line 28285: 2016-09-02T12:50:24.069Z Cache: GetChangesInChannel("U=user19013") --> **2 rows**
Line 28287: 2016-09-02T12:50:24.069Z Changes+: Sending seq:82131 from channel U=user19013
Line 28288: 2016-09-02T12:50:24.069Z Changes+: Sending seq:82155 from channel U=user19013
gist: https://gist.github.com/Crapulax/c9ec4962d46193b6c84e1afa74a3d9b8
In these logs there are errors on user19014 (12 docs instead of 13):
more /tmp/sgres/resuser19014
{"seq":82122,"id":"_user/user19014","changes":[]}
{"seq":"82156:82138","id":"docstep1901-0","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82139","id":"docstep1901-1","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82140","id":"docstep1901-2","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82141","id":"docstep1901-3","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82142","id":"docstep1901-4","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82143","id":"docstep1901-5","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82144","id":"docstep1901-6","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82145","id":"docstep1901-7","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82146","id":"docstep1901-8","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82156:82147","id":"docstep1901-9","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":82157,"id":"docinvuser19014","changes":[{"rev":"1-291bcbdb8fd551cea61e344bbc8f73d0"}]}
user19011 (3 docs instead of 13):
more /tmp/sgres/resuser19011
{"seq":82119,"id":"_user/user19011","changes":[]}
{"seq":82129,"id":"docuser19011","changes":[{"rev":"1-83335033104e795909df5fadd14da200"}]}
{"seq":82151,"id":"docinvuser19011","changes":[{"rev":"1-7ac90cb9eabac7ad418e5d05d0c04d8c"}]}
user19019 (12 docs instead of 13):
more /tmp/sgres/resuser19019
{"seq":82127,"id":"_user/user19019","changes":[]}
{"seq":"82166:82138","id":"docstep1901-0","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82139","id":"docstep1901-1","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82140","id":"docstep1901-2","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82141","id":"docstep1901-3","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82142","id":"docstep1901-4","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82143","id":"docstep1901-5","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82144","id":"docstep1901-6","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82145","id":"docstep1901-7","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82146","id":"docstep1901-8","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82166:82147","id":"docstep1901-9","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
Full expected content for user19013:
more /tmp/sgres/resuser19013
{"seq":82121,"id":"_user/user19013","changes":[]}
{"seq":82131,"id":"docuser19013","changes":[{"rev":"1-5166e5db4bd266f7617c6db4eda26905"}]}
{"seq":"82154:82138","id":"docstep1901-0","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82139","id":"docstep1901-1","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82140","id":"docstep1901-2","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82141","id":"docstep1901-3","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82142","id":"docstep1901-4","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82143","id":"docstep1901-5","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82144","id":"docstep1901-6","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82145","id":"docstep1901-7","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82146","id":"docstep1901-8","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":"82154:82147","id":"docstep1901-9","changes":[{"rev":"1-30671085d1e9fa25056e2ec92fcfa6bf"}]}
{"seq":82155,"id":"docinvuser19013","changes":[{"rev":"1-c25200565149dec351f6197776c51caf"}]}
{"seq":82167,"id":"docinvuser19019","changes":[{"rev":"1-aa1c47dc7289ddff1312f9a9e6f16807"}]}
I checked the above theory on user19011, who got only 3 docs and lost all the docstep documents. Looking at the log I see the following (where doc #82150 is the granting doc docgrntuser19011 for user19011 - I use this # to identify users in the getCachedChanges requests, via the cacheValidFrom var):
Three 'users' make a getCachedChanges call. Two of them get their documents through the 2nd getCachedChanges mechanism, while the first user (user19011) gets nothing. user19011 takes the view lock and calls the view; the result is 0 rows. In the meantime the cache gets filled, and when the two other users obtain the view lock they get the docs from the second chance. Again, as in the user19014 case, the missing docs #82138 to #82147 were added to the cache between the getCachedChanges call and the end of the function with the GetChangesInChannel log.
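To make the interleaving concrete, here is a toy Go reproduction of the pattern (heavily simplified and hypothetical; not SG's actual cache code): the first reader snapshots an empty cache and holds the view lock while its query runs, the cache fills in the meantime, and only later readers see the entries.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var viewLock sync.Mutex
	cache := []uint64{}
	snapshot := len(cache) // first reader: cached changes at query time = 0

	viewLock.Lock() // first reader holds the view lock during its query
	done := make(chan struct{})
	go func() {
		viewLock.Lock() // a later reader blocks on the same lock...
		fmt.Println("later reader sees", len(cache), "cached rows") // 10
		viewLock.Unlock()
		close(done)
	}()

	// While the "view query" is in flight, docs #82138-82147 land in the cache.
	for s := uint64(82138); s <= 82147; s++ {
		cache = append(cache, s)
	}

	fmt.Println("first reader got", snapshot, "rows") // 0: the entries are missed
	viewLock.Unlock()
	<-done
}
```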
Extract obtained with a regex in Notepad++:
GetChangesInChannel\("TB|getCachedChanges\("TB|channel "TB=mychannel1901"|"TB=mychannel1901"[ $]
Query the cache from user19011
Line 26683: 2016-09-02T12:50:23.743Z Cache: Initialized cache for channel "TB=mychannel1901" with options: &{ChannelCacheMinLength:50 ChannelCacheMaxLength:500 ChannelCacheAge:1m0s}
Line 26684: 2016-09-02T12:50:23.743Z Cache: getCachedChanges("TB=mychannel1901", {0 82150 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 0 changes valid from #82065
user19011 locks the view
Line 26685: 2016-09-02T12:50:23.743Z Cache: Querying 'channels' view for "TB=mychannel1901" (start=#1, end=#82065, limit=0)
Users ('82162' and '82154') are blocked waiting on the view lock
Line 26693: 2016-09-02T12:50:23.743Z Cache: getCachedChanges("TB=mychannel1901", {0 82162 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 0 changes valid from #82065
Line 26838: 2016-09-02T12:50:23.758Z Cache: getCachedChanges("TB=mychannel1901", {0 82154 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 0 changes valid from #82065
docs added to the cache in the meantime
Line 26849: 2016-09-02T12:50:23.758Z Cache: #82138 ==> channel "TB=mychannel1901"
Line 26852: 2016-09-02T12:50:23.758Z Cache: #82139 ==> channel "TB=mychannel1901"
Line 26855: 2016-09-02T12:50:23.758Z Cache: #82140 ==> channel "TB=mychannel1901"
Line 26858: 2016-09-02T12:50:23.758Z Cache: #82141 ==> channel "TB=mychannel1901"
Line 26861: 2016-09-02T12:50:23.758Z Cache: #82142 ==> channel "TB=mychannel1901"
Line 26864: 2016-09-02T12:50:23.758Z Cache: #82143 ==> channel "TB=mychannel1901"
Line 26867: 2016-09-02T12:50:23.758Z Cache: #82144 ==> channel "TB=mychannel1901"
Line 26870: 2016-09-02T12:50:23.758Z Cache: #82145 ==> channel "TB=mychannel1901"
Line 26873: 2016-09-02T12:50:23.758Z Cache: #82146 ==> channel "TB=mychannel1901"
Line 26876: 2016-09-02T12:50:23.759Z Cache: #82147 ==> channel "TB=mychannel1901"
getting results for some other users, view lock was released
Line 26968: 2016-09-02T12:50:23.763Z Cache: getCachedChanges("TB=mychannel1901", {0 82148 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 10 changes valid from #82065
Line 27209: 2016-09-02T12:50:23.936Z Cache: getCachedChanges("TB=mychannel1901", {0 82156 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 10 changes valid from #82065
game over for user19011
Line 27212: 2016-09-02T12:50:23.939Z Cache: GetChangesInChannel("TB=mychannel1901") --> 0 rows
the lucky users ('82162' and '82154') get unblocked and get their docs
Line 27217: 2016-09-02T12:50:23.939Z Cache: 2nd getCachedChanges("TB=mychannel1901", {0 82162 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) got 10 more, valid from #1!
Line 27222: 2016-09-02T12:50:23.939Z Cache: 2nd getCachedChanges("TB=mychannel1901", {0 82154 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) got 10 more, valid from #1!
Line 27245: 2016-09-02T12:50:23.941Z Cache: getCachedChanges("TB=mychannel1901", {0 82160 0 0 <nil> <nil> %!d(string=) 0 0 0 %!d(string=)}) --> 10 changes valid from #1
I think I am beginning to understand this.
And now I have a correction!! Well, it is probably more a kind of hack, but at least I do not have missing documents anymore... :) Going to make a pull request for it. Basically, the options.Since in the changes.go main loop is applied to all channels. But as stated above, because of the deferred pending queue, you might not have processed the document you believe you have processed. My hack consists of:
Yes!
Well, I messed up the pull request. I will let you clean it up, but you have all the changes in the db.zip file.
After further analysis of the logs, I think this is actually just the same issue we discussed originally - we've finally got detailed logs to illustrate it.
Here are the relevant timestamps for user19011 from the logs:
The 10 docs in TB=mychannel1901 are assigned sequences #82138-82147.
12:50:23.743Z - user19011's channels expand to include U=user19011, TB=mychannel1901
12:50:23.743Z - get changes for channel TB=mychannel1901
... at this point the cache is only up to 82131, so no docs are found for this channel...
Now the docs we're interested in get cached:
12:50:23.758Z Cache: #82138 ==> channel "TB=mychannel1901"
...
12:50:23.759Z Cache: #82151 ==> channel "U=user19011"
12:50:23.939Z - get changes for channel U=user19011
... this finds the recently cached #82151, and rolls the sequence value ahead, which means we don't go back and retrieve the entries from TB=mychannel1901 in a subsequent iteration.
I think the correct fix is to identify the latest cached sequence value before processing any channels, and ensure no values are returned in that iteration exceeding that value. I'll work on that fix, and get you a test build as soon as I can.
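As a rough sketch of that fix (hypothetical names, not the final code): capture one stable high-watermark before querying any channel, and never send or advance past it within the iteration, so a sequence cached mid-iteration for one channel cannot push the shared since past entries another channel has not returned yet.

```go
package main

import "fmt"

// feedIteration processes every channel against a single high-watermark
// captured before any channel is queried. Entries above the watermark are
// held back for the next iteration, and "since" never advances past it.
func feedIteration(channels map[string][]uint64, since, stableSeq uint64) (sent []uint64, nextSince uint64) {
	nextSince = since
	for _, seqs := range channels {
		for _, s := range seqs {
			if s > since && s <= stableSeq { // cap at the snapshot
				sent = append(sent, s)
				if s > nextSince {
					nextSince = s
				}
			}
		}
	}
	return sent, nextSince
}

func main() {
	channels := map[string][]uint64{
		"TB=mychannel1901": {82138, 82147}, // cached after the snapshot was taken
		"U=user19011":      {82151},        // also cached after the snapshot
	}
	// Watermark captured at 82131: all late arrivals are held back, and
	// since stays at 82131, so the next iteration retrieves everything.
	sent, next := feedIteration(channels, 82131, 82131)
	fmt.Println("sent:", sent, "next since:", next) // sent: [] next since: 82131
}
```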
Hello Adam,
The deferred pending queue and the cache work as expected, from my findings. I have not verified it fully, but indeed the wait loop breaks because of user updates; I was not seeing the notify from the cache. The user reload uses the triggeredBy #seqid, which can still be in the pending log queue. If docs are still in the pending queue, getCachedChanges can return 0 rows. Now imagine that at this point the deferred queue is moved to the cache, and another channel in the SAME main loop does a getCachedChanges and gets its document. When the feed aggregation is computed, the options.Since (shared by all channels) gets set to a sequence value bigger than the sequence of the document missed by the first getCachedChanges on the other channel. End of the game: the document will never be sent.
Our mails crossed each other :) Yes! And this is somehow what I did in my 'fix': reducing the since to the max cached sequence and keeping track of the max since per channel. Anyway, I am looking forward to the real fix and happy to validate it! I do not know how long this issue has existed, but it was a nice one :)
Hi Again,
While running with my 'correction' installed on our alpha system, we have run into the missing doc issue again. I fully understand that this is only my fix and it might not correctly resolve the above issue. In the meantime, it is worth noting that this time the issue looks a bit different: what is happening is that the channel expansion does not flag the newly granted channel as new... and therefore the backfill does not take place. Note that this might also be a regression from what I did... Anyway, I wanted to let you know; on my side I will dig further into the logs to figure out more precisely what happens.
Thanks, Steve.
Forget what I wrote about the expansion; that is not the point...
@Crapulax I've pushed what should be a fix for the multi-channel race to the feature/issue_1865 branch. It's difficult to exactly reproduce your issue in a unit test - we don't yet have support for injecting artificial caching delays in the unit test framework, and running under the race detector doesn't reliably reproduce the issue. However, I've validated the fix based on what I think is an equivalent scenario.
If you're able to retest with the latest on the branch, I'm keen to hear the results.
Thanks.
Hello Adam,
I am going to test it right now. I keep you updated.
Brgds, Steve.
Hello Adam,
It looks better, but I had one error. With 1.3.1 I was getting 5 to 10 errors per 30 iterations... With your fix I had 1 error over four runs of 30 iterations, on user40439: all docstep docs are missing. There is probably a case not covered; I need to investigate the logs (tomorrow) to figure out what that case is. I created a gist of the logs (https://gist.github.com/Crapulax/27973329734d6d9e6eee69a3b36214f8) if you want to have a look.

create user docs for step 4045
create docs in channel mychannel4045
create grant doc for users step 4045
*** error on user user40439 got 3 docs!!! ***
Wait for ends of listens
create user step 4046
waiting for end of users creation
more /tmp/sgres/resuser40439
{"seq":97358,"id":"_user/user40439","changes":[]}
{"seq":97368,"id":"docuser40439","changes":[{"rev":"1-0486b53025f989fcd079ef275e689012"}]}
{"seq":97398,"id":"docinvuser40439","changes":[{"rev":"1-5dcb0589e60836600758756284dc9db6"}]}
Thanks again for your responsiveness; we are progressing!
I had a look at the log and I do not see the channel expansion, for user40439, on channel 'mychannel4043'. If I recall correctly, I was always seeing all channels for a given user in the expansion. This looks odd.
Thanks for the additional logs - I'll take a look.
Take it for what it's worth, just an intuition, but it might be that _sync:user updates are not correctly handled. Also, this issue looks a bit similar to what we had yesterday with my 'fix'; it could be a second issue... While looking at the logs I was running more test iterations, with no additional errors...
In order of appearance.
1) getting doc which grants access:
Line 10641: 2016-09-06T22:20:51.259Z CRUD+: Invoking sync on doc "docgrntuser40439" rev 1-08f743da4e5889e9ae68f33c9cf01903
Line 10643: 2016-09-06T22:20:51.263Z Access: Doc "docgrntuser40439" grants channel access: map[user40439:TB=mychannel4043:97397,TU=user40439:97397]
2) the doc is received in the cache and the user is reloaded, but there is no reference to mychannel4043 in the user reload, while it is present in the notify:
Line 12343: 2016-09-06T22:20:55.292Z Cache: Received #97397 after 4029ms ("docgrntuser40439" / "1-08f743da4e5889e9ae68f33c9cf01903")
Line 12478: 2016-09-06T22:20:55.299Z Cache: #97398 ==> channel "U=user40439"
Line 12481: 2016-09-06T22:20:55.299Z Changes+: Notifying that "xxxx" changed (keys="{*, TM=mychannel4043, U=user40435, U=user40436, U=user40437, U=user40438, U=user40439}") count=1762
Line 12493: 2016-09-06T22:20:55.300Z Changes+: MultiChangesFeed: channels expand to channels.TimedSet{"!":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x1}, "U=user40439":channels.VbSequence{VbNo:(*uint16)(nil), Sequence:0x17c58}} ... (to user40439)
Line 12495: 2016-09-06T22:20:55.300Z Cache: getCachedChanges("U=user40439", 97368) --> 1 changes valid from #97369
Line 12496: 2016-09-06T22:20:55.300Z Changes+: Sending seq:97398 from channel U=user40439
I am tired and going back home. But looking at the traces I just sent you, we should in fact have: Cache: #97397 ==> channel 'mychannel4043'
> But having a look to traces I just sent you, we should have in fact: Cache: #97397 ==> channel 'mychannel4043'
I don't think so - #97397 is the grant document, which doesn't get assigned to any channels (based on the latest sync function you shared).
I spotted a potential race in the user change count calculation and pushed the fix to feature/issue_1865, but I don't think it's actually what's going on for user40439 - it's more likely to occur when there's a race between user doc changes, and the initial changes request.
Hello,
Before using your additional fix, I wanted to reproduce yesterday's issue to get a reference error ratio/occurrence rate. Unfortunately, while I ran tests for several hours this morning, I had no errors at all, which is both good and bad. Bad because it prevents us from confirming that the 'second' issue is corrected too.
Anyway, I will install your fix in alpha and run the tests. We will see if: 1) the tests stay OK; 2) we again run into the alpha issue with missing docs in real cases.
I will send you a status by end of today.
Brgds, Steve.
Good news, lucky guy: while preparing the package I finally got an error. So if I have no error at all by the end of today (Paris time) with the latest fix, we can reasonably assume that we are fine!
I got two errors:

create docs in channel mychannel5178
create grant doc for users step 5178
*** error on user user51768 got 3 docs!!! ***
Wait for ends of listens
create user step 5179
...
Wait for ends of listens
create user step 5221
*** error on user user52185 got 12 docs!!! ***
waiting for end of users creation
listen for users in step 5221
create user docs for step 5221
Hello Adam,
I plan to analyze the logs this weekend; I need to progress on my development. There are two or three things I have not understood in the SG code which make the analysis of the logs tricky for me. I am sure it would not be a big deal for you to give me some tips to understand them. I will send you my questions this week; if you are kind enough to answer them, it will be helpful. I also have to reproduce the issue with additional logging (mostly with the user who performs the action) to ease log reading and grepping.
If you are fine with this, I can create another issue, because I consider this one closed with your fix. Please let me know if you agree.
And again, thanks for your responsiveness; it is really appreciated.
Steve
Hello,
When you one-shot replicate a granting-access document, you might not receive, on resumed replications, all documents that have been in the channel since the channel was created. This issue is critical. For anyone who might ask: we do a one-shot replication because of issue https://github.com/couchbase/sync_gateway/issues/1720#issuecomment-214552671. I can send full logs privately. As a side note, although I have no evidence of it, this might also happen without the one-shot replication: we did observe such a missing doc issue in the past, while not yet using the one-shot for granting documents.
In substance: when the channel change notification arrives after the user reload in the one-shot replication feed, the information saying that this channel is new to the user is lost forever. In file db/changes.go:

func (db *Database) SimpleMultiChangesFeed(chans base.Set, options ChangesOptions) (<-chan ChangeEntry, error) {
    ...
    waitForChanges
    ...
    // if the notifications for the channel changes have already arrived it is ok, otherwise BOOM, no change found
    userChangeCount, addedChannels, err = db.checkForUserUpdates(userChangeCount, changeWaiter, to)
    ...
}
Extract from the logs (SG 1.2.0 with a few additional traces):

1) User 31dc30d3-2017-41f6-a5a5-e7c5f0e0e847 creates document -AHVbFZo38w8oZpLwHBsi_5
Line 10901: doc id:-AHVbFZo38w8oZpLwHBsi_5
Line 10904: doc Owner_id:31dc30d3-2017-41f6-a5a5-e7c5f0e0e847
// Note the first document sequence for this channel: 26998
Line 10915: 2016-06-11T15:51:50.085Z Cache: SAVING #26998
2) One-shot replication of the granting document from user 0ce0266a-f159-4fd7-8f53-9ee86952809f:
// Open one-shot replication (session id = 65; this is my added counter in the logs; user 0ce0266a-f159-4fd7-8f53-9ee86952809f)
Line 12406: 2016-06-11T15:52:47.304Z Changes: MultiChangesFeed({*}, {Since:27007 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc820c8f500 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
// Check accessible channels; so far so good:
Line 12411: 2016-06-11T15:52:47.307Z Changes+: MultiChangesFeed: handle channel ! options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x697f, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820c8f500), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 1 ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
Line 12419: 2016-06-11T15:52:47.314Z Changes+: MultiChangesFeed: handle channel 0ce0266a-f159-4fd7-8f53-9ee86952809f options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x697f, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820c8f500), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27005 ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
Line 12423: 2016-06-11T15:52:47.322Z Changes+: MultiChangesFeed: handle channel MU=0ce0266a-f159-4fd7-8f53-9ee86952809f options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x697f, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820c8f500), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27005 ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
Line 12427: 2016-06-11T15:52:47.323Z Changes+: MultiChangesFeed: handle channel BI=d42059482b607e73fd77184611e6795d options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x697f, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820c8f500), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27007 ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
// Going into wait mode
Line 12428: 2016-06-11T15:52:47.324Z Changes+: MultiChangesFeed waiting... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
Line 12446: 2016-06-11T15:52:47.706Z Access: Doc "d42059482b607e73fd77184611e6795d" grants channel access: map[0ce0266a-f159-4fd7-8f53-9ee86952809f:BB=-AHVbFZo38w8oZpLwHBsi_5:27008,BI=d42059482b607e73fd77184611e6795d:27007 31dc30d3-2017-41f6-a5a5-e7c5f0e0e847:BI=d42059482b607e73fd77184611e6795d:27007]
Line 12449: 2016-06-11T15:52:47.707Z Cache: SAVING #27008
Line 12450: 2016-06-11T15:52:47.707Z Access: Invalidate access of "0ce0266a-f159-4fd7-8f53-9ee86952809f"
Line 12453: 2016-06-11T15:52:47.708Z Auth: Saved _sync:user:0ce0266a-f159-4fd7-8f53-9ee86952809f: {"name":"0ce0266a-f159-4fd7-8f53-9ee86952809f","all_channels":null,"sequence":27004,"passwordhash_bcrypt":"JDJhJDEwJFJIYnlCd0tTOHJtNXZ5SjJKV3JYUC5XcFd6bTJsRTUwSzJZNW8xam4xM3VTYlB4TWN5Vk5l","explicit_roles":{"public":27004},"rolesSince":{"public":27004}}
Line 12454: 2016-06-11T15:52:47.785Z Access: Computed channels for "0ce0266a-f159-4fd7-8f53-9ee86952809f": !:1,0ce0266a-f159-4fd7-8f53-9ee86952809f:27005,BB=-AHVbFZo38w8oZpLwHBsi_5:27008,BI=d42059482b607e73fd77184611e6795d:27007,MU=0ce0266a-f159-4fd7-8f53-9ee86952809f:27005
Line 12544: 2016-06-11T15:52:48.049Z Changes: MultiChangesFeed done, reason is Continuous %!d(bool=false), sentSomething %!d(bool=true) ... (to 65 0ce0266a-f159-4fd7-8f53-9ee86952809f)
// Here is the ISSUE: the newly created channel gets a start sequence of 27008!!! The document at sequence 26998 will never be sent...
Line 12563: 2016-06-11T15:52:48.298Z Changes+: MultiChangesFeed: handle channel BB=-AHVbFZo38w8oZpLwHBsi_5 options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6980, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc821469bc0), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27008 ... (to 71 0ce0266a-f159-4fd7-8f53-9ee86952809f)
3) The exact same scenario with user ff83e059-12f0-4d24-b30d-db6dab317c78, but this time the channel notification gets sent before the user reload and the channel is handled correctly:
Line 13584: 2016-06-11T15:54:10.476Z Changes: MultiChangesFeed({*}, {Since:27014 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc821c76000 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13592: 2016-06-11T15:54:10.482Z Changes+: MultiChangesFeed: handle channel MU=ff83e059-12f0-4d24-b30d-db6dab317c78 options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6986, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc821c76000), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27012 ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13596: 2016-06-11T15:54:10.491Z Changes+: MultiChangesFeed: handle channel BI=368b028f4a0e91576a8e50cf0f4abcca options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6986, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc821c76000), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27014 ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13600: 2016-06-11T15:54:10.492Z Changes+: MultiChangesFeed: handle channel ff83e059-12f0-4d24-b30d-db6dab317c78 options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6986, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc821c76000), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27012 ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13605: 2016-06-11T15:54:10.492Z Changes+: MultiChangesFeed: handle channel ! options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6986, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc821c76000), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 1 ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13606: 2016-06-11T15:54:10.493Z Changes+: MultiChangesFeed waiting... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13612: ******* begin INVITE
Line 13614: doc id:368b028f4a0e91576a8e50cf0f4abcca
Line 13615: doc xxxChannel:-AHVbFZo38w8oZpLwHBsi_5
//End of one shot replication
Line 13828: 2016-06-11T15:54:16.240Z Changes: MultiChangesFeed done, reason is Continuous %!d(bool=false), sentSomething %!d(bool=true) ... (to 92 ff83e059-12f0-4d24-b30d-db6dab317c78)
// New session
Line 13836: 2016-06-11T15:54:16.482Z Changes: MultiChangesFeed({*}, {Since:27017 Limit:0 Conflicts:true IncludeDocs:false Wait:true Continuous:false Terminator:0xc820a571a0 HeartbeatMs:300000 TimeoutMs:300000 ActiveOnly:false}) ... (to 98 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13840: 2016-06-11T15:54:16.489Z Changes+: MultiChangesFeed: handle channel MU=ff83e059-12f0-4d24-b30d-db6dab317c78 options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6989, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820a571a0), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27012 ... (to 98 ff83e059-12f0-4d24-b30d-db6dab317c78)
Line 13844: 2016-06-11T15:54:16.491Z Changes+: MultiChangesFeed: handle channel BB=-AHVbFZo38w8oZpLwHBsi_5 options are db.ChangesOptions{Since:db.SequenceID{SeqType:1, TriggeredBy:0x0, LowSeq:0x0, Seq:0x6989, Clock:base.SequenceClock(nil), TriggeredByClock:base.SequenceClock(nil), ClockHash:"", SequenceHasher:(*db.sequenceHasher)(nil), vbNo:0x0, TriggeredByVbNo:0x0, LowHash:""}, Limit:0, Conflicts:true, IncludeDocs:false, Wait:false, Continuous:false, Terminator:(chan bool)(0xc820a571a0), HeartbeatMs:0x493e0, TimeoutMs:0x493e0, ActiveOnly:false} seAdded is 27015 ... (to 98 ff83e059-12f0-4d24-b30d-db6dab317c78)
…