adrien-martin opened this issue 3 years ago
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale. I can provide more information if necessary.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale.
Hi @adrien-martin,
I suppose the issue is still present on the latest 3.2 or 3.1 versions, right? Just trying to make sure as the initial report mentions 3.1.3 and 3.2.0.
So you are saying that "the initial sync stops at about 1000-2000 dialogs", but if the active node still receives traffic during this, the number of dialogs should still increase on the passive node as a result of live replication. If this is the case, how did you determine that the sync stopped at that number?
Have you noticed any flip-flops in the states of the nodes (by checking with the clusterer_list MI command)?
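For example (running the MI command through opensips-cli, as done elsewhere in this thread):

opensips-cli -x mi clusterer_list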
Hi @rvlad-patrascu,
The first 1000-2000 dialogs were synced nearly instantly, as opposed to the 100 cps rate of the live traffic; this was determined with opensips-cli -x mi get_statistics active_dialogs.
3.1.7 has the same problem.
3.2.4, on the other hand, does not seem to handle the initial sync at all; dlg_cluster_sync does not work either. clusterer_list returns Up and enabled.
For 3.2.4, I removed the sharing_tag on clusterer (I probably didn't need it in the first place). Now it behaves like 3.1.7 and 2.4.11.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Please do not close.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
@github-actions please unstale.
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Hi @adrien-martin,
Can you test this again on one of the latest 3.1 or 3.2 versions? There were a couple of clusterer fixes that might impact this. Also, if this is easily reproducible, maybe you can even test it on 3.3? There is some useful information that we can gather in 3.3 with the new Status/Report mechanism (by running opensips-cli -x mi sr_list_reports clusterer).
Nevertheless, when reproducing this again, can you grep the logs for all the INFO:clusterer messages? And please post the configuration for the clusterer and dialog modules.
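Something along these lines should do for the grep (assuming OpenSIPS logs to syslog; adjust the log file path for your setup):

grep 'INFO:clusterer' /var/log/syslog   # log path is an example, use your actual OpenSIPS log destination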
Hi @rvlad-patrascu,
With 3.2.7 there are no dead dialogs, but existing dialogs from the active node are not synced.
Here are the logs from when the node was restarted. On the non-active, restarted node (id 2):
Jul 22 11:42:45 node-2 opensips[30486]: INFO:clusterer:mod_init: Clusterer module - initializing
Jul 22 11:42:46 node-2 opensips[30500]: INFO:clusterer:handle_pong: Node [1] is UP
Jul 22 11:42:46 node-2 opensips[30500]: INFO:clusterer:send_sync_req: Sent sync request for capability 'dialog-dlg-repl' to node 1, cluster 1
Jul 22 11:42:47 node-2 opensips[30500]: INFO:clusterer:handle_sync_packet: Received all sync packets for capability 'dialog-dlg-repl' in cluster 1
On the active node (id 1):
Jul 22 11:42:46 node-1 opensips[26773]: INFO:clusterer:handle_pong: Node [2] is UP
Jul 22 11:42:46 node-1 opensips[26773]: INFO:clusterer:handle_sync_request: Received sync request for capability 'dialog-dlg-repl' from node 2, cluster 1
Jul 22 11:42:47 node-1 opensips[26773]: INFO:clusterer:send_sync_repl: Sent all sync packets for capability 'dialog-dlg-repl' to node 2, cluster 1
The configuration is:
loadmodule "clusterer.so"
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "my_node_id", {{ node_id }})
loadmodule "dialog.so"
modparam("topology_hiding", "force_dialog", 1)
modparam("dialog", "default_timeout", 18000)
modparam("dialog", "dialog_replication_cluster", {{ cluster_id }})
The clusterer table has:
id | cluster_id | node_id | url | state | no_tries | description | flags | sip_addr | priority | no_ping_retries
----+------------+---------+-----------------------+-------+----------+------------------+-------+----------+----------+-----------------
38 | 1 | 1 | bin:192.168.0.1:5059 | 1 | 0 | node-1 | seed | | 50 | 3
37 | 1 | 2 | bin:192.168.0.2:5059 | 1 | 0 | node-2 | seed | | 50 | 3
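A quick way to also cross-check the sync state of the dialog replication capability on each node, assuming the clusterer_list_cap MI command is available in these versions (it is not mentioned above, so treat it as a suggestion), would be:

opensips-cli -x mi clusterer_list_cap   # should show the dialog-dlg-repl capability and whether it is synced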
With version 3.3 the same thing happens, but now there is also a sync from the restarted passive node towards the master at the same time.
Besides that, opensips-cli -x mi sr_list_reports clusterer shows that all the dialogs were received, yet these dialogs are not shown by get_statistics active_dialogs.
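For reference, the comparison here is between these two outputs on the restarted node:

opensips-cli -x mi sr_list_reports clusterer
opensips-cli -x mi get_statistics active_dialogs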
On the restarted passive node:
[
{
"Name": "cap:dialog-dlg-repl",
"Reports": [
{
"Timestamp": 1658495793,
"Date": "Fri Jul 22 15:16:33 2022",
"Log": "Sync requested"
},
{
"Timestamp": 1658495794,
"Date": "Fri Jul 22 15:16:34 2022",
"Log": "Sync started from node [1]"
},
{
"Timestamp": 1658495794,
"Date": "Fri Jul 22 15:16:34 2022",
"Log": "Sync completed, received [6002] chunks"
}
]
},
{
"Name": "sharing_tags",
"Reports": []
},
{
"Name": "node_states",
"Reports": [
{
"Timestamp": 1658495794,
"Date": "Fri Jul 22 15:16:34 2022",
"Log": "Node [1], cluster [1] is UP"
}
]
}
]
On the active node:
[
{
"Name": "cap:dialog-dlg-repl",
"Reports": [
{
"Timestamp": 1658495538,
"Date": "Fri Jul 22 15:12:18 2022",
"Log": "Sync requested"
},
{
"Timestamp": 1658495540,
"Date": "Fri Jul 22 15:12:20 2022",
"Log": "Sync completed, received [0] chunks"
},
{
"Timestamp": 1658495793,
"Date": "Fri Jul 22 15:16:33 2022",
"Log": "Sync requested from node [2]"
},
{
"Timestamp": 1658495794,
"Date": "Fri Jul 22 15:16:34 2022",
"Log": "Sync started from node [2]"
},
{
"Timestamp": 1658495794,
"Date": "Fri Jul 22 15:16:34 2022",
"Log": "Sync completed, received [91] chunks"
}
]
},
{
"Name": "sharing_tags",
"Reports": []
},
{
"Name": "node_states",
"Reports": [
{
"Timestamp": 1658495540,
"Date": "Fri Jul 22 15:12:20 2022",
"Log": "Node [2], cluster [1] is UP"
},
{
"Timestamp": 1658495791,
"Date": "Fri Jul 22 15:16:31 2022",
"Log": "Node [2], cluster [1] is DOWN"
},
{
"Timestamp": 1658495793,
"Date": "Fri Jul 22 15:16:33 2022",
"Log": "Node [2], cluster [1] is UP"
}
]
}
]
Hi @adrien-martin,
Can you try again with the latest fixes on 3.2 or 3.3?
Also, note that part of the problem was that the dialogs were not marked with any sharing tag. Indeed, those dialogs should not be discarded after syncing, and this should now be fixed, but I just want to mention that it is recommended to use them. It is usually helpful to control which node takes certain actions (see the docs here).
Hi @rvlad-patrascu,
With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs.
Sometimes the active node loses all dialogs. Sometimes on the passive node the replicated dialogs seem to be dead: they do not disappear and stay in state 4.
sr_list_reports clusterer shows Sync timed out, received [number] chunks, but increasing seed_fallback_interval does not help.
Sorry, I was not aware sharing tags were important, as the main replication was working as expected. I guess I will try the anycast setup.
Is topology_hiding() enough before set_dlg_sharing_tag() if the force_dialog parameter is set?
With two active sharing tags the behaviour looks similar, but I'm not sure it is configured correctly. I added modparam("clusterer", "sharing_tag", "node_1/1=active") / modparam("clusterer", "sharing_tag", "node_2/1=active"), and set_dlg_sharing_tag("node_1") / set_dlg_sharing_tag("node_2") respectively, but sr_list_reports clusterer lists no sharing tags.
Hi @adrien-martin,
> With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs.

So are those 2/3 of dialogs shown by get_statistics active_dialogs? What about the number shown by the Sync completed, received [number] chunks report?

> Sometimes the active node loses all dialogs. Sometimes on the passive node the replicated dialogs seem to be dead, do not disappear and stay with state 4.

Can you post the output of sr_list_reports clusterer for such cases?

> sr_list_reports clusterer shows Sync timed out, received [number] chunks but increasing seed_fallback_interval does not help.

This is not related to seed_fallback_interval; it only tells us that the sync end marker packet has not been received, so the sync process is probably incomplete and has failed.

> Is topology_hiding() enough before set_dlg_sharing_tag() if the force_dialog parameter is set?

Yes, it should be enough.

> With two active sharing tags the behaviour looks similar, but I'm not sure it is configured correctly.

It looks like the sharing tags are in fact configured correctly, so probably sr_list_reports clusterer simply has no reports regarding state changes to show. You can check the tags with the clusterer_list_shtags MI function.
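For example:

opensips-cli -x mi clusterer_list_shtags

This should list each configured sharing tag and its current state on that node.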
Hi @rvlad-patrascu,
> > With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs.
>
> So are those 2/3 of dialogs shown by get_statistics active_dialogs? What about the number shown by the Sync completed, received [number] chunks report?

Yes, and it is the same number in the report.

> > Sometimes the active node loses all dialogs. Sometimes on the passive node the replicated dialogs seem to be dead, do not disappear and stay with state 4.
>
> Can you post the output of sr_list_reports clusterer for such cases?

I think lost dialogs only happen without sharing tags. Both nodes sync at the same timestamp with a small number of chunks (in all tests the second node never generates any dialogs):
{
"Timestamp": 1660744538,
"Date": "Wed Aug 17 15:55:38 2022",
"Log": "Sync started from node [1]"
},
{
"Timestamp": 1660744538,
"Date": "Wed Aug 17 15:55:38 2022",
"Log": "Sync completed, received [91] chunks"
}
With different tags, or the same tag with different states, the node creating dialogs tries to sync too, but usually finds no donor ("Donor node not found, fallback to synced state"). Sometimes the main node syncs a small number of dialogs instead, but does not appear to lose other dialogs in this case. I think the restarted node is considered a donor too early.
Hi @adrien-martin
Can you also test with this modparam set on both nodes?
modparam("dialog", "cluster_auto_sync", 0)
Hi @rvlad-patrascu,
Ah, thanks, I didn't know about this parameter (I'm currently using old versions). With this parameter the dialog-generating node does not try to sync when the other one restarts. But the sync on the restarted node still times out the same way as before.
Hi @adrien-martin
So the actual number of synced dialogs (returned by get_statistics) is the same as shown in that Sync timed out, received [number] chunks report log, right? Can you maybe post debug logs from both nodes?
Hi @rvlad-patrascu,
Sorry, my answer was badly formatted and included in the quote block.
The number of synced dialogs reported by get_statistics and by sr_list_reports clusterer is the same.
I will add debug logs once I find a suitable threshold to prevent debug overload. Right now it has an impact on the replication: some packets are lost.
Here are some debug logs (filtered on the clusterer module; I can provide unfiltered logs if needed): node_1.log node_2.log
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
Let's keep this one open, as I've run into a similar "asymmetric backup node restart" issue as well, where the backup sometimes seems to have more dialogs than the active one following a restart. I'm not sure about the "fewer dialogs" scenario discussed in this topic; maybe that one is possible on my setup as well. Nevertheless, I will try to reproduce the issue in a lab environment and provide more info.
EDIT: in my case, the "extra dialogs on backup" issue is always fixed with a simple dlg_cluster_sync command, run on the backup after I do the restart. Still, things shouldn't work like this...
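That is, something like the following, run on the backup node right after restarting it (same CLI invocation style as earlier in the thread):

opensips-cli -x mi dlg_cluster_sync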
Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.
OpenSIPS version you are running
3.1.3 and 3.2.0 (later also tested on 3.1.7, 3.2.4, 3.2.7 and 3.3, see the discussion above)
Describe the bug
The problem seems to happen with around 50-100 calls per second and 3000-6000 dialogs. The problem happens with even less load when the log level is set to debug.
When a passive node restarts while the active node is under load, the initial sync stops at about 1000-2000 dialogs. The live sync then replicates the new dialogs, but it stabilizes at a different dialog count: there are 50-100 more dialogs than on the active node.
Additional context
All nodes have the seed flag; only one is active at a time. The seed_fallback_interval option is set to 5s.
Expected behavior
Full sync on startup. No dead dialogs.
Relevant System Logs
ok.log
ko.log
OS/environment information