OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org

[BUG] Incomplete dialog sync when passive node restarts during active traffic #2574

Open adrien-martin opened 3 years ago

adrien-martin commented 3 years ago

OpenSIPS version you are running

version: opensips 2.4.11 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 585f1b6
main.c compiled on  with gcc 4.9.2

and

version: opensips 3.1.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on  with gcc 8

and

version: opensips 3.2.0-rc1 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
main.c compiled on  with gcc 8

Describe the bug

The problem seems to happen with around 50-100 calls per second and 3000-6000 dialogs. It happens with even less load when the log level is set to debug.

When a passive node restarts while the active node is under load, the initial sync stops at about 1000-2000 dialogs. The live sync then replicates new dialogs, but the count stabilizes at a different number: the passive node ends up with 50-100 more dialogs than the active node.

Additional context

All nodes have the seed flag, and only one node is active at a time. The seed_fallback_interval option is set to 5s.
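
For reference, the script-side part of that setup reduces to something like the following minimal sketch (node IDs are placeholders; the seed flag itself is set per node in the clusterer provisioning table, not in the script):

loadmodule "clusterer.so"
# the node id differs on each node
modparam("clusterer", "my_node_id", 1)
# as I understand the parameter: a seed node that gets no sync within 5s
# falls back to considering itself synced
modparam("clusterer", "seed_fallback_interval", 5)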

Expected behavior

Full sync on startup. No dead dialogs.

Relevant System Logs

ok.log

ko.log

OS/environment information

github-actions[bot] commented 3 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 3 years ago

@github-actions please unstale. I can provide more information if necessary.

github-actions[bot] commented 3 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 3 years ago

@github-actions please unstale.

github-actions[bot] commented 3 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 3 years ago

@github-actions please unstale.

github-actions[bot] commented 3 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 3 years ago

@github-actions please unstale.

stale[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 2 years ago

@github-actions please unstale.

rvlad-patrascu commented 2 years ago

Hi @adrien-martin,

I suppose the issue is still present on the latest 3.2 or 3.1 versions, right? Just trying to make sure as the initial report mentions 3.1.3 and 3.2.0.

So you are saying that "the initial sync stops at about 1000-2000 dialogs", but if the active node still receives traffic during this, the number of dialogs should still increase on the passive node as a result of live replication. If this is the case, how did you determine that the sync stopped at that number?

Have you noticed any flip flops in the states of the nodes (by checking with clusterer_list MI command)?

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

The first 1000-2000 dialogs were synced nearly instantly, as opposed to the ~100 cps rate of live replication; this was determined with opensips-cli -x mi get_statistics active_dialogs.

3.1.7 has the same problem. 3.2.4, on the other hand, does not seem to handle the initial sync at all; dlg_cluster_sync does not work either.

clusterer_list returns Up and enabled.

adrien-martin commented 2 years ago

For 3.2.4 I removed the sharing_tag on clusterer (I probably didn't need it in the first place). Now it behaves like 3.1.7 and 2.4.11.

stale[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 2 years ago

Please do not close.

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

adrien-martin commented 2 years ago

@github-actions please unstale.

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

rvlad-patrascu commented 2 years ago

Hi @adrien-martin ,

Can you test this again on one of the latest 3.1 or 3.2 versions? There were a couple of clusterer fixes that might impact this. Also, if this is easily reproducible, maybe you can even test it on 3.3? There is some useful information that we can gather in 3.3 with the new Status/Report mechanism (by running opensips-cli -x mi sr_list_reports clusterer).

Nevertheless, when reproducing this again, can you grep the logs for all the INFO:clusterer messages? And please post the configuration of the clusterer and dialog modules.

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

With 3.2.7 there are no dead dialogs, but existing dialogs from the active node are not synced.

Here are the logs when restarted: On the non-active restarted node (id 2):

Jul 22 11:42:45 node-2 opensips[30486]: INFO:clusterer:mod_init: Clusterer module - initializing
Jul 22 11:42:46 node-2 opensips[30500]: INFO:clusterer:handle_pong: Node [1] is UP
Jul 22 11:42:46 node-2 opensips[30500]: INFO:clusterer:send_sync_req: Sent sync request for capability 'dialog-dlg-repl' to node 1, cluster 1
Jul 22 11:42:47 node-2 opensips[30500]: INFO:clusterer:handle_sync_packet: Received all sync packets for capability 'dialog-dlg-repl' in cluster 1

On the active node (id 1):

Jul 22 11:42:46 node-1 opensips[26773]: INFO:clusterer:handle_pong: Node [2] is UP
Jul 22 11:42:46 node-1 opensips[26773]: INFO:clusterer:handle_sync_request: Received sync request for capability 'dialog-dlg-repl' from node 2, cluster 1
Jul 22 11:42:47 node-1 opensips[26773]: INFO:clusterer:send_sync_repl: Sent all sync packets for capability 'dialog-dlg-repl' to node 2, cluster 1

The configuration is:

loadmodule "clusterer.so"
modparam("clusterer", "seed_fallback_interval", 5)
modparam("clusterer", "my_node_id", {{ node_id }})

loadmodule "dialog.so"
modparam("topology_hiding", "force_dialog", 1)
modparam("dialog", "default_timeout", 18000)
modparam("dialog", "dialog_replication_cluster", {{ cluster_id }})

The clusterer table has:

 id | cluster_id | node_id |          url          | state | no_tries |   description    | flags | sip_addr | priority | no_ping_retries
----+------------+---------+-----------------------+-------+----------+------------------+-------+----------+----------+-----------------
 38 |          1 |       1 | bin:192.168.0.1:5059  |     1 |        0 | node-1           | seed  |          |       50 |               3
 37 |          1 |       2 | bin:192.168.0.2:5059  |     1 |        0 | node-2           | seed  |          |       50 |               3

With version 3.3 the same thing happens, but now there is also a sync from the restarted passive node towards the active node at the same time.

Besides that, opensips-cli -x mi sr_list_reports clusterer shows that all the dialogs were received, yet these dialogs are not counted by get_statistics active_dialogs.

On the restarted passive node:

[   
    {   
        "Name": "cap:dialog-dlg-repl",
        "Reports": [
            {   
                "Timestamp": 1658495793,
                "Date": "Fri Jul 22 15:16:33 2022",
                "Log": "Sync requested"
            },
            {   
                "Timestamp": 1658495794,
                "Date": "Fri Jul 22 15:16:34 2022",
                "Log": "Sync started from node [1]"
            },
            {   
                "Timestamp": 1658495794,
                "Date": "Fri Jul 22 15:16:34 2022",
                "Log": "Sync completed, received [6002] chunks"
            }
        ]
    },
    {   
        "Name": "sharing_tags",
        "Reports": []
    },
    {   
        "Name": "node_states",
        "Reports": [
            {   
                "Timestamp": 1658495794,
                "Date": "Fri Jul 22 15:16:34 2022",
                "Log": "Node [1], cluster [1] is UP"
            }
        ]
    }
]

On the active node:

[
    {
        "Name": "cap:dialog-dlg-repl",
        "Reports": [
            {
                "Timestamp": 1658495538,
                "Date": "Fri Jul 22 15:12:18 2022",
                "Log": "Sync requested"
            },
            {   
                "Timestamp": 1658495540,
                "Date": "Fri Jul 22 15:12:20 2022",
                "Log": "Sync completed, received [0] chunks"
            },
            {   
                "Timestamp": 1658495793,
                "Date": "Fri Jul 22 15:16:33 2022",
                "Log": "Sync requested from node [2]"
            },
            {   
                "Timestamp": 1658495794,
                "Date": "Fri Jul 22 15:16:34 2022",
                "Log": "Sync started from node [2]"
            },
            {   
                "Timestamp": 1658495794,
                "Date": "Fri Jul 22 15:16:34 2022",
                "Log": "Sync completed, received [91] chunks"
            }
        ]
    },
    {   
        "Name": "sharing_tags",
        "Reports": []
    },
    {   
        "Name": "node_states",
        "Reports": [
            {   
                "Timestamp": 1658495540,
                "Date": "Fri Jul 22 15:12:20 2022",
                "Log": "Node [2], cluster [1] is UP"
            },
            {   
                "Timestamp": 1658495791,
                "Date": "Fri Jul 22 15:16:31 2022",
                "Log": "Node [2], cluster [1] is DOWN"
            },
            {   
                "Timestamp": 1658495793,
                "Date": "Fri Jul 22 15:16:33 2022",
                "Log": "Node [2], cluster [1] is UP"
            }
        ]
    }
]

rvlad-patrascu commented 2 years ago

Hi @adrien-martin,

Can you try again with the latest fixes on 3.2 or 3.3?

Also, note that part of the problem was that the dialogs were not marked with any sharing tag. Those dialogs should indeed not be discarded after syncing, and this should now be fixed, but I just want to mention that it is recommended to use sharing tags. They are usually helpful for controlling which node takes certain actions (see the docs here).
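
For readers following along, a minimal sketch of what that recommendation could look like; the tag name "vip" and the route fragment are illustrative assumptions, not taken from this deployment (the modparam syntax matches what is tried later in this thread):

# on the currently active node (tag "vip" in cluster 1)
modparam("clusterer", "sharing_tag", "vip/1=active")

# on the backup node the same tag starts in backup state
modparam("clusterer", "sharing_tag", "vip/1=backup")

# when handling the initial INVITE: topology_hiding() creates the dialog
# (because force_dialog=1), then the dialog is bound to the tag
topology_hiding();
set_dlg_sharing_tag("vip");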

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs. Sometimes the active node loses all dialogs. Sometimes, on the passive node, the replicated dialogs seem to be dead: they do not disappear and stay in state 4.

sr_list_reports clusterer shows Sync timed out, received [number] chunks, but increasing seed_fallback_interval does not help.

Sorry, I was not aware sharing tags were important, as the main replication was working as expected. I guess I will try the anycast setup. Is topology_hiding() enough before set_dlg_sharing_tag() if the force_dialog parameter is set?

With two active sharing tags the behaviour looks similar, but I'm not sure it is configured correctly. I added modparam("clusterer", "sharing_tag", "node_1/1=active") / modparam("clusterer", "sharing_tag", "node_2/1=active") and set_dlg_sharing_tag("node_1") / set_dlg_sharing_tag("node_2"), but sr_list_reports clusterer lists no sharing tags.

rvlad-patrascu commented 2 years ago

Hi @adrien-martin,

With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs.

So are those 2/3 of dialogs shown by get_statistics active_dialogs? What about the number shown by the Sync completed, received [number] chunks report?

Sometimes the active node loses all dialogs. Sometimes, on the passive node, the replicated dialogs seem to be dead: they do not disappear and stay in state 4.

Can you post the output of sr_list_reports clusterer for such cases?

sr_list_reports clusterer shows Sync timed out, received [number] chunks, but increasing seed_fallback_interval does not help.

This is not related to seed_fallback_interval; it only tells us that the sync end-marker packet has not been received, so the sync process is probably incomplete and has failed.

Is topology_hiding() enough before set_dlg_sharing_tag() if the force_dialog parameter is set?

Yes, it should be enough.
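
In script terms, the confirmed ordering is simply the following (a sketch only, reusing the tag name from the config quoted above):

# force_dialog=1 makes topology_hiding() create the dialog itself,
# so the sharing tag can be attached right after it
topology_hiding();
set_dlg_sharing_tag("node_1");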

With two active sharing tags the behaviour looks similar, but I'm not sure it is configured correctly.

It looks like the sharing tags are in fact configured correctly, so probably sr_list_reports clusterer simply has no reports about state changes to show. You can check the tags with the clusterer_list_shtags MI function.

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

With 3.3.0~20220801~9cdcdf238-1 the passive node gets up to 2/3 of the dialogs.

So are those 2/3 of dialogs shown by get_statistics active_dialogs? What about the number shown by the Sync completed, received [number] chunks report?

Yes, and it is the same number in the report.

Sometimes the active node loses all dialogs. Sometimes, on the passive node, the replicated dialogs seem to be dead: they do not disappear and stay in state 4.

Can you post the output of sr_list_reports clusterer for such cases?

I think lost dialogs only happen without sharing tags. Both nodes sync at the same timestamp with a small number of chunks (in all tests the second node never generates any dialogs):

            {   
                "Timestamp": 1660744538,
                "Date": "Wed Aug 17 15:55:38 2022",
                "Log": "Sync started from node [1]"
            },
            {   
                "Timestamp": 1660744538,
                "Date": "Wed Aug 17 15:55:38 2022",
                "Log": "Sync completed, received [91] chunks"
            }

With different tags, or the same tag in different states, the node creating dialogs tries to sync too, but usually finds no donor ("Donor node not found, fallback to synced state"). Sometimes the main node instead syncs a small number of dialogs, but does not appear to lose other dialogs in this case. I think the restarted node is considered a donor too early.

rvlad-patrascu commented 2 years ago

Hi @adrien-martin

Can you also test with this modparam set on both nodes? modparam("dialog", "cluster_auto_sync", 0)
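
For clarity, on top of the dialog configuration quoted earlier this would mean something like the following sketch (cluster id 1 is taken from the clusterer table above; the comment reflects my reading of the suggestion, not official documentation):

loadmodule "dialog.so"
modparam("dialog", "dialog_replication_cluster", 1)
# presumably disables the automatic sync request when a node comes up;
# a sync would then have to be triggered manually (e.g. via dlg_cluster_sync)
modparam("dialog", "cluster_auto_sync", 0)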

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

Ah, thanks, I didn't know about this parameter (we are currently using old versions). With this parameter the dialog-generating node does not try to sync when the other one restarts, but the restarted node's sync still times out the same way as before.

rvlad-patrascu commented 2 years ago

Hi @adrien-martin

So the actual number of synced dialogs (returned by get_statistics) is the same as shown in that Sync timed out, received [number] chunks report log, right? Can you maybe post debug logs from both nodes?

adrien-martin commented 2 years ago

Hi @rvlad-patrascu,

Sorry, my answer was badly formatted and ended up inside the quote block. The number of synced dialogs reported by get_statistics and by sr_list_reports clusterer is the same.

I will add debug logs once I find a log threshold that avoids debug overload; right now debug logging has an impact on the replication and some packets are lost.

adrien-martin commented 2 years ago

Here are some debug logs (filtered on the clusterer module; I can provide unfiltered logs if needed): node_1.log node_2.log

github-actions[bot] commented 2 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 2 years ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

liviuchircu commented 2 years ago

Let's keep this one open, as I've run into a similar "asymmetric backup node restart" issue as well, where the backup sometimes seems to have more dialogs than the active node following a restart. I'm not sure about the "fewer dialogs" scenario discussed in this topic; maybe that one is possible as well, even on my setup. Nevertheless, I will try to reproduce the issue in a lab environment and provide more info.

EDIT: in my case, the "extra dialogs on backup" issue is always fixed with a simple dlg_cluster_sync command run on the backup after the restart. Still, things shouldn't work like this...

github-actions[bot] commented 1 year ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 1 year ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.