Avnu / OpenAvnu

OpenAvnu - an Avnu sponsored repository for Time Sensitive Network (TSN and AVB) technology
471 stars 291 forks source link

[gPTP daemon] Synchronization issues, no Synch messages being sent. #639

Open vitorroriz opened 7 years ago

vitorroriz commented 7 years ago

Good day everybody.

My setup is: 2 devices, running Linux imx7dsabresd 4.1.15, directly connected through the eth1 interfaces of each device. I am using the master branch of OpenAvnu, and I added the ptp group for both devices, through groupadd ptp.

My initial aim is to get synchronization with the gPTP daemon so I can move on.


First case: When I run both devices with S option, i.e.:

./daemon_cl eth1 -S

Both set asCapable true, and they keep exchanging PDelay messages, but no Synch message is sent. Here is the log for the device1:

INFO : GPTP [12:39:20:420] gPTP starting ERROR : GPTP [12:39:20:423] Using clock device: /dev/ptp1 DEBUG : GPTP [12:39:20:423] [../../common/common_port.cpp:549] Received POWERUP/INITIALIZE event STATUS : GPTP [12:39:20:423] Starting PDelay DEBUG : GPTP [12:39:20:423] [../src/linux_hal_common.cpp:254] Interface eth1 is up STATUS : GPTP [12:39:20:423] Link Speed: 1000000 kb/sec DEBUG : GPTP [12:39:20:439] [../../common/common_port.cpp:599] ANNOUNCE_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:39:20:455] [../../common/ether_port.cpp:496] PDELAY_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:39:20:457] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec DEBUG : GPTP [12:39:20:457] [../../common/ether_port.cpp:519] Sent PDelay Request message DEBUG : GPTP [12:39:20:457] [../../common/ether_port.cpp:536] Schedule PDELAY_RESP_RECEIPT_TIMEOUT_EXPIRES, PDelay interval 129, timeout 5000000 EXCEPTION: GPTP [12:39:20:462] PDelay Response Receipt Timeout ... ... DEBUG : GPTP [12:39:21:315] [../../common/ptp_message.cpp:290] Received PDelay Response message, Timestamp 1482323961 (sec) 314064945 (ns), seqID 49 DEBUG : GPTP [12:39:21:315] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [12:39:21:316] [../../common/ptp_message.cpp:349] Received PDelay Response FollowUp message DEBUG : GPTP [12:39:21:316] [../../common/ptp_message.cpp:1664] Link delay: 0 ns STATUS : GPTP [12:39:21:316] AsCapable: Enabled DEBUG : GPTP [12:39:21:318] [../../common/ether_port.cpp:496] PDELAY_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:39:21:333] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec DEBUG : GPTP [12:39:21:333] [../../common/ether_port.cpp:519] Sent PDelay Request message DEBUG : GPTP [12:39:21:334] [../../common/ether_port.cpp:536] Schedule PDELAY_RESP_RECEIPT_TIMEOUT_EXPIRES, PDelay interval 129, timeout 5000000 DEBUG : GPTP [12:39:21:335] [../../common/ptp_message.cpp:241] *** Received PDelay Request message DEBUG : GPTP [12:39:21:335] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [12:39:21:335] [../../common/ptp_message.cpp:1201] Process PDelay Request SeqId: 0 DEBUG : GPTP [12:39:21:336] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec


Second Case: When I force device 1 into Master, and device 2 into Slave modes with the -T and -L options, the SYNC_INTERVAL_TIMEOUT_EXPIRES event is scheduled in the Master, and now shm/ptp is written. But this happens only once, and the values are not continuous updated. Also, he "asCapable" value in the file is also false, even though both devices set it True according to the log. Log for device1:

INFO : GPTP [12:42:15:854] gPTP starting ERROR : GPTP [12:42:15:856] Using clock device: /dev/ptp1 DEBUG : GPTP [12:42:15:856] [../../common/common_port.cpp:549] Received POWERUP/INITIALIZE event STATUS : GPTP [12:42:15:857] Switching to Master DEBUG : GPTP [12:42:15:857] [../src/linux_hal_common.cpp:254] Interface eth1 is up STATUS : GPTP [12:42:15:857] Link Speed: 1000000 kb/sec DEBUG : GPTP [12:42:15:873] [../../common/common_port.cpp:599] ANNOUNCE_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:42:15:873] [../../common/common_port.cpp:627] SYNC_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:42:15:873] [../../common/ieee1588clock.cpp:276] Calculated local to system clock rate difference DEBUG : GPTP [12:42:15:905] [../../common/ether_port.cpp:496] PDELAY_INTERVAL_TIMEOUT_EXPIRES occured DEBUG : GPTP [12:42:15:907] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec DEBUG : GPTP [12:42:15:907] [../../common/ether_port.cpp:519] Sent PDelay Request message DEBUG : GPTP [12:42:15:907] [../../common/ether_port.cpp:536] Schedule PDELAY_RESP_RECEIPT_TIMEOUT_EXPIRES, PDelay interval 129, timeout 5000000 EXCEPTION: GPTP [12:42:15:912] PDelay Response Receipt Timeout ... ... DEBUG : GPTP [12:42:16:590] [../../common/ptp_message.cpp:290] Received PDelay Response message, Timestamp 1482324136 (sec) 589474435 (ns), seqID 39 DEBUG : GPTP [12:42:16:590] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [12:42:16:591] [../../common/ptp_message.cpp:349] * Received PDelay Response FollowUp message DEBUG : GPTP [12:42:16:591] [../../common/ptp_message.cpp:1664] Link delay: 0 ns STATUS : GPTP [12:42:16:592] AsCapable: Enabled**


As the shm/ptp variable was not being updated, I connected a device directly to a computer to filter in wireshark what PTP messages were being sent. When I executed:

./daemon_cl eth1 -S -T

setting the device as master, I saw that the only messages sent were Path_Delay_Req_Message. No Synch message is sent:

image

Am I missing something? Also, shouldn't the devices negotiate the master and slave positions, through the BCMA algo?

pinealservo commented 7 years ago

Something apparently goes wrong with the interval initialization on ARM platforms. If you look at the logs, you'll see interval amounts with ridiculous values like 129; that means 2^129 seconds, which is obviously wrong. It gets clamped to EVENT_TIMER_GRANULARITY, which from the logs is apparently 5000000ns, or 5ms.

You can also see in the Wireshark log that the Path Delay Request messages are being sent way too close together--the default value should be 1 second intervals.

To fix this, you just need to set the correct default values in a config file or pass them on the command line. This issue is on my list of things to investigate, but I was not really expecting as much interest in ARM-based platforms as we've seen lately. If you want to investigate what's going on with the initialization of these values, we'd definitely appreciate the contribution.

vitorroriz commented 7 years ago

I think there's something strange.

I am now passing the intervals through the command line. I am not using the automotive profile. Now, the IPC shared variable is being updated. However I noticed that the Sync Count on the slave was not being incremented, this is because the Slave is becoming a Master. The issue is that, always, after 3 seconds, the Slave becomes Master and it doesn't matter the parameters that are passed. It seems that if you are not using the automotive profile, nobody calls "setintervals". Then, getAnnounceInterval is always 0. This makes the AnnounceTimeout interval be always 3 seconds, considering that the multiplier = 3 :

clock->addEventTimerLocked(this, ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES,(ANNOUNCE_RECEIPT_TIMEOUT_MULTIPLIER*(unsigned long long)(pow((double)2,getAnnounceInterval())*1000000000.0)));

But even with the standard timeout, of 3 seconds, this event should be deleted for incoming announceMessages, but it seems that it never happens. Digging into the code we have:

This change, of the slave becoming a master is only triggered by CommonPort::processSyncAnnounceTimeout, this function is called only in the merged-switch-case of two events:

case ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES:
case SYNC_RECEIPT_TIMEOUT_EXPIRES:

By debugging we can see that the event = 14 is the one triggering the case, and event = 14, means ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES.

Then I started to track when this event was being deleted from the queue and as mentioned, it seems that it never happens. Here is the log for the slave node, I added a personal print to show the events being deleted:

INFO : GPTP [15:11:52:728] gPTP starting ERROR : GPTP [15:11:52:731] Using clock device: /dev/ptp1 DEBUG : GPTP [15:11:52:731] [../../common/common_port.cpp:549] Received POWERUP/INITIALIZE event VITOR DEBUG: deleting event 15 VITOR DEBUG: deleting event 10 STATUS : GPTP [15:11:52:731] Switching to Slave DEBUG : GPTP [15:11:52:731] [../src/linux_hal_common.cpp:254] Interface eth1 is up STATUS : GPTP [15:11:52:732] Link Speed: 1000000 kb/sec DEBUG : GPTP [15:11:53:213] [../../common/ptp_message.cpp:241] Received PDelay Request message DEBUG : GPTP [15:11:53:213] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [15:11:53:213] [../../common/ptp_message.cpp:1203] Process PDelay Request SeqId: 1 DEBUG : GPTP [15:11:53:214] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec DEBUG : GPTP [15:11:53:214] [../../common/ptp_message.cpp:1217] Sent PDelay Response message DEBUG : GPTP [15:11:53:214] [../../common/ptp_message.cpp:1255] Sent PDelay Response FollowUp message DEBUG : GPTP [15:11:54:186] [../../common/ptp_message.cpp:166] Received Sync message DEBUG : GPTP [15:11:54:186] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [15:11:54:187] [../../common/ptp_message.cpp:191] Received Follow Up message DEBUG : GPTP [15:11:54:187] [../../common/ptp_message.cpp:1005] Processing a follow-up message VITOR DEBUG: deleting event 12 DEBUG : GPTP [15:11:55:188] [../../common/ptp_message.cpp:166] Received Sync message DEBUG : GPTP [15:11:55:188] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [15:11:55:189] [../../common/ptp_message.cpp:191] Received Follow Up message DEBUG : GPTP [15:11:55:189] [../../common/ptp_message.cpp:1005] Processing a follow-up message VITOR DEBUG: deleting event 12 DEBUG : GPTP [15:11:55:222] [../../common/ptp_message.cpp:241] Received PDelay Request message DEBUG : GPTP [15:11:55:222] [../../common/ether_port.cpp:232] RX PHY compensation: 0 0.000000382 sec DEBUG : GPTP [15:11:55:222] [../../common/ptp_message.cpp:1203] Process PDelay Request SeqId: 2 DEBUG : GPTP [15:11:55:223] [../../common/ptp_message.cpp:573] TX PHY compensation: 0 0.000000184 sec DEBUG : GPTP [15:11:55:223] [../../common/ptp_message.cpp:1217] Sent PDelay Response message DEBUG : GPTP [15:11:55:223] [../../common/ptp_message.cpp:1255] Sent PDelay Response FollowUp message VITOR DEBUG: processEvent : TIMEOUT for event 14 STATUS : GPTP [15:11:55:731] *** Announce Timeout Expired - Becoming Master

pinealservo commented 7 years ago

Regarding setintervals: This is related to signalling messages, which are used more heavily in the automotive profile. These are for communicating the intervals in use and requesting runtime changes and that sort of thing. The setintervals method itself just fills out fields in a signalling message packet, but we should have reasonable values set even if we never send or receive a signalling message.

Regarding the Announce receipt timeout: The 3x multiplier is because you can miss 3 of them before you should conclude that the master is gone. From the log, it appears that you are receiving Sync messages from the master, but that you have not received any Announce messages. You can see the deletions of event 12 after every sync, but you never get an Announce, so you never see deletion of a 14. Because the Announce interval is always 0, the master should be sending one every second, but if it is, you don't seem to be getting it. Can you see whether the master thinks it is sending Announce messages?

By the way--I think I've finally discovered why the default intervals are not getting set properly on ARM platforms. The PortInit_t structure that we use to pass the configured values to the port is given sentinel values so that the port can recognize if no configuration was given and it should set its default values. Unfortunately, we used a negative sentinel value and the fields are of type char, which is signed on Intel and unsigned on ARM. You might think this would still be fine because we're looking for the same value we set, but that's not the case because of integer promotion rules. The char-length values are promoted to signed int before comparison occurs, and so the char-length value is sign-extended on Intel and zero-extended (as it's read from an unsigned variable) on ARM. The literal we're comparing to is sign-extended on both platforms, so it matches on Intel and doesn't on ARM.

pinealservo commented 7 years ago

BTW, I have just submitted a pull request for a patch that fixes the interval init problem if you want to try it out before it gets merged.

Also, I just remembered you mentioned you were experimenting with the Automotive Profile. If your master is still using that, be aware that it doesn't use BMCA at all, which may be why you are not receiving Announce messages on the other device.

andrew-elder commented 7 years ago

Wow, good catch @pinealservo !

vitorroriz commented 7 years ago

Good afternnon @pinealservo and @andrew-elder.


  1. I tested the commit to correct the initial values and it didn't work, I reported my log in the commit topic itself.
  2. Then, I found a problem:

Indeed as sugested by @pinealservo, my master was not sending announce messages. This is because, in the first attempt of sending a Announce message, if asCapable was still False. Then we were breaking and returning true, without restarting the timer. I temporarily changed it like this:

        case ANNOUNCE_INTERVAL_TIMEOUT_EXPIRES:
        GPTP_LOG_DEBUG("ANNOUNCE_INTERVAL_TIMEOUT_EXPIRES occured");
        if( !asCapable )
        {
            ret = true;
            fprintf(stderr, "VITOR DEBUG: Announce message not sent\n");
        //  break;
        }

        // Send an announce message
        else
        {
            PTPMessageAnnounce *annc =
                new PTPMessageAnnounce(this);
            PortIdentity dest_id;
            PortIdentity gmId;
            ClockIdentity clock_id = clock->getClockIdentity();
            gmId.setClockIdentity(clock_id);
            getPortIdentity( dest_id );
            annc->setPortIdentity( &dest_id );
            annc->sendPort( this, NULL );
            fprintf(stderr, "VITOR DEBUG: Announce sending attempt\n");
            delete annc;
        }

        startAnnounceIntervalTimer
            ((uint64_t)( pow((double)2, getAnnounceInterval()) *
                     1000000000.0 ));
        ret = true;
        break;
pinealservo commented 7 years ago

If asCapable is set to False, then it should not be sending Announce or Sync messages. I don't think it will be very fruitful to investigate this further until you have got your link delay calculations worked out, although it seems there may be some bugs lurking here in some edge cases that happens when bad link delays occur.

pinealservo commented 7 years ago

I apologize for the incomplete init fix; there is a new pull request that completes the fix and I made sure I actually ran it without setting the delay parameters on the command line this time.

vitorroriz commented 7 years ago

@pinealservo thanks for the new patch regarding the initial values for arm arch and your explanations about the phy delay.

The problem is that if asCapable was false once, and then it was set true, for some reason the announce messages are not being sent and then both devices think they are the master. Note the sync delay count and grandmaster id below:

Device1 ......................................................................................................... Device2 image

I followed your comment about the Phy delays and also I read a thread that you have already discussed this, then I made some math and adjusted my Rx and Tx delays accordingly. I achieved a reasonable link delay for a 2 meters, CAT.6, back-to-back connection. I am printing the link delay each time it is set by setLinkDelay. However, for some reason that I need to investigate further, getLinkDelay is still always returning 0 in the log:

image

andrew-elder commented 7 years ago

Looks like you are on the right track... You are probably going to have to review timestamps in detail!

vitorroriz commented 7 years ago

Interesting, in the print of the GPTP log debug, the link delay was being printed as %ld, when I force it to %lld, as in my personal debug prints, the values match.

image

pinealservo commented 7 years ago

Hmm, it could be that I'm not seeing this problem because I've got a switch in between my devices. I was searching yesterday for how we're handling asCapable changes and it's possible that we're not correctly dealing with all the transitions there.

Are you passing any command line parameters or anything in particular in the config files aside from the phy delays now?

andrew-elder commented 7 years ago

Maybe time to rerun against Avnu's gPTP test suite if anyone has access.

vitorroriz commented 7 years ago

In the config file I just commented the line stating 100Mb linkspeed and let the gigabit linkspeed line. Then I tunned the the Rx and Tx delays.

In the command line, I was usually passing -S and -T options.

I made a test. I noticed that asCapable starts as True if you are using the automotive profile and false otherwise. This is an issue because, for some reason, if you have asCapable = false once, then the master stop schedulling announce messages even after asCapable = true again for this node. Then, I changed the initial value of asCapable for true, giving a chance to the system and hoping it will clear the value if the system is eventually asCapable = false.

The results were that it seems to work, in the sence that one of the devices has port_state = 7 and the other port_state = 9, and also the sync count is consistent (one has sync_count = 0 and the other always has it increasing). What is strange though, is the fact that both devices keep having the gptp grandmaster_id = 00:00:00:...:00.

elmad commented 7 years ago

Hi, I'm trying AVB back-to-back as vitorroriz, my setup is: 2 PC with i210 NICs, linux kernel 4.9.0-3-amd64 for both, cable cat 5e 15 meters.

I have the very same problems he described here and here, I tried his temporary solution to change the initial value of asCapable as true, and seems to work, but only if I start first the Slave (-S -L) and then the Master on the other PC (-S -T).

Regarding the PHY delay calibration, sorry for the question, but I'm here after studying the standards and now I want to study here an implementation: do you have any link where I can understand this problem? Is it a PHY internal delay that you were talking about? If so, the value should be in the datasheet of the NIC, right? And how is this tied to the cable length?