OpenEtherCATsociety / SOEM

Simple Open Source EtherCAT Master
Other
1.35k stars 686 forks source link

Strange Cone shape deviation with DC sync. #866

Closed MStackoverflow closed 1 month ago

MStackoverflow commented 1 month ago

Hi, I've been trying to understand this for days now and I cannot find why I'm getting this behavior. If anyone has had experience with this, please let me know :

deviation = ec_DCtime-previous-ecatLoopPeriod

Average max deviation : 5000 ns Max : 287481 ns Total Average : 0 ns

image

I have a technosoft iMotionCube drive that runs at 0.8ms slow loop. They say that "communication cycle shall be a multiple of the SYNC0 cycle, SYNC0 shall be a multiple of control loop ". This is the timing they say to respect : https://technosoftmotion.com/wp-content/uploads/P091.064.EtherCAT.iPOS_.UM_.0623.pdf image

My SOEM is set to run at 800000 ns, with an offset of 41%. Here are snippets :

ecatLoopPeriod = 800000;
ecatLoopOffset = ecatLoopPeriod*1000/410;

void ec_sync(int64 reftime, int64 cycletime , int64 *offsettime)
{
    delta = (reftime - ecatLoopOffset) % cycletime;
    if(delta>cycletime / 2) { delta= delta - cycletime;}
    if(delta>0){ integral++;}
    if(delta<0){ integral--;}
    *offsettime = -(delta /100) - (integral / 20);
   gl_delta = delta;
}

.......

while(true){
    while(chrono::steady_clock::now()<nextLoop){;}

    wkc = ec_receive_processdata(EC_TIMEOUTRET);

    if(ec_slave[0].hasdc){
        ec_sync(ec_DCtime,ecatLoopPeriod,&toff);
        nextLoop+=std::chrono::nanoseconds(toff);
        deviation = ec_DCtime-previous-ecatLoopPeriod;
        timepoints[i++] = deviation;
        previous=ec_DCtime;
    }

    ec_send_processdata();

    nextLoop+=std::chrono::nanoseconds(ecatLoopPeriod);
    this_thread::sleep_until(nextLoop-200us);

}

The thread runs in real-time (FIFO, 99) isolated on core 1. I moved the Ethernet IRQ affinity on core 1 also. disabled everything that is related to that ethernet port (no IRQ is seen without the program) I get the SAME behavior when I run on CodeSys. If I run a loop and compare with the time of chrono::steady_clock::now(), the loop runs at a deviation of 120ns.

Any insight would be appreciated :)

ArthurKetels commented 1 month ago

Interesting problem! This needs some tests to isolate the cause of the measured deviation spikes.

A pity you did not add any horizontal axis, so I have no idea of the time scale.

Also it is nice to know if the anomaly starts with a deviation under- or overshoot? From the symmetry it is clear that any deviation spike is followed by one with the exact opposite sign. This makes sense if the time point, at which the packet is send, is shifted only once.

Then measure the time interval over ec_receive_processdata() and ec_send_processdata(). It might be one of those is taking much longer than expected.

In you case the EC_TIMEOUTRET can be reduced to zero. You have already waited for more than than the packet transmission time. so if the packet has not be received by then, it will never be received. Better to cut your losses and continue with the next cycle. You will experience packet loss, the question is only how little. And you can better be prepared how to handle packet loss in your application.

The CPU isolation is a good idea, but raising the priority to 99 is not. You do have to service the NIC driver, and that only runs at priority 50. So when you are busy spinning in a busy loop for your cycle to start (around 200us each time), you effectively block any NIC driver operation. I always keep my PDO task at maximum priority 49.

MStackoverflow commented 1 month ago

Alright I'll reduce the nice and fiddle with timeouts.

The deviation starts with a positive number and then a negative number. It is to note that the PI control does not do such aggressive changes, meaning that it seems to be something outside the "user" code. Maybe the NIC driver like you said.

Each point is a sample and it's over a period of 120000 samples (about 1.6min).

I will also try the code on another hardware. Maybe I'll get different results. I did try with other Ethercat slave devices and I get the same behavior. The cone shape changes if I change the offset or the frequency, but there's a clear pattern. Hopefuly I can find what's going on and help other people that will stumble upon this behavior.

MStackoverflow commented 1 month ago

I've tested on two other devices and I don't see the problem. After researching more, It looks like it's an OEM kernel driver issue with the SBC I'm working on.

The RPI4 has a deviation of 30000ns and a LAN7430 on PCI has 6000ns deviation

image