bnjmnp / pysoem

Cython wrapper for the Simple Open EtherCAT Master Library
MIT License

Returning PDO frame late (Windows) #130

Closed andre-comet closed 5 months ago

andre-comet commented 5 months ago

This is probably not an issue in pysoem itself, but I occasionally see the PDO frame coming from the EtherCAT subdevice being seen late by pysoem running on Windows. I have a simple test network with only one device connected to pysoem. The error shows up as EC_NOFRAME (the working counter is returned as -1) because the frame from the subdevice is not seen returning within the PDO timeout (here 10 ms). In the Wireshark captures below, the right side shows the packets as seen by pysoem. There it looks as if the subdevice does not return the PDO frame in time, which causes my pysoem implementation to send a BRD frame after the PDO timeout (10 ms). However, the frames on the left side, captured by a second PC using a Beckhoff ET2000 sniffer, show that the subdevice sent the frame immediately (Wireshark shows the delay since the last packet as 0.000000 s). [image]

So the issue seems to occur somewhere between the NIC and the Windows application layer. Increasing the timeout does not help. Am I missing something, or does network performance on a Windows machine need to be tuned somehow? Or is it simply not possible, and we should use Linux or a proper real-time system?
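For readers who want to quantify the failure mode described above, here is a minimal sketch that separates EC_NOFRAME results from ordinary working counter mismatches. It assumes a `master` object exposing `send_processdata()`, `receive_processdata(timeout)` (timeout in microseconds) and `expected_wkc`, as a configured `pysoem.Master` does. The names `classify_wkc` and `run_cycles` are made up for illustration and are not part of pysoem.

```python
EC_NOFRAME = -1  # working counter value reported when no frame came back in time


def classify_wkc(wkc, expected_wkc):
    """Label one PDO cycle result (hypothetical helper, not a pysoem function)."""
    if wkc == EC_NOFRAME:
        return "noframe"       # frame not seen within the timeout
    if wkc < expected_wkc:
        return "wkc_mismatch"  # frame returned, but not every subdevice processed it
    return "ok"


def run_cycles(master, cycles, timeout_us=10_000):
    """Run `cycles` PDO exchanges and count each outcome.

    `master` is assumed to behave like a configured pysoem.Master;
    10_000 us corresponds to the 10 ms PDO timeout mentioned above.
    """
    counts = {"ok": 0, "noframe": 0, "wkc_mismatch": 0}
    for _ in range(cycles):
        master.send_processdata()
        wkc = master.receive_processdata(timeout_us)
        counts[classify_wkc(wkc, master.expected_wkc)] += 1
    return counts
```

Running this for a few thousand cycles on each machine would give a rough per-setup rate of late frames that can be compared between runs.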

bnjmnp commented 5 months ago

I've also seen occasional working counter errors on Windows, but I did not spend much time debugging them. Disabling unneeded network services in the adapter settings might improve things. For functional testing of subdevices, working counter errors are not too big of a deal. But depending on your goals, it might indeed make sense to switch to pure SOEM and a proper real-time system.

andre-comet commented 5 months ago

Yes, the WKC errors are infrequent, but they make my tests flaky. For now I have worked around this by repeating tests that failed once.
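One possible reading of this workaround, sketched as a decorator that reruns a test a single time when it fails with a specific exception. This is an assumption about how such a retry could look, not the code actually used in this thread. `NoFrameError` and `retry_once_on` are hypothetical names that the test code itself would have to define and raise.

```python
import functools


class NoFrameError(Exception):
    """Hypothetical exception a test raises when the working counter is -1."""


def retry_once_on(exc_type):
    """Rerun the decorated test one time if it raises `exc_type`."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            try:
                return test_func(*args, **kwargs)
            except exc_type:
                # Second and last attempt: a repeated failure propagates.
                return test_func(*args, **kwargs)
        return wrapper
    return decorator
```

Limiting the retry to one exception type keeps genuine test failures visible, since only the sporadic no-frame case gets a second chance.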

Disabling network services seemed to help at first, but after testing for a while I could not see a correlation between disabling services and the frequency of the error. In the end it seemed random. However, there does seem to be a dependency on the NIC model: all the NICs I tested were Intel models, but not all exactly the same type, and on some of them the error occurs much more frequently.