MEGA65 / mega65-core

MEGA65 FPGA core

Ethernet is flaky on recent builds #145

Closed gardners closed 3 years ago

gardners commented 4 years ago

Ethernet is flaky, presumably because we latch data on rising edge, when it is in fact clocked on falling edges.

gardners commented 4 years ago

Working on making ethertest.a65 a more useful test to support this. Will include ARP, PING and TFTP server.

gardners commented 4 years ago

Working on making ethertest.a65 a more useful test to support this. Will include ARP, PING and TFTP server. TFTP get already works. TFTP put is work-in-progress. This has required implementing the missing dos_mkfile() hypertrap, which is making this issue blow out a bit.
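For anyone testing TFTP get/put against ethertest.a65 from the host side, here is a minimal sketch of the packets involved, following the wire format in RFC 1350. It is not taken from ethertest.a65; the function names and the example filename are made up for illustration.

```python
import struct

# TFTP opcodes as defined in RFC 1350.
OP_RRQ, OP_WRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 2, 3, 4, 5


def tftp_request(opcode, filename, mode="octet"):
    """Build an RRQ (get) or WRQ (put): opcode, filename, NUL, mode, NUL."""
    if opcode not in (OP_RRQ, OP_WRQ):
        raise ValueError("request opcode must be RRQ or WRQ")
    return (struct.pack("!H", opcode)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def tftp_data(block, payload):
    """Build a DATA packet; a payload shorter than 512 bytes ends the transfer."""
    if len(payload) > 512:
        raise ValueError("TFTP data blocks carry at most 512 bytes")
    return struct.pack("!HH", OP_DATA, block & 0xFFFF) + payload


def tftp_ack(block):
    """Build an ACK for the given block number."""
    return struct.pack("!HH", OP_ACK, block & 0xFFFF)


# Hypothetical filename, purely for illustration.
wrq = tftp_request(OP_WRQ, "TEST.D81")
```

A put is then a WRQ, an ACK for block 0 from the server, and alternating DATA and ACK packets until a short DATA block arrives.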

gardners commented 4 years ago

ethertest.a65 currently doesn't correctly realise when a file exists, to re-use it, and instead creates it again. dos_mkfile() should fail with "file exists" in this case, but doesn't. Result is duplicate files if attempting to create / put the same file repeatedly.
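The intended behaviour can be sketched with a toy model. This is not the hypervisor code: `ToyDirectory`, `mkfile` and `open_for_put` are hypothetical stand-ins that only show the create-or-reuse logic, with the create call failing when the name is already present.

```python
class ToyDirectory:
    """Stand-in for a directory; not the real hypervisor data structures."""

    def __init__(self):
        self.entries = []

    def mkfile(self, name):
        # Refuse to create a second entry with the same name.
        if name in self.entries:
            raise FileExistsError(name)
        self.entries.append(name)


def open_for_put(directory, name):
    """Create the file if needed, otherwise re-use it.

    Returns True if a new entry was created, False if an existing one is re-used.
    """
    try:
        directory.mkfile(name)
        return True
    except FileExistsError:
        return False


d = ToyDirectory()
first = open_for_put(d, "TEST.D81")   # creates the entry
second = open_for_put(d, "TEST.D81")  # re-uses it, no duplicate
```

With this shape, repeated puts of the same file leave exactly one directory entry.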

gardners commented 4 years ago

Ethernet RX seems to be quite stable, tested by watching the arrival of ICMP PING requests with ascending sequence numbers.

The problem is still on the sending side.
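The sequence-number check described above can be automated along these lines. This is a sketch, assuming the received ICMP echo sequence numbers have already been collected into a list; `find_sequence_gaps` is a hypothetical helper name. ICMP sequence numbers are 16 bits wide, so the comparison wraps at 65536.

```python
def find_sequence_gaps(seqs):
    """Return the sequence numbers missing between consecutive observed values."""
    missing = []
    for prev, cur in zip(seqs, seqs[1:]):
        step = (cur - prev) & 0xFFFF
        # step == 0 is a duplicate; a very large step is treated as
        # reordering rather than tens of thousands of lost packets.
        if step == 0 or step >= 0x8000:
            continue
        missing.extend((prev + k) & 0xFFFF for k in range(1, step))
    return missing


lost = find_sequence_gaps([1, 2, 3, 5, 6])
```

An empty result over a long run is what "quite stable" looks like from this side.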

gardners commented 4 years ago

Things to try:

  1. Add back 200MHz TX clock to allow for proper TX phase adjustment.
  2. Do we need fast slew?
  3. Do we need an OBUF thingy to improve signal integrity?

gardners commented 4 years ago

For 3, it is already inferring OBUFs. SLEW FAST didn't fix it. Not 100% sure if our TX phase adjusting is working: it should be possible to pick a phase that makes things much worse, but there is no apparent difference. Trying SLEW SLOW DRIVE 4, in case the problem is reflections or ringing.

gardners commented 4 years ago

Also should display packets on screen to make sure that reading is not the source of the problem.

gardners commented 4 years ago

RX of packets now seems rock solid, and pinging with 950 byte packets results in reliable responses. The TFTP server, however, stalls sending all the time because of RX errors on the Linux side, i.e., we are still TXing the odd dud packet. This probably means we should re-check the TX phase.

gardners commented 4 years ago

Quick test with trying all four TX phase offsets shows problem still happening. Further investigation required.

gardners commented 4 years ago

Using BUFG on the ethernet clock has fixed almost all the RX and TX errors. We are still seeing the occasional lost packet, but no more packet corruption caused by fluffy clocks.

gardners commented 4 years ago

Now, we have a remaining problem where the ethernet locks up and won't TX anymore. It still sees packets arrive, but seems unable to respond to them.

gardners commented 4 years ago

Doing a soft reset of the machine doesn't fix it, so I assume it is something in the ethernet controller's internal state machine.

gardners commented 4 years ago

TX lockup actually looks like the ethernet PHY chip gets its MDX status confused.

Otherwise, the only problem we are still seeing is packet loss of ~2-3% in the TX direction, without any real explanation.

gardners commented 4 years ago

Removing and reinserting ethernet cable clears MDX status, but doesn't reestablish TX of packets. Added instrumentation to see TX state machine status, and it looks like it is stuck counting down the TX delay interval. The new logic for resetting the ethernet state machine seems to work to clear the situation, further confirming it is just a VHDL state machine problem.
That it is a VHDL problem is confirmed by the addition of a "soft reset" option that resets the ETH TX state, without resetting the PHY itself. Using this soft-reset option also unblocks sending.
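A host-side check on the instrumentation output might look like the sketch below. It assumes the TX state has been sampled into a list of state names. Only `IdleWait` comes from the controller; the other state names, the helper names and the limit are placeholders. It flags a lockup when the state machine sits in one state for more consecutive samples than the delay interval should ever need.

```python
def longest_run(samples, state):
    """Length of the longest consecutive run of `state` in `samples`."""
    best = run = 0
    for s in samples:
        run = run + 1 if s == state else 0
        best = max(best, run)
    return best


def looks_stuck(samples, state="IdleWait", limit=8):
    """True if `state` persists for more than `limit` consecutive samples.

    `limit` is a placeholder; pick it from the real TX delay interval.
    """
    return longest_run(samples, state) > limit


# Hypothetical traces: a healthy one and a locked-up one.
healthy = ["Idle", "Sending", "IdleWait", "IdleWait", "Idle"]
locked = ["Sending"] + ["IdleWait"] * 20
```

This only detects the symptom; it says nothing about why the countdown never finishes.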

gardners commented 4 years ago

Problem can be stimulated just by trying to send a bunch of packets in quick succession. In my test, they were just duplicates of 908 byte long ICMP reply packets. What is really weird is that a symptom of the problem is the TX wait counter counting endlessly, which indicates that the state machine is getting into the IdleWait state, yet putting it into IdleWait is the only thing the software reset does.

gardners commented 4 years ago

BUT if you assert the soft-reset for too short a period of time, it doesn't fix the problem.