Closed by gardners 3 years ago
Working on making ethertest.a65 a more useful test to support this. Will include ARP, PING and TFTP server. TFTP get already works. TFTP put is work-in-progress. This has required implementing the missing dos_mkfile() hypertrap, which is making this issue blow out a bit.
ethertest.a65 currently doesn't detect that a file already exists so that it can re-use it, and instead creates it again. dos_mkfile() should fail with "file exists" in this case, but doesn't. The result is duplicate files when the same file is created / put repeatedly.
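To make the intended behaviour concrete, here is a minimal Python sketch of "create, and fall back to re-use when the file exists". It is only a toy model: `ToyDirectory`, `mkfile`, `open_or_create` and the exception name are hypothetical and are not the real dos_mkfile() hypertrap interface. Matching names case-insensitively is also an assumption.

```python
class FileExistsInDirectory(Exception):
    """Raised by the toy mkfile() when the name is already present."""


class ToyDirectory:
    """Toy stand-in for a directory; NOT the real hypervisor DOS."""

    def __init__(self):
        self.entries = {}  # normalised name -> file contents

    @staticmethod
    def _normalise(name):
        # Assumption: names are matched case-insensitively.
        return name.upper()

    def mkfile(self, name, size):
        """Create a new file, refusing if the name already exists."""
        key = self._normalise(name)
        if key in self.entries:
            raise FileExistsInDirectory(name)
        self.entries[key] = bytearray(size)

    def open_or_create(self, name, size):
        """What a TFTP put handler could do: create, else re-use."""
        try:
            self.mkfile(name, size)
            return "created"
        except FileExistsInDirectory:
            return "reused"
```

With this shape, repeating a put of the same name leaves exactly one directory entry, because the second mkfile() is rejected and the existing file is re-used.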
Ethernet RX seems to be quite stable, judging by watching ICMP PING requests arrive with ascending sequence numbers.
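The sequence-number check can also be automated rather than watched by eye. The sketch below is a hypothetical helper, not part of ethertest.a65. It assumes the sequence numbers are logged in arrival order, that ICMP echo sequence numbers are 16 bits wide and wrap, and that a backwards jump is reordering rather than loss.

```python
def find_missing(seqs, modulus=1 << 16):
    """Return sequence numbers skipped between consecutive arrivals.

    seqs is the list of ICMP echo sequence numbers in arrival order.
    Duplicates (gap of 0) are ignored. A gap of more than half the
    sequence space is treated as reordering and skipped, which is an
    assumption of this sketch.
    """
    missing = []
    for prev, cur in zip(seqs, seqs[1:]):
        gap = (cur - prev) % modulus
        if gap > modulus // 2:
            continue  # looks like a late or reordered packet
        for k in range(1, gap):
            missing.append((prev + k) % modulus)
    return missing
```

An empty result over a long run is what "stable RX" would look like under these assumptions; any entries are the requests that never arrived.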
The problem is still on the sending side.
Things to try:
For 3, it is already inferring OBUFs. SLEW FAST didn't fix it. Not 100% sure if our TX phase adjustment is working, as it should be possible to pick a phase that makes it much worse, but there is no apparent difference. Trying SLEW SLOW DRIVE 4, in case the problem is reflections or ringing.
Also should display packets on screen to make sure that reading is not the source of the problem.
RX of packets now seems rock solid, and pinging with 950 byte packets results in reliable responses. The TFTP server however stalls sending all the time, because of RX errors on the Linux side, i.e., we are still TXing the odd dud packet. This probably means we should re-check the TX phase.
Quick test with trying all four TX phase offsets shows problem still happening. Further investigation required.
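For reference, a phase sweep like this can be scripted so that each offset gets a comparable failure rate. This is only a sketch: `set_phase` and `send_and_check` are hypothetical callables supplied by the test harness (for example, one that writes the phase setting and one that sends a packet and reports whether a valid reply came back). Neither is an existing API.

```python
def sweep_tx_phase(set_phase, send_and_check,
                   phases=(0, 1, 2, 3), packets_per_phase=100):
    """Measure the TX failure rate at each phase offset.

    set_phase(phase)  -- hypothetical: selects a TX phase offset.
    send_and_check()  -- hypothetical: sends one packet and returns
                         True if it was received intact.
    Returns a dict mapping phase -> fraction of failed packets.
    """
    results = {}
    for phase in phases:
        set_phase(phase)
        failures = 0
        for _ in range(packets_per_phase):
            if not send_and_check():
                failures += 1
        results[phase] = failures / packets_per_phase
    return results
```

If the phase control is working, at least one offset should be clearly worse than the others. A flat failure rate across all four would suggest either that the adjustment is not taking effect or that the cause lies elsewhere.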
Using BUFG on the ethernet clock has fixed almost all the RX and TX errors. We are still seeing the occasional lost packet, but no more packet corruption caused by fluffy clocks.
Now, we have a remaining problem where the ethernet locks up and won't TX anymore. It still sees packets arrive, but seems unable to respond to them.
Doing a soft reset of the machine doesn't fix it, so I assume it is something in the ethernet controller's internal state machine.
TX lockup actually looks like the ethernet PHY chip gets its MDX status confused.
Otherwise, the only problem we are still seeing is packet loss in the TX direction of ~2-3% without any real explanation.
Removing and reinserting ethernet cable clears MDX status, but doesn't reestablish TX of packets.
Added instrumentation to see TX state machine status, and it looks like it is stuck counting down the TX delay interval.
The new logic for resetting the ethernet state machine seems to work to clear the situation, further confirming it is just a VHDL state machine problem.
That it is a VHDL problem is confirmed by the addition of a "soft reset" option that resets the ETH TX state, without resetting the PHY itself. Using this soft-reset option also unblocks sending.
Problem can be stimulated just by trying to send a bunch of packets in quick succession. In my test, they were just duplicates of 908 byte long ICMP reply packets. What is really weird is that a symptom of the problem is the TX wait counter counting endlessly, indicating that it is getting into the IdleWait state, and entering that state is the only action that the software reset performs.
BUT if you assert the soft-reset for too short a period of time, it doesn't fix the problem.
Ethernet is flaky, presumably because we latch data on rising edge, when it is in fact clocked on falling edges.