Pulses being dropped on sampling

davidgiven commented 5 years ago

My logic analyser has finally arrived and I've determined that the sampler is, indeed, dropping pulses. Top row is what FluxEngine's sampler is producing, the bottom row is what the logic analyser sees.

screenshot

This one's from a Mac 800kB disk, but this shows up on the ND 17b disks too (and probably lots more). I have no idea why it's not showing up more. We're reading both HD and DD disks quite happily in other formats.

davidgiven commented 5 years ago

The good news is that because a pulse is simply dropped but the timing doesn't otherwise change, it looks like whatever's happening is in the sampler or upstream, and not a bytecode processing issue --- as the bytecode concerns itself with intervals, if an interval was dropped I'd expect the timing to change.

I wonder if I've remembered to properly synchronise the pulses before sampling them?

davidgiven commented 5 years ago

Well, well, well... top line, after pulse conversion; bottom line, raw from the disk. Looks like the pulse conversion is borked. This might be an easy fix.

screenshot

davidgiven commented 5 years ago

Actually, I now think this is insufficient resolution on the logic analyser: it only goes up to 12MHz, and the pulses from the FDD are shorter than this. So the pulse conversion has worked but we can't see the originating FDD pulse.

There is something very odd going on with the sampling clock, though.

pelrun commented 5 years ago

Hang on, I'm confused now about which side is losing pulses. Is one of the images above labelled backwards by any chance?

Now that you mention it, I don't think you're using the Pulse Convert blocks correctly either. They're meant to convert from the sample_clock domain to the output_clock domain, but you're using the same clock input for both. What happens if you connect sample_clk to BUS_CLK instead?

davidgiven commented 5 years ago

I believe the top screenshot was labelled incorrectly --- it was posted very late at night, and yes, I had misread them anyway.

I no longer believe that this is as simple as I thought. I managed to persuade the logic analyser to go all the way up to 24Mz, which is long enough to reliably capture the (very very short, maybe 150ns) pulses from the drive, and I don't think I can see any signs of dropped pulses. However I'm still struggling to find a way to display two logic analyser captures side by side. I may have to write some custom code for this.

Regarding the pulse converters: I did spot that one, and changed it, but it doesn't appear to have made a difference. I am not at all sure I'm doing this right. I tried exporting 12MHz counter clock signal to the logic analyser and it appears to be painfully garbled but that may be a logic analyser artifact.

There's also something really weird going on with clock timings. I have to explicitly sync the 12MHz clock against BUS_CLK (the block's on the main page), or else I get timing errors regarding crossing asynchronous clock paths in the UDB data blocks... but the 12MHz clock is derived from BUS_CLK, so should surely be synchronous with it. I just don't understand how that works.

I am very tempted to go back to the old pure-logic sampler setup, without using the UDB datablock, but that would involve both giving up the DMA FIFO but would also require changing the bytecode format to something simpler and higher bandwidth.

pelrun commented 5 years ago

facepalm - the RDATA input pin is configured with CMOS logic levels, which has the highest Vih threshold of the available options (3.5v). It should really be using LVTTL, which has a Vih of 2V.

The pin config can also automatically synchronise the input to BUS_CLK, I think that should really be enabled too.

davidgiven commented 5 years ago

I have a vague memory that I tried the LVTTL option, but I'll certainly try it again. Likewise pin synchronisation. That shouldn't be necessary given the pulse converter but at this point I'll try anything...

Another thing is to remove the explicit CounterClock synchronisation and use a UdbClkEn directly attached to the UDB component. I am wondering whether the Sync block attached to CounterClock isn't doing what I thought.

It really doesn't help that the build tools only work on Windows and my logic analyser only works on Linux. I've been doing a lot of rebooting.

davidgiven commented 5 years ago

LVTTL and pin synchronisation: no effect. Worth trying and I should have had them anyway, so thanks --- anything else I'm obviously missing? (I've also discovered that my 12MHz counter was actually about 12.8ish Mhz, so that's fixed too.)

After much effort I've managed to compare the fluxengine output with the raw data from the logic analyser. It... all looks fine to me? I don't see any dropped pulses anywhere.

Here are two wav files, one representing the fluxengine data and the other the logic analyser for probably a single revolution. One track contains the index pulses and the other the flux data. You can load them both into audacity, use the time-shift tools to match up the index pulses, and compare the data directly. The clocks don't quite match up but I don't see any signs of sampling problems.

data.zip

So either this is a decode problem, which I doubt, or something electrical and the drive is giving me bad data...

davidgiven commented 5 years ago

I feel like a bit of an idiot. I've been testing with a Mac 800kB disk, and I've just discovered that if I use --bit-error-threshold=0.3 it (mostly) reads fine. And the Norsk Data disks which have also been exhibiting the problem now read absolutely fine, and have done for a while --- I checked back through the VCS, although gave up before I found the bad commit.

This probably got fixed when I reworked the decoders a while back, and I never noticed. Uh, oops?

Anyway, the changes made here are all good ones, and I have a tonne more debugging tools, including the ability to export flux tracks to both VCD and AU files for logic analysers and Audacity respectively...

davidgiven commented 5 years ago

Got something.

screenshot samples.zip

(You'll need to zoom in, github's image scaling has lost some pulses.)

This is the raw pulses from two consecutive FluxEngine runs on the same disk. Two pulses are definitely being dropped. Sadly the logic analyser wasn't running for this one. I'm using the new fluxengine convert fluxtoau tool to turn flux tracks into audio files which lets me load them into Audacity side by side.

So, this is a hardware bug, and is causing the problems for Mac disks. Weird it only shows up there.

This is a 3.97us clock. The distance between those two narrow pulses being dropped is 36 samples, which at 12MHz is 3us.

pelrun commented 5 years ago

Ouch.

A couple of things I noticed in the sampler:

sampleclocked and indexed are triggering on both rising and falling edges?
The "Load F0 with A0" logic can simplified to (FSM == EVENT_STATE) || (FSM == INDEX_STATE) || (FSM == PULSE_STATE) by just putting an !indexed transition from PULSE_STATE to RELAX_STATE.
The instruction labels INDEX and PULSE seem to be swapped.

Something else about the sampler is bugging me, but I can't put my finger on it yet.

davidgiven commented 5 years ago

Okay, will check those out. The sampler's a complete garbage fire and I'm thoroughly unhappy with it. I'd much rather use simple hand-tooled verilog, or even just raw logic, but the only way of getting a DMAable FIFO is via the UDB F0 register. It is possible to configure a UDB block to accept a parallel input and pass it directly to the FIFO, and apparently there are some prebuilt components to do this available from a blog somewhere, but there aren't any standard components to do so. Which I find frankly strange; aren't FIFOs core building blocks?

I'm also going to try and use the logic analyser to sample the input to the sampler so I can check to see if the pulses are actually present on the input bitstream.

davidgiven commented 5 years ago

Re 2 and 3: yup, fixed. Sadly, that makes no difference.

Re 1: given that sampleclocked = sampleclock && !oldsampleclock, then sampleclocked should only be high if sampleclock is currently high and oldsampleclock is low, i.e. a rising edge. Am I wrong there?

...although, now I look at it... the evaluation order is deceptive: all values in a state are evaluated simultaneously. So the evaluation of sampleclocked used to decide whether to transition from WAIT to CLOCKED is using the old value, not the one which has just been evaluated.

That should be fine, but rdata is connected directly to the input logic. It's evaluated for the transition to CLOCKED to EVENT, which will happen two, possibly three 64MHz cycles after sampleclocked gets set. That 1/32M of a second or about 30ns after the pulse itself. But the pulse has been normalised to 1/12M of a second, so that should be fine, right?

davidgiven commented 5 years ago

No, snapshotting rdata in WAIT and using the recorded value elsewhere doesn't help. Worth a try, though. Thanks for the suggestions!

Back to the logic analyser, I suppose...

davidgiven commented 5 years ago

Well, pants. From the top down:

FluxEngine's captured pulsetrain for two consecutive captures.
First group of three: log analyser capture of raw FDD data, data after NOT gate, data after pulse conversion, for the first capture.
Second group of three: ditto for the second capture.

(The timestamps in the top one are twice what they should be. That's a bug.)

So, you can see that the missing pulses actually come from the drive itself. So... it's nothing whatsoever to do with the firmware, which seems fine.

screenshot samples.zip

I'm at a bit of a loss, now. The Kryoflux is capable of capturing Mac disks, so I should be able to too; I think I'm going to have to assume that the drive is reading the pulses, but it's just not passing them to me. So it must be something I'm doing wrong, and I really can't think what...

andrewferguson commented 5 years ago

[I've been trying to understand this thread, but I'm afraid it's a bit above my knowledge, so apologies if this is an obvious / stupid question].

If the data comes from two consecutive reads, why does one have more pulses than the other? If the floppy drive isn't passing the pulses onto the FluxEngine, then surely it would be doing this for every read, not intermittently.

davidgiven commented 5 years ago

There appear to be plenty of Kryoflux forum posts about people claiming exactly the same thing. Apparently the only real solution is to try another drive. Setting the density select flag sometimes makes a difference, but 3.5" drives mostly ignore this. https://forum.kryoflux.com/viewtopic.php?f=3&t=697

The other odd thing is that the errors don't appear to be random. Some positions on the disk are more susceptible than others. Using --revolutions=20 (the maximum before the watchdog timer fires and kills the board) still reports bad sectors.

I'm wondering now if, e.g., Mac drives suffer from wandering alignment at certain rotational speeds, which makes the signal just a little bit marginal. So, some drives will read them, and some drives will flake out.

Not good news for me, either way.

davidgiven commented 5 years ago

[I've been trying to understand this thread, but I'm afraid it's a bit above my knowledge, so apologies if this is an obvious / stupid question].

There are no stupid questions! Otherwise I would barely be able to talk to anyone...

If the data comes from two consecutive reads, why does one have more pulses than the other? If the floppy drive isn't passing the pulses onto the FluxEngine, then surely it would be doing this for every read, not intermittently.

Yes, that's what's so weird. Floppy disk drives do suffer from transient errors, and you just retry and the error goes away, which is why you always need to specify a format to do a reliable read (so FluxEngine can check for errors using the CRCs). But other than those (and these errors are way too persistent for that) the drive itself has to be reliable. Even a single missed pulse will ruin a read. So, yeah, I've been assuming that the drive was showing me what was there. If that's not actually true, for whatever reason, then all bets are off.

andrewferguson commented 5 years ago

So trying a different drive might help? Well, I have 16 drives*! What specifically should I do? Just try and generate a flux image from the same Mac disk using every drive? What command should I run?

*16 internal 3.5-inch drives. I've have a some more unusual ones tucked away in boxes as well.

davidgiven commented 5 years ago

That... would actually be super helpful, as it would characterise the different behaviour the various drives have. It's a lot of work, though, and I would definitely owe you a beer (or substitute thereof) if you were ever passing through Zurich...

Yes, image the same disk in multiple drives (noting which drive it is!). The command would be:

./fluxengine read mac -s :s=0 --bit-error-threshold=0.3 --retries=0 --write-flux=something.flux

That will only image one side, but that's fine; it'll make .img files but they're largely useless. The important thing to have is the sector map at the end, but the .flux files are worth keeping too.

You probably want to update to the mac branch first, for both firmware and client, as that's got the Mac disk sanity check fix in it which should avoid FluxEngine reporting 255-sector disks.

Thank you very much!

andrewferguson commented 5 years ago

No problem! Thank you for spending so much time trying to troubleshoot this issue. I'll get it done tomorrow morning (UK time) as it's getting a bit late now.

Do you want flux readings of a blank image, a system disk, or that "textfile" disk?

davidgiven commented 5 years ago

You sent me a log of the blank disk, which shows a neat band of Bs in the middle tracks, so try that one? The contents shouldn't matter (although I've encountered formats where it did...).

andrewferguson commented 5 years ago

Here's the thing... I don't know why it did that. It's not "the blank disk", every image I've been sending is using the same physical 800kb disk, but each time I perform a fresh format on the Macintosh Classic II (and then in the cases of minisystem / textfile, add the appropriate files onto the disk). So while it gave a band of "B" in the last read, it may not when I format it again...

pelrun commented 5 years ago

Re 2 and 3: yup, fixed. Sadly, that makes no difference.

I didn't expect it to - just documentation and refactoring, those.

Re 1: given that sampleclocked = sampleclock && !oldsampleclock, then sampleclocked should only be high if sampleclock is currently high and oldsampleclock is low, i.e. a rising edge. Am I wrong there?

You're totally right, I had a brain fart.

If the floppy drive isn't passing the pulses onto the FluxEngine, then surely it would be doing this for every read, not intermittently.

The signals on the disk are analogue signals, so it's entirely possible for some to be weaker than others, and there's always random noise in the signal. As the input signal drops, it doesn't go directly from 100% perfect reads to 0%, instead it becomes probabilistic.

andrewferguson commented 5 years ago

Right, so that didn't go quite as well as expected!

I got through 11 drives, some of which are in pretty poor shape. On the 12th drive, I accidentally plugged the connector in the wrong way (in my defense, the socket was keyed so I couldn't put the connector in any other way. It did seem wrong to me, but I tried it anyway). This resulted in (from what you've said about plugging the connector in the wrong way) a track being erased, so this "perfectly scientific" test is now anything but.

So I've stopped with the intention of doing it all again later. I've uploaded the data from the 11 drives I got through, along with the python script I used to make it. If you want any adjustments made to how I'm doing this then let me know and I'll do that when I do it again (probably tomorrow).

files.zip [~18MB]

andrewferguson commented 5 years ago

OK, round two (ding ding).

After a bit of a faff involving a quick-and-bodged repair to the Mac Classic II, I formatted the 800k disk, then read it using a total of 18 drives. This was the 16 drives I had spare, plus two others currently installed in a PC I had sitting in a box.

You'll find the "drives.csv" file as the spreadsheet indicating the make and model of each of the drives, and linking the drive to the numeric ID of the .flux / .img / .log files that are attached too. I also adjusted the script I used to make the files (included as go.py) so that any errors were written to the log file as well.

Let me know how you get on! I haven't looked at any of the files yet, but from the sounds some of the drives were making, I have a feeling that I have fewer working floppy drives than I originally thought.

multidrives-1.zip multidrives-2.zip

davidgiven commented 5 years ago

Thanks very much --- I've aggregated the data here: https://drive.google.com/open?id=1ZL8ksXs79DX-dkqD-YDy0bccmcYqCpW_UHM7KxG2Yuc

Well, that's... interesting. Some of the drives have relatively decent reads except for the band down the middle. Some produce complete garbage. Some produce no data at all, I mean, literally zero pulses (drives 3 and 10) --- do you know if these work?

The bad news is that I've confirmed this is apparently a known problem with reading Mac disks in PC drives. From some old documentation for the Deluxe Option Board: ftp://ftp.mindcandydvd.com/pub/Optio...Drive_Note.pdf

-- Compatible Drives --
Virtually every 720K drive
Citizen 1.44 Meg
TEAC 1.44 Meg
Toshiba 1.44 Meg

--Incompatible Drlves --
Alps 1.44 Meg
Mitsubishi 1.44 Meg
Mitsumi 1.44 Meg
Panasonic 1.44 Meg

However, this was written in 1986 and drive technology will have changed.

Can I ask you to try one more thing: those drives that produce garbage, like number 4 (the TEAC), intrigue me. Some 3.5" floppy disk drives support an input signal to tell them to select high-density vs double-density media, but most floppy drives ignore this and autodetect based on the hole in the disk. Could you try and reimage that disk with the --hd option and everything else the same? I don't need the flux file, just the sector map at the end of the log.

(I haven't been able to tell which signal level selects high vs double density, so I may have it backwards, so setting --hd selects double density. It has no effect on my drives.)

Thanks!

andrewferguson commented 5 years ago

Aw, I was getting excited here. Running it with the --hd option (everything else the same) gives:

0. 0 B.?.B.......?..........B...........B............BXB.BBBXBXXBBBBB...............B
0. 1 ..........?....?..............X..............B..BXBBBBBBBBB.BXBB................
0. 2 ..........?...............B.....................BBBBBBBBBBBBXB.B.........B...B..
0. 3 .B...B................B.........................B.BB.XBBBBBBBBBB..........B.....
0. 4 ..?.?.........?.B..............BB...........B...X.B.BBB.BBXBBBBB....B...........
0. 5 ...?.....................B..............B.......BBBBB..BBBBXBXXB.....B..........
0. 6 ..........B??.?......B................B.........B.BBBBB..BBBB.BB..B...B.........
0. 7 ......?...?....?..............B.....B....B......BBBBBB..BBBBBB.B...B...B........
0. 8 .........?...?..........B..............B........BB...B.B...B.BBBXXXXXXXXXXXXXXXX
0. 9 ..?.?....B.B??.......................B....B.....XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
0.10 ...?..?.........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
0.11 ..?.............XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Good sectors: 643/960 (66%)
Missing sectors: 174/960 (18%)
Bad sectors: 143/960 (14%)
80 tracks, 1 heads, 12 sectors, 524 bytes per sector, 491 kB total

but then I ran it without the --hd parameter and got basically the same:

0. 0 B.B.B.?.....?.?........B.......................BBXBBBBBXBXXBB.BBB..............B
0. 1 ..........?......B..............................BBBBBBB.BBB..BBB................
0. 2 .......?..?...............B.....................BBBBBBBBBBBBXBBB.........B...B..
0. 3 .B...B....??..........B...........B.............BBBB.BBBBBBBBBBB..........B.....
0. 4 ..?........?..?................BB...........B...B.BBBXB.BBBBBBB.....B...........
0. 5 ..??.........??..........B..............B.......BBBBX..BXBBBBBBB.....B..........
0. 6 .......?B.B??..?...B............................BBBBBBB..BBBB.BB..B...B.........
0. 7 .......?......................B..........B......BBBBBB..B.BBBXXB................
0. 8 .........?....?.........B..............B........BBB.XX.BB..BXB.BXXXXXXXXXXXXXXXX
0. 9 ..?.?..B.B...?....B..................B....B.....XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
0.10 ...?..?....................B.B..XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
0.11 ..?.??..........XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Good sectors: 638/960 (66%)
Missing sectors: 173/960 (18%)
Bad sectors: 149/960 (15%)
80 tracks, 1 heads, 12 sectors, 524 bytes per sector, 491 kB total

So I'm not sure what happened the first time. I don't think it was a one-off, because it also didn't read well on the Saturday (when I gave up half way through when I accidentally erased a track by plugging in the connector the wrong way round).

The link to the PDF in your previous thread didn't come out, mind linking to it again? Unfortunately I don't have and 720k-only drives (well, at least, not spare, and doing a disassembly of one of my Amstrad portables to get the drive is something I'd like to avoid - for now at least!)

davidgiven commented 5 years ago

Re --hd: aw, that's a shame.

ftp://ftp.mindcandydvd.com/pub/OptionBoard/(1989)%20Deluxe%20Option%20Board%20v5.4/Package%20Contents/Documentation/OB_Drive_Note.pdf

I have several 720kB floppy disks. Like you, they're all inside vintage laptops. I console myself with the thought that they all have weird pinouts and wouldn't be any use anyway.

davidgiven / fluxengine

Pulses being dropped on sampling #75