Open kristomu opened 4 years ago
Yes, I am interested; can this technique be applied to non-MFM formats? From looking at the code, it seems that kmedian.py knows nothing about the format itself, so it seems so.
The maths is somewhat beyond me and Wikipedia isn't helping, but my impression is that k-median analysis is basically the Correct™ way of doing the analysis which I hacked together. If so, this would definitely be an improvement. If not, what's the ELI5?
It looks like you're analysing an entire track at a time rather than taking it sector-by-sector. The results obviously speak for themselves, but I would have thought that the garbage data and misalignment between records would throw off the results --- particularly as records written on different machines would have slightly different clock rates. This was one reason for the approach I took. Or is the algorithm capable of compensating for this?
It can be applied to non-MFM formats as long as you know the number of bands. I think FM only has two; and something exotic like (1,7)-RLL would have more (perhaps 7).
The ELI5 of k-median is: Find k points (clusters) so that every other point is as close as possible to its closest cluster. (That is, the sum of distances from every other point to its closest cluster should be minimized.) Each band get its "representative", and each point that's closest to the first band's representative is counted as belonging to the first band.
This naturally resists garbage because suppose that 80% of the track follows a certain pattern, and the other 20% is garbage. Then the garbage pulls on the clusters less strongly than the 80% that follows the pattern: it isn't "worth it" to move the clusters closer to the garbage because that moves them further away from the pattern indicated by the regular sectors.
It might have a greater chance of getting confused if the different sectors are written by different drives and there are some marginal situations in one of the sectors, e.g. sector 1 is very fast and sector 2 is very slow and there's no good compromise between them. (Edit: I found an example.)
But the runtime of the code I've written is dominated by the time it takes to read off the pulse delays, so it shouldn't be any slower to do it per sector. The harder problem is to find out where the sector boundaries are because you need to find the preamble (00... A1A1A1) to determine the start of a sector, but you can't do that if you're classifying the pulses wrong.
Local dewarping handles that problem because it doesn't need to know where the boundaries are to correct them, but it's currently quite slow.
And finally, if 80% is garbage and 20% is data, e.g. only a few sectors on an otherwise unformatted track, then it would fail. The algorithm would lock on to what there's more of, and in that case, it's the garbage.
EDIT: The problem of finding out where parameters change (e.g. between an MFM sector and an FM sector, or between noise and MFM, is a special case of what's called change detection. I have some ideas of what kind of algorithms could be used to solve this problem, but better not get too far ahead of myself. AFAIK, KryoFlux uses information entropy to distinguish noise areas from data areas. A very simple approach would be to use the global k-median, decode what sectors can be decoded, then run k-median on the remaining undecodable areas, and repeating the process until everything is decoded or nothing changes.
This looks fascinating. Are you wedded to GPL though?
About the license, I would like my effort and source to be available to as many people as possible, which is why I use GPL as my default license. When porting the Python code to C++, I'll add it to my flux-analyze repo under that license.
However, I don't need either FluxEngine or GreaseWeazle to be GPL, and I'd cross-license my code to the license they're using. I'd probably add a link to the flux-analyze repo in my source, and I'd appreciate it staying there as a reference, but I won't press that to the point of license.
So TLDR, I'm not wedded to the GPL for inclusion into other code, only my own projects :-)
Thank you! Are you happy for your source to be modified under the cross-license, as long as the reference to your GPL 'master' version remains? This is slightly academic at the moment as I'm not at the stage to be concerning myself with better bitcell-band algorithms, but your method looks very promising indeed. It's got to be worth having at least as part of a suite of available algorithms. But when I'm ready to investigate further I will make proper contact and work out the details with you.
All my code is unlicense (PD basically) but I only need a compatible cross-license, so something weak but 'proper' like MIT/BSD would do, and I can account for such details in my license file.
I did a first draft of the optimal k-median code in C++, just optimalKMedian(). It's a simple mechanical translation of kristomu's source with some C++isms. Naturally it doesn't work.
https://github.com/davidgiven/fluxengine/blob/kmedian/lib/kmedian.cc
Tomorrow I'll try running the Python and C++ versions side by side to see how they behave differently, but for now it's late and I'm going to bed...
In terms of integrating into FluxEngine: I'll have to refactor all the decoders, which are currently all based on my own algorithm which detects the clock rate from pattern-matching intervals to locate the start of records. So for now it's just bolted crudely onto fe-inspect, which has the histogram and its own clock-guessing code, so it's a natural fit there for testing.
@kristomu: if you're happy with this (and if it ever works), let me know exactly what boilerplate text you want and I'll happily put it in. My translation is clearly still your code.
For actual FluxEngine usage, you'll need to use wt_optimal_k_median, not optimal_k_median. The n in the former is on the order of 100 (basically the resolution of your timer), while the latter can be on the order of millions.
I'll take a look later; I've been a bit busy cleaning up some completely unrelated code today.
I think your medianPoint is wrong (suppose e.g. high = 10, low = 9). It should be
float m = (high + low - 1) / 2.0;
return (points[(int)floor(m)] + points[(int)ceil(m)]) / 2.0;
With this, I get the expected {22, 34, 47.5} clustering with test case {22, 36, 35, 22, 22, 23, 37, 22, 35, 47}.
D'oh. Yes, after fixing that, it produces exactly the right results; see the four lines at the bottom for the new algorithm.
From your comment, the weighted version should be quicker, right? The performance of the current code is not brilliant, but should be acceptable. I can also generally improve the performance. Long term I'll want the local dewarping (probably as an option).
I had a look at the paper. Yikes.
(I used to play ZZX! MegaZeux, too...)
Clock detection histogram:
22 1.83 97752 ███▍
23 1.92 1123574 ████████████████████████████████████████
24 2.00 792316 ████████████████████████████▏
25 2.08 34105 █▏
...
34 2.83 26771 ▉
35 2.92 430815 ███████████████▎
36 3.00 535126 ███████████████████
37 3.08 231598 ████████▏
38 3.17 19647 ▋
...
46 3.83 18892 ▋
47 3.92 187522 ██████▋
48 4.00 230945 ████████▏
49 4.08 99943 ███▌
50 4.17 12037 ▍
...
Noise floor: 11235
Signal level: 56178
Peak start: 21 (1.75 us)
Peak end: 26 (2.17 us)
Median: 23 (1.92 us)
Calculating centres for 3 bands...
center #0 = 1.92 us
center #1 = 3.00 us
center #2 = 4.00 us
With this, I get the expected {22, 34, 47.5} clustering with test case {22, 36, 35, 22, 22, 23, 37, 22, 35, 47}.
After adding the test case... are you sure? I'm getting {22, 35.5, 47}, which seems reasonable given the clusters are {22, 22, 22, 23, 22}, {36, 35, 37, 35} and {47}.
Doh, I looked at the wrong line. Yes, it's {22, 35.5, 47}.
It can be verified by doing kmedian.optimal_k_median(np.array([22., 36., 35., 22., 22., 23., 37., 22., 35, 47.]), 3)
in Python.
And yes, the weighted version is quicker - much quicker. I added a benchmarking function to mfm_decoder.py, and then this code in ipython:
f = get_pulse_deltas("warped.au")
time_weighted(f[:100000])
gave me
Calculation time: unweighted: 51.10803 sec, weighted: 0.02658 sec, speed factor: 1922.89 x
But if you're getting reasonable results unweighted, it may just be Python-native ops being slow again.
As for local dewarping, it's still kind of kludged together; I'm considering how it may be made more principled, so "long-term" sounds perfectly good.
(If you've played MZX, perhaps you've also played Univers? That's mine.)
Yeah, I figured out I could just run the python version about five minutes after I posted!
I have the weighted code working now --- yes, it's drastically faster, even in C++. For a representative track, the unweighted code was about 200ms, the weighted too short to measure. That's fine, but I had one gigantic fifty-revolution track where the unweighted code took minutes and the weighted code about 200ms... It's probably possible to improve performance again by converting a lot of the code to use integers, but I don't think that's necessary at the moment.
Re using this for real: the simplest thing would be to work on an entire track at once, analysing and classifying each interval, and then converting that to a raw bitmap. This would drastically simplify the decoders. For cases where different records use drastically different clock speeds I think I'd prefer to rely on the local dewarping. The one place this would break down is for disks which use multiple different formats on a single track: if some sectors are two-band FM and some three-band MFM, then this could confuse the classification. However, I haven't actually seen any yet. The only multiformat disk I've encountered was a Amiga/PC disk, which used MFM for both formats.
(I stopped using the old clock autodetection logic --- which worked by finding the biggest peak in the interval histogram and using that as the clock --- because it was too easily spoofed. For example, an MFM disk containing normal data typically has the biggest peak being the bottom one. But an MFM disk containing all zeroes has the biggest peak being the second one. But the k-median algorithm doesn't suffer from this. A massive improvement!)
So having reworked the decoders to use the new system, I've tried it in real life. It seems to be marginly worse than the old code. This could be something I'm doing wrong with regard to classifying the intervals based on the cluster sizes, but I've found one Amiga disk where the kmedian classification is simply wrong:
Clock detection histogram:
44 3.67 425 ▌
45 3.75 12098 ███████████████▉
46 3.83 30440 ████████████████████████████████████████
47 3.92 30165 ███████████████████████████████████████▋
48 4.00 11542 ███████████████▏
49 4.08 1258 █▋
...
66 5.50 220 ▎
67 5.58 952 █▎
68 5.67 1824 ██▍
69 5.75 2247 ██▉
70 5.83 2058 ██▋
71 5.92 1211 █▌
72 6.00 351 ▍
...
90 7.50 166 ▏
91 7.58 252 ▎
92 7.67 372 ▍
93 7.75 412 ▌
94 7.83 321 ▍
95 7.92 164 ▏
...
Noise floor: 304
Signal level: 1522
Band #0: 3.83 us
Band #1: 3.92 us
Band #2: 5.83 us
You can see that the third peak is very small, which appears to cause the algorithm to miss it. As a result, bands #0 and #1 are both referring to the first peak. This track mostly contains zero bytes, so there are an awful lot of 0xaa MFM bytes, which is why the first peak is so massive.
I assume this is a failing in the way the weighting is done. The obvious solution is to add an option which falls back to the old non-weighted algorithm --- this can actually happen automatically if we observe that the band spacing makes no sense, which is what's happening here --- but I wondered if you had any better suggestions.
Weighted and unweighted should always produce the same result; if there's any sequence where they don't that's an outright bug. The "weighted" means that instead of recording, say, a million points at 47 as a million points, it records (47, 1000000), so that the value 47 gets a weight of 1M explicitly instead of implicitly.
I've had similar problems with the local dewarping: some sector has a run of only one or two bands, and that confuses the algorithm into assigning two cluster points to one band. Ultimately, the problem lies in that it's hard to distinguish between a few excessive outliers and a band that just happens to be sparse.
Could you give me a flux file of this Amiga disk? If it contains private information, could you give me a single track or sector - enough to demonstrate the problem? It'd be easier to experiment with mitigation measures if I have data to work on.
(Could you also give me an example image where it does worse than your original approach? I could test if there's something wrong with the classification or if the k-medians algorithm itself fails.)
I've attached it --- it's a simple Workbench image. I think. (I should get an Amiga; I hear they're cool.)
I'd think that the most likely candidates for problems is that somehow I managed to botch the weight calculation code --- I did verify that the cdf array for the test case is identical to the Python version, but of course that's just a test case. That's here: https://github.com/davidgiven/fluxengine/blob/kmedian/lib/kmedian.cc#L48
The easiest way to test is to build the kmedian branch and then do:
./fluxengine inspect -s amiga.flux:t=0:s=0
That will then show the histogram and the positions of the detected bands. --bands=3
will let you select how many bands get detected.
(It's horrifying, but using a large number of bands, like 10, shows very distinct groupings which are easily detected --- after all, physics requires them to be at least 1us apart. This would fairly straightforwardly allow automatic detection of the number of bands. Right now each decoder needs to specify how many it's expecting.)
You could write a tester by generating a random sample of say, 10 integers uniformly drawn from either [22, 26), [30, 35), or [40, 47) (for each of the 10, choose the interval at random and then choose a number within the interval). Then test it on unweighted and weighted both. If you get different results, then you have your test case. Otherwise loop back to random testing. (Or use a fuzzer, if you have one.)
I'll check what Python gives.
Bad luck:
pulses = get_pulse_deltas("tracks/amiga.au")
kmedian.sum_distances(pulses, np.array([45., 46., 69.]))
gives 80745.0, and kmedian.sum_distances(pulses, np.array([46., 68., 92.]))
gives 82977.0
(lower distance sum is better)
So this is no fluke. I have to think how to deal with it. The obvious way would be to use Euclidean (squared) distance and thus do k-means instead of k-medians, but that will make it more vulnerable to outliers, too. (And I just checked - it doesn't help here.)
I found some kind of bug, though: it seems the combination of my logic and your code doesn't like floating point values (probably numerical precision problems). Not multiplying by US_PER_TICK in getPointsFromFluxmap
, and instead multiplying each cluster value by this just before returning the cluster values to the rest of the software, seems to improve results considerably.
That is, float point = fmr.findEvent(F_BIT_PULSE);
in getPointsFromFluxmap
, and something to the effect of
std::vector<float> second = optimalKMedian(getPointsFromFluxmap(fluxmap), clusters);
for (int i = 0; i < clusters; ++i) { second[i] *= US_PER_TICK; }
return second;
in std::vector<float> optimalKMedian(const Fluxmap& fluxmap, int clusters)
.
I've also found a quick hack for Amiga. The problem is that the lower cluster is too "heavy" and drags other values towards itself. But it's not so heavy as to completely dominate the result. So you can first generate a clustering of, for instance, 5 clusters, and then do a 3-clustering on the result. That down-weights the very heavy lower cluster and thus the second clustering produces the right result.
I'm increasingly thinking that the correct approach is closer to k-center (with outliers) than k-median, because there's fundamentally no way to mathematically prevent the lower cluster from pulling too heavily within the context of k-median. I might also need prior knowledge (e.g. of the sort "bands are at least 1 us apart") to better inform the algorithm. In the long term, I might implement k-center with such cues, but the down-weighting approach should work for now.
EDIT: Some preliminary testing with my set of 256 floppy disk images: the hybrid approach (double clustering) had fewer bad sectors than the non-kmedian approach on 50 (19.5% of the total), more bad sectors than non-kmedian on 11 (4.2% of the total), and the same amount on the remaining 195 (76%). Of these 195, 138 had no bad sectors with either approach. I have not checked if any of the same had different bad sectors but the same amount.
Do you have any specific test case for the precision issue? I'd like to look into that more; eventually I'd rather like to convert the algorithm to use integers and tick counts instead of floats, and would like to know which bits require precision. Most of the algorithm is simple summing, so I would have thought precision would be required.
I've played with double clustering, and I'm not entirely convinced. It introduces a new tuning parameter, which is the number of clusters in pass 1 vs pass 2, and getting this wrong appears to produce much worse results than without double clustering --- and the best value changes from image to image. I need to do some more thorough testing, however.
Speaking of tuning: I'm surprised that the original kmedian algorithm doesn't have any parameters. Normally I'd expect that if it were possible for one peak to pull too heavily on another then there'd be some way to determine how hard it was pulling, but there isn't anything. I normally associate parameterless algorithms with mathematically 'perfect' ones, where there is only one possible correct answer, but that's not the case here. While statistics is mostly a closed book to me (I chose the mechanics module instead...) this seems peculiar.
I couldn't find a test case, so I might have been mistaken. I'll let you know if I find any. However, I did find a case that produces an overflow or loss of precision due to the track being very long. https://github.com/kristomu/flux-analyze/tree/master/tracks 's RETRY-77-4-t69.0.
As for the integer port, here's mine: kmedian_int.zip
I just turned sumToHere into a 64 bit unsigned int (to overcome the overflow problem), and then followed the suggestions of g++ -Wall -Wconversion
. I've done some tests: it seems to work fine. There's a possible long double -> double truncation with the implicit M functions, but I don't think it should be a problem.
(Some floating point is still required because medians can be between values, e.g. 23.5)
About tuning, well, you're right. The k-median problem itself is like "find the best linear fit by the least squares metric", in that there are no tunable parameters. It just does what it says. In k-median's case, that is to find a number of points where all the other points have a minimal distance to the closest cluster point. But like outliers may throw off a least squares fit, k-median may get too attracted towards heavy point concentrations. The algorithm gives the perfect answer... to the wrong question.
It is possible to generalize the problem. Consider k-median as one particular instance of the problem: you take every point, throw out a certain fraction of them, and then try to find the clusters that minimize maximum distance from the other points (i.e. no longer the sum of distances, but the worst distance possible). If I recall correctly, k-median is equivalent to throwing out 1/(number of clusters+1) of the points and then doing this process.
This throwing out the points is what makes k-median robust to noise, and thus makes it easier for it to read corrupted tracks. However, in cases like the Amiga image, what it throws out as noise is actually relevant data. After the third band has been thrown out, it then proceeds to classify the other two bands perfectly, and assigns the third cluster to the first band because it has to go somewhere.
Double clustering makes it throw out less by increasing the number of clusters, which decreases the number of points it can throw out. It's somewhat of a hack, because at some point, the extra clusters start pulling on the values again. So in the ideal case, we'd want the fraction of points to throw out to be a parameter. The parameter could be set low for noise-free tracks and high for noisy ones.
But there's a point to all of this that we shouldn't disregard: the system can't know if the odd points out are noise or not. (Not without more cleverness or priors; I may have some ideas, but that's a different matter). So it can't know if it is to throw out a lot or only a few. That means that it has to make multiple attempts with different parameters.
So it might be a good idea to have a flow for damaged tracks that goes:
To some degree, I can use information about how bands are typically aligned to get around having to decode multiple times. E.g. the band timings should fall on a line (1 us, 2 us, 3 us; or 2 us, 3 us, 4 us). But it's not foolproof.
EDIT: The point about k-median being equivalent to dropping a certain number of points is false, but the general point is right: k-median is less sensitive to extreme values than k-center (the "minimize maximum distance" version), and that is both its strength and its weakness.
Here's a file that does worse with float than with integers: 88-flux.zip
1 good sector with float, 8 good sectors with non-median, 16 with integer.
And because I just couldn't let go: here's a k-center with outliers implementation!
It gets some marginal tracks, but it's not as much of an outright slam-dunk as I would've hoped, as there are other tracks in my testset that your approach gets that mine doesn't. I suspect that dewarping (or something like the floppy disk separator hardware's PLL clock drift compensation) could help with that, but I think I'll take a little break from this particular problem now :-)
I've implemented a different strategy now, which I think works better than kmedian or kcenters alone. It's based around sign bit patterns (increases and decreases in values), but is less format agnostic (i.e. needs different details for MFM, FM, GCR...).
Consider the A1A1A1 preamble, and suppose the shortest pulse delay category is band 0, then band 1 is longer and band 2 is longer still. Then the beginning of such an A1 sequence (starting with a zero bit from the 00 just before the A1A1A1) manifests as a pulse in band 1, then 2, then 1, then 2, 1, 0, and so on.
That means that whatever the actual pulse delay intervals corresponding to bands 0, 1, and 2 are, you would see first some pulse delay (in band 1), then a longer delay than the last (band 2), then shorter delay than the last (back to 1), longer, shorter, shorter... And it's possible to search for "shorter than the last one" and "longer than the last one" without knowing what the delays are.
Then once you find a potential match, you just assign the pulses to each interval (the first pulse must be in band 1, the second in band 2, the third in band 1, so on), and infer an interval and a clustering consistent with those pulses.
I've updated my flux_analyze python script to use this strategy, and it seems to work quite well. Since it has no preconception of where the bands are, it should find every preamble that hasn't been subjected to warping of the disk medium, or that hasn't been corrupted (missing transitions, too heavy magnetic noise, etc).
Does that seem like a better strategy that you'd like to implement? Alternatively, I could try to test it on more full flux images to get more evidence that it works, so that we don't get a repeat of what happened with k-median; but I'd need to know more about the .flux format itself to do so, as going the long way around .au is getting a bit cumbersome. Do you have a specification of the .flux format anywhere?
Sorry, I've been buried in other things --- I owed you a reply...
That's an interesting idea. It's actually not dissimilar to what I'm doing right now, which is to look for a pattern of intervals which has the same ratio as the one in the search template. That code handles multiple length gaps, though, and I hate the code. I'd love to simplify it.
The most obvious problem I can forsee is spoofing; it can't tell the difference between 1111 and 01010101 (or 001001001001). I ran into this with my early clock detection code. A decent sync sequence shouldn't suffer from that, but right now it's too late for me to be sure.
A question: if the search pattern looks like 11010011, then the search template's going to be interval+longer+longer+shorter. If we assign the interval value 1, then that will produce the intervals 1232. However, that last 11 should have the same interval as the initial pair of pulses --- 1231. How do you handle this?
The problems you mention seem to rarely occur because of a combination of how the logic works, and some rather fortuitous properties of the sync bit sequence.
The way I do it is: I first match an increase/decrease sequence, then I pull the matched pulses into interval, and then I check that they don't overlap. If they do overlap, the algorithm considers the match to be false and doesn't do anything more with it; otherwise, it uses the intervals to construct a clustering that assigns every delay inside the kth band interval to the kth band.
So, to first answer your question, suppose you have a pattern that looks like 10 100 1000 100, i.e. bands 1232, and suppose the search pattern falsely matches to the flux delays 20 30 40 20 (i.e. 1231). Then the interval categorization will put the first delay into band 1, the second into band 2, and so on. The result is that both the first and second band now contains the delay 20. The intervals overlap, so it can't work and so the algorithm throws that match out. The increasing/decreasing search pattern is thus just the first step, which is verified by whether it's possible to group the matches pulse delays into non-overlapping bands.
Spoofing could happen - it definitely could happen with a maliciously engineered flux recording - but I haven't happened upon any happening by chance. Two nice properties of the sync sequence works in its favor.
First, the sync sequence pattern hits every band. That means that you can't get a mismatch due to every band being shifted (e.g. mistaking 10 10 as 100 100) because such a false sequence would either need a fourth band that's misrecognized as a third band, or a zeroth band that's misrecognized as a first band; neither a fourth or a zeroth band is possible by MFM rules.
Second, it is pretty long. Noise could make pulses in the same band look like they belong in a different band or vice versa, e.g. 20 34 42 21 in the example above would match 1232 with a very small band-1 interval and a very large band-2 interval. But the noise within each band seems to be random, so the chance that there's not a single band violation in the whole of a false match is very low. At least that's what I think is happening as I don't see any such false positives.
If we want to be sure, it would be a pretty easy matter to check for any outlier matches (e.g. very small or large intervals), and discard them. Or discard any matches that come right after a too-short DAM or IDAM, by the reasoning that those pulses belong to the previous structure instead.
(A final note: the algorithm doesn't work for finding consecutive runs of the same band, because the added noise could perturb the points in any direction. E.g. a run of two band-1 pulse delays could be 20 21, 21 20, or 20 20. I treat those as wildcards in the search algorithm -- fortunately, neither the A1A1A1 or C2C2C2 sequences have any such runs.)
Well, that was stupid. I was looking at the wrong bit of the side byte, which meant I was thinking that all sectors were side 0. Please try again --- I think you'll get a clean read. (Client update only, no new firmware needed.)
Thanks! That works.
I don't remember saying I had this problem, but maybe I did.It's either that or you read my mind :-)
Yeah, that comment was supposed to be on a different bug... anyway, thanks for testing!
Re the proposed detection logic: I think I understand how it's supposed to work now --- the template knows which band each interval is destined for, right? And the shorter/longer/same matching is just used for matching the template against the pulsetrain?
That sounds very plausible. Compared to the existing logic which searches for beginning-of-record markers and calculates the clock from the length of the matched pattern, the new logic would be simpler and faster. TBH, I'm not sure it would be actually more accurate, but being able to get the individual interval lengths would be a win all by itself and it's a good opportunity to overhaul the matcher code. I'll see if I can make it work and report back.
It'd be a shame not to be able to use actual statistics, though...
That's right about the template. The whole thing is a two-step process: first find potential template matches by increasing/decreasing, then verify by placing the delays into band categories.
I have an example flux where my python code does much better than current FluxEngine; I'll see if I can upload it. I think it only has 0xCF or 0x00 data bytes (i.e. is unformatted, no data I can't share), but I'm not entirely sure. Maybe I'll try to send it to you by mail instead, although I think your mail provider ate my mail last time I did that.
Some further notes: I don't recover a base clock frequency this way, I directly recover band intervals. This seems to improve accuracy. Secondarily, I determine every consistent extension of the interval by reading ahead, and then I try them all (get_all_bands and full_brute in mfm_decoder.py) which seems to further improve the results.
With full_brute, everything that could be recovered by magically guessing band intervals and then getting lucky (or by using stats) should be recoverable by the increasing/decreasing strategy. So there simply should be no need for fancy stats.
(The only problem is if it returns, say, five thousand valid extensions, all of which produce an error-less decoding. Since checking against CRC 5000 times would give an unacceptable rate of false positives, the brute approach would not work in that case. But none of the sectors I've encountered has been like that.)
EDIT: A suggestion I haven't implemented myself: if you can, search for the whole preamble (e.g. A1A1A1F8, FB or FE; C2C2C2FC), don't just search for A1A1A1. Including the Fx both gives more points to determine the interval with, and tells you which chunk type it is ahead of time.
https://github.com/kristomu/flux-analyze/blob/master/tracks/RETRY-77-4-t69.0.flux
Current FluxEngine gets 12 good sectors and one bad sector. demonstrate() gets 15 good sectors without full_brute (also if the semi-brute around line 150 of mfm_decoder is commented out), and 17 good sectors with full_brute.
I'm not sure FE has found 12 good sectors. It writes some dubious non-zero bytes into the output IMG. And if you generate an SVG, it shows only 6 good sectors.
If you're getting 15 genuine good sectors that's very good indeed! The input flux looks a mess.
I've been running FluxEngine on some floppy disk images I dumped with KryoFlux a few years ago, trying to recover data that KF could not. I have found sectors that can be recovered with FluxEngine but not KF, but also sectors where FE fails and KF succeeds.
In particular, I have an MS Plus disk with some jitter/warping of the surface that causes clocks to vary in a consistent pattern. This seems to be marginal enough to make FE lose its bearings. I've dumped an affected track here: https://github.com/kristomu/flux-analyze/blob/master/tracks/MS_Plus_disk3_warped_track.flux
I also fell down a rabbit hole of trying to make an MFM decoder myself, and wrote this: https://github.com/kristomu/flux-analyze which does recover every sector from the warped floppy, and that I think would be quite robust (and not need tweaking parameters). I'd be interested in porting the k-median algorithm over to C++ (if you'd like to use it in FE), but I don't know how to integrate it into the rest of FE.