I came across your presentation for 36C3 and was wondering if you have started working on the TRNG and crypto cores. I couldn't find anything related to that in the betrusted-io repos. I am currently working on hobby-grade PUF and TRNG cores for the ECP5 using Migen/LiteX and would be interested in your approach.
As a side note, are you aware of the new CrossLink-NX FPGA from Lattice? If you are unhappy with prjxray, it should significantly lower power consumption compared to the ECP5, and @daveshah1 mentioned on Twitter that the architecture and bitstream are very similar to the ECP5's. In fact, he has already started working on prjoxide.
Only sketches so far. I wanted to make sure I could get the self-provisioned fuse burning to work, which is an undocumented process and has involved a lot of effort (and broken FPGAs) to prove it out (otherwise you have to use an untrusted x86 host running a proprietary tool to provision your secret keys).
With the latest set of commits though, I think I finally have an API for self-setting of arbitrary bits in the fuse arrays, but because burning the fuses is a one-way process I now want to build more of the rest of the system around it before burning up more FPGAs.
Long story short, now is around the time I start digging into implementing these in earnest. My first stop will be to look at the OpenTitan cores for AES and HMAC -- seems like a good place to start instead of writing them from scratch. For the TRNG, we have an external hardware TRNG via an avalanche generator, and I would also like to include a ring-oscillator-based TRNG inside the FPGA as well, to protect against two different threat models (one where you don't trust the silicon vendor, the other where you don't trust the external environment). I've read a couple of papers on TRNGs in FPGAs and it seems doable. My general approach will be to LOC an odd number of LUTs configured as inverters around the corners of the FPGA and have the router try to connect them up (so as to pick up as much routing fabric noise and jitter as possible between them), then use that as a ring oscillator. I'd then make two of those ring oscillators and use them as the data and clock inputs of a flip-flop, hopefully yielding an output that contains some entropy.
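As a rough Migen sketch of that ring-oscillator idea (an illustration only, not the eventual betrusted-soc gateware): an odd-length chain of LUT1 inverters closed into a loop. The LOC placement constraints and keep/dont_touch attributes needed so the tools don't optimize the combinatorial loop away are toolchain-specific and omitted here.

```python
from migen import Module, Signal, Instance

class LutRingOsc(Module):
    """Sketch: an odd-length chain of LUT1 inverters forming a ring oscillator."""
    def __init__(self, n_stages=5):
        assert n_stages % 2 == 1, "an odd number of inverters is needed to oscillate"
        self.out = Signal()
        taps = [Signal() for _ in range(n_stages)]
        for i in range(n_stages):
            # Each stage is a LUT1 configured as an inverter (INIT=0b01 -> O = ~I0).
            self.specials += Instance(
                "LUT1", p_INIT=1,
                i_I0=taps[i],
                o_O=taps[(i + 1) % n_stages],
            )
        self.comb += self.out.eq(taps[0])
```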
One thing I'm looking for is hardware blocks to test the quality of the entropy. If you come across anything like that, I'd love a link.
I just looked at the CrossLink-NX; I was not aware of it. The advertising material claims lower power, but unfortunately the datasheets explicitly omit any power numbers and say "go to our power estimation tool, really". I downloaded the app note and it also lacks numbers. I'm not really willing to go through the effort of downloading and installing a proprietary toolchain just to see if a datasheet claim is real or not -- you'd think that if their static power numbers were awesome they would just include them in the datasheet. I think the asterisk is that they have included fine-grained power control of sub-blocks, which lets you turn off various unused items to save static power. This is a nice feature, and it should definitely help a lot. However, 7-series already has a version of that via the DRP ports for the ADC and MMCM, which are big culprits in dynamic power. The main thing I'm worried about is static power, because in theory the FPGA should be off "most of the time", as the LCD will retain an image with no intervention from the drive logic.
At the moment, however, if I were to consider an alternative FPGA (which I'm not, because we're so far along with the 7-series now), it would be the PolarFire FPGAs from Microsemi. They also claim "lowest power" (and because they are flash-based they have a real reason to believe they could deliver on that), and they have DPA-hardened crypto primitives and features like ECC on internal RAM, which make them less susceptible to fault-induction attacks. If only they were supported by Symbiflow...
Thank you for the detailed explanation.
I've read a couple of papers on TRNGs in FPGAs and it seems doable. My general approach will be to LOC an odd number of LUTs configured as inverters around the corners of the FPGA and have the router try to connect them up (so as to pick up as much routing fabric noise and jitter as possible between them), then use that as a ring oscillator. I'd then make two of those ring oscillators and use them as the data and clock inputs of a flip-flop, hopefully yielding an output that contains some entropy.
That is also the approach I use on the ECP5. dpiegdon/verilog-buildingblocks has a simple RNG to get started with (make sure to read the comments about the quality of the entropy). Funnily enough, ring oscillators cannot be created using the proprietary toolchain. I will also publish my code once the cores are somewhat usable.
One thing I'm looking for is hardware blocks to test the quality of the entropy. If you come across anything like that, I'd love a link.
I am not aware of one, but I will make sure to contact you if I come across such a core.
Just fyi, I've started on the TRNG core:
https://github.com/betrusted-io/betrusted-soc/tree/master/gateware/trng
Any feedback/comments would be appreciated!
Hi @bunnie!
I have a few questions concerning your design:
I think @dpiegdon will also be interested in discussing your design. Please note that I am an absolute beginner in this field, and I would suggest asking real experts about such a complex topic.
Loosely based on https://core.ac.uk/download/pdf/144146883.pdf
The theory is that the "D" pin of the flip-flop is fed by a very fast oscillator, and the "CLK" pin of the flip-flop is clocked by a very slow oscillator whose jitter is larger than the period of the fast oscillator. If the jitter is larger than the fast oscillator's period, then on each clock edge you are essentially interrogating the "D" pin at a random point within its 0->1 cycle.
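A minimal Migen sketch of that arrangement (assuming `fast_osc` and `slow_osc` come from ring oscillators like the sketch earlier in the thread; this is not the actual betrusted design):

```python
from migen import Module, Signal, ClockDomain

class JitterSampler(Module):
    """Sketch: sample a fast oscillator with a slow, jittery oscillator."""
    def __init__(self, fast_osc, slow_osc):
        self.bit = Signal()  # one raw entropy bit per slow-clock edge

        # Use the slow, jittery oscillator as the clock of the sampling flip-flop.
        # (A real design would also re-synchronize `bit` into the system clock
        # domain and probably route the clock through a proper clock buffer.)
        self.clock_domains.cd_slow = ClockDomain(reset_less=True)
        self.comb += self.cd_slow.clk.eq(slow_osc)

        # Because the slow clock's jitter exceeds the fast period, each capture
        # lands at an unpredictable point of the fast oscillator's cycle.
        self.sync.slow += self.bit.eq(fast_osc)
```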
I tried two similarly sized rings initially, and essentially they tended to phase-lock and push out very long runs of 0's and 1's. To wit, the "long ring" on the Spartan 7 cuts through the heart of the logic area, so it seems to pick up a lot of timing noise running through there.
Feedback -- not sure I follow?
Pushing enable to 0 should reduce power, but haven't measured it.
Loosely based on https://core.ac.uk/download/pdf/144146883.pdf
Thanks, I will read it within the next few days.
I tried two similarly sized rings initially, and essentially they tended to phase-lock and push out very long runs of 0's and 1's.
Aren't these flips the only random part of a design based on ring oscillators? I think having different oscillating speeds only makes the signal look noisier, not more unpredictable. I will read the thesis, as I might be completely wrong here. As a side note, there is a PUF design (see figure below) that takes advantage of phase-locking, using D flip-flops to spare the counter and comparator logic, so the output bit is somewhat stable over a short timeframe.
Feedback -- not sure I follow?
I was thinking about something like a linear/nonlinear-feedback shift register to accumulate entropy over time.
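One way to read that suggestion, sketched in plain Python rather than gateware: a conditioner that folds raw bits into a 32-bit Fibonacci LFSR so entropy accumulates over time (a conditioner, not a substitute for a good raw source). The taps below are for the commonly cited maximal polynomial x^32 + x^22 + x^2 + x + 1; this is a sketch, not the eventual hardware.

```python
class LfsrAccumulator:
    """Fold raw TRNG bits into a 32-bit Fibonacci LFSR."""
    TAPS = (31, 21, 1, 0)          # bit positions of the feedback taps

    def __init__(self, seed=0xACE1):
        self.state = seed & 0xFFFFFFFF

    def feed(self, raw_bit: int) -> None:
        fb = raw_bit & 1
        for tap in self.TAPS:
            fb ^= (self.state >> tap) & 1
        self.state = ((self.state << 1) | fb) & 0xFFFFFFFF

    def read(self) -> int:
        return self.state

acc = LfsrAccumulator()
for bit in (1, 0, 0, 1, 1, 1, 0, 1):   # stand-in for raw ring-oscillator bits
    acc.feed(bit)
print(hex(acc.read()))
```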
@bunnie Not related, but I found an existing AES implementation (paper) using Migen (not nMigen). I have no idea about its quality.
Thanks for the links and comments! Sorry, I've had my head buried in bringing up the SPINOR DOPI and I2S busses for the past couple of weeks, so I got distracted from doing crypto stuff.
I'll re-open this issue since it seems to be active and there's useful comments here that others might benefit from seeing.
About the two similarly-sized ring oscillators, the problem is that the phase-locking reduces the amount of entropy available. So yes, eventually you do get a random event when one oscillator manages to overtake the other, but because the two trade information between each other, the actual amount of entropy available is reduced.
Timing jitter has several components, including a cycle-to-cycle component and a long-term component. Long-term jitter can be thought of as a frequency drift; cycle-to-cycle jitter is an immediate, short-term unpredictability in the period of an oscillator. There are different mechanisms behind each type of randomness. With two similar-length oscillators, you're mostly relying on long-term jitter to generate entropy; in the design I've used, I'm focusing on cycle-to-cycle jitter.
The design takes a couple hundred CLBs and chains them together, deliberately weaving the routing through the entire fabric of the digital design. Thus, at each stage, the exact moment of switching for the signal is affected by mutual cross-talk with adjacent wires, and by the local/shared power supply coupling of other logic packed into the same slice (this is a cool thing about the 7-series FPGAs -- basically, you get four LUTs in a slice, they are tightly co-located, and I can tell the compiler to use only one of the four LUTs for one of the oscillator stages; the compiler will then stick three other unrelated gates in the same slice, which increases the noise coupling).
After routing the signal around a couple hundred of these CLBs and doing several laps of the FPGA, I end up with a signal that has a period, if you can call it that, of about 300 ns, but with a fantastic amount of jitter on it -- it's all over the place, maybe tens of ns of jitter on the edges.
The fast ring oscillator is just a way to extract that randomness. I have another oscillator that runs at about 200 MHz -- a period of about 5 ns -- that is quite low-jitter, but still has some jitter and, more importantly, has little influence on the phase or frequency of the slower ring oscillator.
I use the edges of the slow oscillator (with jitter in the tens of ns) to sample the state of the fast oscillator (period of about 5 ns). Because the cycle-to-cycle jitter of the slow oscillator is larger than the fundamental period of the fast oscillator, you end up creating a stream of entropy with a bitrate similar to the frequency of the slow oscillator.
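A toy numerical model of this sampling scheme (plain Python, with the numbers taken from the description above) shows why the sampled stream comes out close to unbiased once the slow clock's cycle-to-cycle jitter exceeds the fast period:

```python
import random

FAST_PERIOD = 5.0          # ns, fast oscillator
SLOW_PERIOD = 300.0        # ns, nominal slow oscillator period
JITTER_SIGMA = 10.0        # ns, cycle-to-cycle jitter of the slow oscillator

def fast_level(t):
    """Square wave: 1 for the first half of each fast period."""
    return 1 if (t % FAST_PERIOD) < FAST_PERIOD / 2 else 0

t, bits = 0.0, []
for _ in range(100_000):
    t += random.gauss(SLOW_PERIOD, JITTER_SIGMA)   # next jittery slow edge
    bits.append(fast_level(t))

print(sum(bits) / len(bits))   # hovers near 0.5 when jitter >> fast period
```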
Chiming in late here, and only after skimming through your comments. Depending on the entropy quality you need, your approach (a few hundred CLBs, loops around the whole fabric) seems rather complicated. If you absolutely must make sure that you have high-quality entropy, it might of course make sense. But I found that with about 15 LUTs or so I can get a very good RNG source: just feed the outputs of four different 3-stage ring oscillators into a LUT4. An example implementation can be found in metastable_oscillator_depth2, with an iCE40 implementation in https://github.com/dpiegdon/verilog-buildingblocks/blob/master/lattice_ice40/random.v , and a proper system implementation for the iCE40 "icestick" dev board at https://github.com/dpiegdon/ringoscillator .
The intention of this chaining is to get a metastable signal, and it works so well that if you route the signal out through a pin, you can see the metastability on a scope. It is much more an analog signal than anything else, so it seems to be really quite good. I've seen very little variation in this over different routings, and even over different chips (different iCE40 HX and LP versions, and even on the ECP5). If routed through something like a cryptographically secure hash, or even a simple LFSR, this should yield very good results. I did some testing; see the inline comments and the ringoscillator README. Also note that the metastable signal often has a bit of a simple bias toward one side. @DurandA found that my implementation of a bias-fixing circuit was far from perfect, so also see the comments on these findings and possible improvements (verilog-buildingblocks/binary_debias.v, and verilog-buildingblocks/random.v for a full debias-into-LFSR chain, as used by the ringoscillator).
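To make the structure concrete, here is a loose Migen paraphrase of that idea (the linked verilog-buildingblocks and ringoscillator repos are the authoritative Verilog implementations; the Xilinx-style LUT1 primitive below would be an SB_LUT4 on iCE40, and the keep/placement attributes needed in practice are again omitted):

```python
from migen import Module, Signal, Instance

class MetastableSource(Module):
    """Sketch: four short LUT-based ring oscillators XOR-combined into one signal."""
    def __init__(self, n_ros=4, n_stages=3):
        self.out = Signal()
        ro_outs = []
        for _ in range(n_ros):
            # Three LUT1 inverters in a loop form one free-running oscillator.
            taps = [Signal() for _ in range(n_stages)]
            for i in range(n_stages):
                self.specials += Instance(
                    "LUT1", p_INIT=1,
                    i_I0=taps[i],
                    o_O=taps[(i + 1) % n_stages],
                )
            ro_outs.append(taps[0])
        # 4-input XOR (maps onto a single LUT4); the result is the
        # "metastable-looking" signal to be debiased / folded into an LFSR.
        xor = ro_outs[0]
        for s in ro_outs[1:]:
            xor = xor ^ s
        self.comb += self.out.eq(xor)
```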
Hope I could give some useful hints, David.
We've finally gotten around to doing a fairly rigorous characterization of our TRNGs, and have improved the designs quite a bit since this thread was opened. You can find our latest results here:
https://github.com/betrusted-io/betrusted-wiki/wiki/TRNG-characterization
I'm reasonably happy with where they are; we're now implementing long-run CI testing (where we have an actual hardware unit hanging off the CI server, constantly looping results into dieharder) to get a wide-angle view of the performance on tests that will probably take up to a week to run without rewinding data.
I was able to get away with no de-biasing or whitening for either source. For the avalanche generator, this means considering only the lower 4 bits of the ADC to be "real" entropy and XOR'ing four consecutive 4-bit-shifted reads to create a 16-bit result. For the ring oscillator generator, it meant a full re-architecting of the original concept. I won't explain the process here, as it's all documented in the link above, but the new architecture is able to create entropy on 32 bits in parallel, with no detectable bias, at a rate similar to that of the avalanche generator.
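For concreteness, here is one plausible reading of that folding scheme in plain Python (a sketch under that interpretation, not the actual betrusted-soc gateware):

```python
from typing import Iterable, Iterator

def fold_avalanche(adc_samples: Iterable[int]) -> Iterator[int]:
    """Keep the low 4 bits of each ADC sample; XOR four consecutive samples,
    each shifted by a further 4 bits, into one 16-bit word."""
    word, count = 0, 0
    for sample in adc_samples:
        word ^= (sample & 0xF) << (4 * count)
        count += 1
        if count == 4:
            yield word & 0xFFFF
            word, count = 0, 0

# e.g. four hypothetical 12-bit ADC reads -> one 16-bit TRNG word
print(hex(next(fold_avalanche([0x3A7, 0x912, 0x5FC, 0x081]))))
```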
A meta-wrapper around the two sources will also be created so that we can offer a tidy TRNG programming API that provides a 36 kib FIFO of random data, allowing the hardware to meet burst demand for random numbers. We'll also incorporate some basic "flatline" on-line health tests which can detect and flag, on the fly, if something has gone horribly wrong with the TRNGs.
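As an illustration of what such a "flatline" check might look like (loosely in the spirit of the SP 800-90B repetition-count test; this is a sketch, not the actual implementation):

```python
def flatline_monitor(samples, cutoff=32):
    """Yield (sample, healthy) pairs; healthy goes False on a long repeat."""
    last, run = None, 0
    for s in samples:
        run = run + 1 if s == last else 1
        last = s
        yield s, run < cutoff

# Usage sketch: wrap the raw TRNG word stream and trip an alarm on failure.
stream = [0x55AA] * 40                       # deliberately flat-lined input
alarm = any(not ok for _, ok in flatline_monitor(stream))
print("TRNG health alarm:", alarm)           # True for this stuck input
```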
This is a comment from Luke Kenneth Casson Leighton lkcl@lkcl.net, which was sent via email but pasted here with his permission as it is relevant to this thread.
---------------- original comment begins here and extends to end of this comment box ---------------
i developed an encryption algorithm some time ago, and had to develop an iterative methodology to ensure it was viable.
the basic principle is that the encryption algorithm is used as a PRNG, deliberately using CTR mode because, ultimately, any one bit of the input should change, on average, 50% of output bits. given this fact, encrypting regular sequentially patterned data should be a strong test of an algorithm.
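A toy illustration of that ~50% criterion, using SHA-256 as a stand-in keyed primitive in a CTR-like arrangement (this is not the algorithm being discussed, just a demonstration of the measurement):

```python
import hashlib

def bits_changed(a: bytes, b: bytes) -> int:
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

KEY = b"fixed-key"          # stand-in key; any fixed value works for the demo
flips = []
for ctr in range(0, 2000, 2):
    block = ctr.to_bytes(16, "little")
    tweaked = (ctr | 1).to_bytes(16, "little")   # flip the lowest counter bit
    h0 = hashlib.sha256(KEY + block).digest()
    h1 = hashlib.sha256(KEY + tweaked).digest()
    flips.append(bits_changed(h0, h1))

# Should hover around 0.5 on average -- but never sit at *exactly* 0.5 per block.
print(sum(flips) / len(flips) / 256)
```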
however you are doing an actual RNG, here, but the principles are the same: you want to know if the output is random, and in particular you want to "guarantee" that, always.
the kicker: you can't.
the problem is: it is impossible to tell if it is always going to be random. the worst nightmare for any encryption algorithm or RNG is that it can suddenly fail. producing biased data or, in encryption suddenly outputting cleartext.
the best that you can do is say, "at least so far we have not detected a problem".
there is, hilariously, a quirk associated with True RNG sources when it comes to testing: actually, real RNG sources can produce runs that genuinely are long runs of 1s or long runs of 0s! it is the PRNGs that are at "fault" here, by not producing such outliers, not the TRNGs!
this has to be taken into consideration when doing "realtime checking" because regularly throwing away a given batch of values just because it happened to fail a "small window" test could actually end up introducing long-term statistical bias.
this is the mistake that many people make on RNG analysis: you can't just test one short batch. you can't even just test hundreds or thousands of batches. you have to perform hundreds to thousands of batch tests, then have a mathematical model that checks the statistical variance (p-values) between the batches. then, finally, test the variance of the histogram of the p-values.
yes, that's literally "a test of the test of the tests".
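A minimal sketch of that "test of the test of the tests" idea: collect the per-batch p-values from many runs of one statistical test, bin them into a 10-bucket histogram, and check that the histogram itself is consistent with a uniform distribution. This mirrors the uniformity check STS performs; it is not the STS code itself, and it requires numpy/scipy.

```python
import numpy as np
from scipy.stats import chisquare

def p_value_uniformity(p_values):
    """Chi-square test that a set of p-values is uniform on [0, 1)."""
    counts, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
    stat, p_of_ps = chisquare(counts)          # expected = equal counts per bin
    return p_of_ps

# e.g. 1000 batches of a hypothetical per-batch test:
fake_batch_p_values = np.random.random(1000)   # stand-in for real test output
print(p_value_uniformity(fake_batch_p_values)) # suspicious if very close to 0
```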
to illustrate in simpler terms: the most face-palming moment i heard 3rd hand was a story of someone who had "rolled their own" encryption algorithm. they were immensely proud of the fact that it had been designed to be "better than standard encryption algorithms", which as mentioned above are based on the principle that if one bit of the key or one bit of the data changes, on average 50% of the output will change.
no, this person was terribly proud of the fact that exactly 50% of the output changed, every time, without fail.
aside from creating catastrophic failures on 1s and 0s counting tests on a per-block basis (exactly equal number of 1s and 0s), this would also result in p-values of exactly 0.5 on tests-of-tests: also a sign of catastrophic failure (non-randomness).
all of which illustrates that this is not something that you decide one morning to do lightly: it's a serious long-term commitment of some serious resources and mathematical (statistical) knowledge.
and, as you're discovering, Bunnie, it also takes some serious computing horsepower. when running STS i bought 8 computers, networked them together, booted them all from NFS root and ran multiple multi-gigabit runs for several days at a time, repeatedly, over a 6 month period.
however what i quickly found was that i was wasting my time by jumping straight to multi-gigabit runs.
bear in mind i mentioned above that the principle is: you can't guarantee true RNG output but that you can detect, statistically, when data is likely to not be random.
the key insight here is that the smaller the "flaw", the harder it is to detect. conversely, the larger the flaw, the easier it should be to detect.
i quickly found, therefore, during the development of the algorithm, that the "large" flaws were very easily detectable by the much faster program DIEHARD, using as little as 10 megabyte files.
given that DIEHARD runs in a few minutes, and given that it can test files as small as 10 megabytes, this massively sped up development time.
however, going back to that observation: clearly it is completely inadequate to only run DIEHARD on 10 megabyte files. that will have caught only the "crudest" of flaws.
the next step is, then: to run DIEHARD on 100 megabyte runs. this is the next level up that catches "less crude" flaws.
you can see where that's going: you eliminate the biggest flaws with the fastest statistical tests, correcting them on a fast iterative development cycle, and move upwards to progressively more stringent tests that take longer and longer to run.
after working through some of DIEHARDER's multi-minute tests i then moved on to CSRC.nist.gov's STS. i first tested:
the key here: at any time if any one of the tests showed any kind of weakness, even the slightest bit, it was strictly and categorically considered to be a catastrophic failure.
this draconian level of judgement is of absolute paramount importance. i cannot overemphasise here enough how critically important it is that any sign of "weakness" in the statistical output is considered to be an outright failure.
this is purely and simply because of the fact that you cannot, ever, declare a RNG to be "truly" secure. the only weapon that you have is these statistical tests and you better damn well listen to it when it says "fail".
after six months of examining output, every day, you get a "feel" for the patterns of the p-value testing, and for the histogram outputs from STS. when do you stop testing? honestly: you will know.
but it is imperative here to understand and appreciate that just because you have run a few tests, even if they "pass" just the once, this is nowhere near adequate to give confidence.
i ran hundreds, close to a thousand, separate runs on the algorithm i developed, examining the output of every single one of them closely to determine if they looked suspicious. p-values close to 0.0000 or 0.9999 were treated as suspicious: DIEHARD, DIEHARDER and STS all warn you about these, but you also need to use your intuition.
one critical thing i found on STS, which only came out of such long-term persistent observation: the Lempel-Ziv "longest pattern" test turned out to be flawed.
in contacting CSRC i learned that absolutely nobody else had found this flaw. the problem is down to the fact that the test seeks out the longest unique patterns, and these patterns have lengths that are, by definition, whole numbers. typically a 100kbit run would have a longest unique pattern comprising 127 bits, give or take 10 either way. the small range, there, should tip you off.
when turning those into p-values, although they were correctly turned into p-values, the problem was similar to how digital audio sampling of pure sine waves produces artefacts when very close to a multiple of the sampling frequency: the p-values were not uniformly distributed, they jumped in increments of around 0.049 due to the limited range of the pattern lengths, which was too close to a multiple of the histogram window (0.1).
consequently, with this "sampling", the p-value histogram buckets for 0.5 and 0.6 showed an absolutely minuscule anomalous skew. i initially thought that my algorithm was at fault: i therefore tested a known-good PRNG source and the same flaw showed up. i therefore suspected the test, and further investigation showed why.
now, the thing was: nobody else had reported this because nobody else had run 1,000 or 10,000 runs of 100k bits, and on the smaller runs (100) the number of samples of p-values going into histogram buckets was too few to be statistically significant for the p-value-of-p-values (test-of-test-of-tests) to pick up.
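A toy version of that aliasing effect (illustrative numbers only, not the STS Lempel-Ziv computation): if a discrete statistic only produces p-values on a lattice spaced ~0.049 apart, the 10-bucket histogram picks up a small systematic skew even for perfectly random input.

```python
import collections

# possible p-values of a discrete statistic, spaced ~0.049 apart (illustrative)
lattice = [round(k * 0.049, 3) for k in range(21)]
buckets = collections.Counter(int(p * 10) for p in lattice if p < 1.0)
print(sorted(buckets.items()))   # one 0.1-wide bucket gets an extra lattice point
```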
the point being: even experienced statisticians do not necessarily know every little flaw in the test suite that they have developed! consequently, what i advise you to do is to find several sets of independent TRNG sources, produce exactly the same length of bits, and re-run just those tests that came up as "Weak" in DIEHARDER.
if that also fails, you might find that, actually, it's the test that's at fault, not the algorithm. but be warned: it could equally be the case that there exists some extremely strange patterning, coming in from an indeterminable source, that is genuinely introducing non-randomness that only that particular test is capable of picking up.
you need to use your judgement here, and do a thorough investigation.
i hope that this gives you some insights into what you have committed yourselves to. you're in it for the long haul, and yet you do not need to burden yourselves with drastically slow computationally-expensive testing. there is a way that you can do much more rapid prototyping, despite the responsibility involved.
you also unfortunately cannot implicitly trust the tests themselves, and need to verify them against alternative independent PRNG and TRNG sources.
also, remember that some forms of "whitening" will actually, far from "cleaning" the source, introduce long-term statistical artefacts that are only detectable by seriously long testing runs. you particularly need to take that into consideration when using commercial 3rd party closed-source TRNGs, if they have been "whitened", when doing the independent verification of the Statistical test tools.
Another add-on from lkcl, when queried about a link to the STS tests. Thanks Luke!
https://github.com/kravietz/nist-sts
it was a pig to find even 15 years ago, and.. ah! here we go: https://csrc.nist.gov/projects/random-bit-generation/documentation-and-software
that's the original download page.
lkcl pointed me at this when we were discussing the TRNG that I implemented in Microwatt.
Regarding the dieharder results on your wiki page, my comment would be that 536MiB of random data is nowhere near enough for a full dieharder run (with -a): look at how many times your input file got rewound. In particular for the marsaglia_tsang_gcd test it got rewound several times just for that test, so it's no wonder it reported failure. I believe you need something like 200-250GB of random data for a full test without rewinding, though you could instead use the flag to rewind at the beginning of each test (-s 1) and then I believe 10GB should be enough (for -a with the default parameters).
Thanks, yes. I should update the page with the latest results; they just take an eternity to run. We have long runs that are done on about 150 GiB of data with no rewinding, but they take a month to run. We also have 'fast' evaluation data sets that take a few days and total up to a few GiB, where we drill into a couple of tests that are known to cause us trouble.
The 500 MiB results are in fact just quick 'smoke tests' -- if we can't pass with that, we don't do the longer runs. It's just that... well, the test is still running, so we haven't been able to update the wiki with final results from the big run yet 😅
Another note from lkcl, as below:
i learned last week that Paul Mackerras also implemented a TRNG, in microwatt. i believe it to be a different FPGA algorithm which is where it gets interesting, in that you can now (both) do comparative analysis and potentially eliminate flawed tests.
in speaking with Paul i learned that DIEHARD(ER) does not contain the absolutely critical test-of-test-of-tests, which STS specifically does.
i recalled a question that you asked me, "how do we do long-term output analysis" and i realised, retrospectively, that the answer is, "use STS".
as you can see from the LibreSOC mailing list archive post above i explain a bit more about how STS does long-term histogram analysis of the individual runs (analysis of batches of 100,000 or 1,000,000 bits).
this is CRITICALLY important to do yet neither diehard nor dieharder do it.
this would explain why you even asked the question: i had forgotten (because it has been 15 years) that diehard does not do histogram analysis, and it looks, unfortunately, like the authors of dieharder have "extracted" STS's spot-checks but completely misunderstood why the histogram testing is a critical part of the process and so left that out.
i cannot overemphasise enough how important these tests-of-tests-of-tests are. CSRC really know what they are doing; the STS paper is especially important to read, and from it you (and readers / independent reviewers) will appreciate why any "weakness" must be viewed as a "catastrophic failure", and also why diehard and dieharder alone are initially useful (to save time through early elimination of faulty candidate implementations) but are ultimately, on their own, inadequate.
Dieharder does in fact do a Kolmogorov–Smirnov test on the p-values from 100 or 1000 (more if you use -m) repeats that it does of each test and prints the overall p-value for the KS test. In my opinion this is a better and more accurate assessment than the kind of analysis that STS does, which applies a threshold (for the 99% confidence level) to each individual test p-value to get a pass/fail for each individual test, and then applies a threshold to the number of passes to determine overall pass/fail. That second threshold is set at a level where a truly random source will fail 0.13% of tests. STS does not fail tests for having too many individual passes, only for having too many individual fails.
Dieharder computes the overall p-value for the KS test on the individual test p-values, which is expected to be uniformly distributed in [0, 1] for a truly random source. It applies both upper and lower thresholds on that overall p-value at levels where a truly random source will give a "weak" result 1% of the time and a "failed" result one time in a million.
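A sketch of that style of aggregation in Python (requires scipy; the thresholds are illustrative, chosen only to match the description above loosely, and the input here is a stand-in for real per-run test p-values):

```python
import numpy as np
from scipy.stats import kstest

per_run_p = np.random.random(100)             # stand-in for 100 repeats of one test
overall_p = kstest(per_run_p, "uniform").pvalue

# Illustrative thresholds in the spirit of the description above.
if overall_p < 1e-6:
    verdict = "FAILED"
elif overall_p < 0.005 or overall_p > 0.995:
    verdict = "WEAK"
else:
    verdict = "PASSED"
print(overall_p, verdict)
```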
Thanks for the insights. I had a quick question about the dieharder code base, if you've taken a look at it. It seems they use gsl_rng_uniform() (or some equivalent variant of it) to generate random numbers as "double precision" floating point. The methodology looks like it takes a binary number, casts it to a double, and then divides it by the maximum value of that integer, to arrive at a fractional number from zero to one.
However, double precision isn't exact at representing every integer -- even though the mantissa in double precision is 53 bits, I have to imagine that with a large enough sample size you're going to run into numerical errors that end up biasing the result, or worse yet, end up whitening the TRNG by masking patterns that match the rounding-precision patterns built into the format.
However, I could completely also believe that my concern is moot because maybe it's provable that any 32 bit integer can be fully represented without loss of precision within a double by doing this. Do you know if this is the case?
I also have concerns because the "lagged sums" test works by summing a million double-precision values, and then dividing by a million -- so it seems this should be more sensitive to errors in the MSBs, and could shed sensitivity to biases in the LSBs.
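A quick illustration of the accumulation concern in plain Python: summing a million doubles naively accumulates rounding error that a compensated sum avoids. This says nothing about dieharder's actual implementation; it just shows the effect exists at this scale.

```python
import math
import random

random.seed(0)
xs = [random.random() for _ in range(1_000_000)]
naive = sum(xs) / len(xs)
exact = math.fsum(xs) / len(xs)        # correctly-rounded sum
print(naive - exact)                   # tiny but typically nonzero
```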
However, I could completely also believe that my concern is moot because maybe it's provable that any 32 bit integer can be fully represented without loss of precision within a double by doing this. Do you know if this is the case?
Yes, it's provable that a 32-bit int fits without any loss of information into a double, assuming the underlying float standard is IEEE 754 in its current version (i.e. the latest revision) and there are no flaws in the IEEE 754 implementation (which is actually nearly impossible to find - see some examples in this thread).
I can't comment on the other points raised as I'm not versed enough in these topics.
That's good to know. To clarify, I understand you mean that an integer from 0 to 2^32-1 has a precise representation as a double, without precision loss. However, you are not commenting on what the precision impact might be if you took that integer and then divided it by MAXINT to try to get a number from 0 to 1.0 (as the gsl library does), correct?
However, you are not commenting on what the precision impact might be if you took that integer and then divided it by MAXINT to try to get a number from 0 to 1.0 (as the gsl library does), correct?
Correct - I didn't comment on that part. Actually, I'd expect (though I can't prove it) that there is some precision loss, despite there being about the same number of representable doubles in the 0..1 interval as in 1..inf; in other words, the average density of representable numbers in 0..1 is significantly higher than, e.g., in 1..2.
On the other hand, specifically about "whitening": I think it's definitely something to have concerns about, but maybe not that problematic, assuming the computation is the only thing running on the given FPU (or, if switching back and forth between apps, that the switching is guaranteed to be purely random), that the FPU is set to mitigate any computational biases (at least alternating tie-breaking as the rounding mode), and that the underlying hardware is known to have a correct IEEE 754 implementation.
yah...rounding modes. 🤦
TRNG has reached MVP, I think... here's a link to all the docs:
https://github.com/betrusted-io/betrusted-wiki/wiki#trng-chronicles
Since this thread is getting unwieldy, I am going to close it and if there are further issues please open a new, more targeted issue.