cyberphone / json-canonicalization

JSON Canonicalization Scheme (JCS)

Provide bandwidth-free ES6 testdata #16

Closed: dsnet closed this issue 3 years ago

dsnet commented 3 years ago

This is an alternative to #15.

I'd like to write a test that ensures formatting is canonical for a large suite of numbers, but require no network bandwidth when the test passes.

For the set of floating point numbers chosen in es6testfile100m.zip, can the sequence of input numbers be generated with a simple program?

If so, we can specify a simple program that generates exactly those numbers, have the test produce the equivalent of es6testfile100m.txt, and hash it. If the hash matches, we have confidence that the implementation passes. If it does not, we have to download the test file to figure out which entry differed.

As far as I can tell, there seems to be a pattern for how the first 1144 entries were produced, but everything afterwards appears random. Do we know how the numbers after that point were generated?
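For concreteness, here is a minimal sketch of that hash-and-compare test. `nextNumber` and the pinned digest are placeholders, not the actual generator or hashes discussed later in this thread, and the per-line serialization would have to match the reference file exactly:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// nextNumber stands in for a deterministic generator that yields the same
// float64 sequence on every platform; the real generator is discussed below.
func nextNumber(i int) float64 {
	return float64(i)
}

func main() {
	const entries = 1000
	h := sha256.New()
	for i := 0; i < entries; i++ {
		// Serialize one number per line. A real test would use the exact
		// line format of es6testfile100m.txt, which is elided here.
		line := strconv.FormatFloat(nextNumber(i), 'g', -1, 64) + "\n"
		h.Write([]byte(line))
	}
	got := hex.EncodeToString(h.Sum(nil))
	want := "<pinned SHA-256 of the known-good output>"
	if got != want {
		fmt.Println("hash mismatch: download the reference file to find the differing entry")
	}
}
```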

cyberphone commented 3 years ago

Yes, it is indeed possible to do what you propose. All it requires is a pseudo-random number generator using a known algorithm, so that all testers get the same values. I'm personally bogged down with related work, which is why I haven't been able to make progress on this. The original file was created with a cryptographically secure random generator, so the result is not repeatable. I believe 100M pseudo-random 64-bit numbers plus the special cases would be entirely sufficient.
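As an illustration of that point, any PRNG with a fixed, published seed reproduces its output bit for bit on every tester's machine. A trivial Go example (math/rand is used here only for brevity; it is not the construction adopted later in the thread):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// A fixed seed plus a specified algorithm means every tester
	// reproduces exactly the same sequence of 64-bit patterns.
	r := rand.New(rand.NewSource(1))
	for i := 0; i < 3; i++ {
		fmt.Printf("%016x\n", r.Uint64())
	}
}
```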

dsnet commented 3 years ago

Here's a short Go program that deterministically generates an es6testfile100m.txt file. The PRNG is based on SHA-256, since we need a hash anyway.

The generator (i.e., the next function) is ~65 lines (~20 for the code itself and ~45 for the global state).

The first 1114 entries of the output are identical to the current es6testfile100m.zip dataset.
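As a rough sketch of the general construction (SHA-256 over an incrementing counter, with eight digest bytes reinterpreted as float64 bits), not the actual state machine of the program above:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// next derives the i-th pseudo-random float64 by hashing a counter.
// This only illustrates the SHA-256-as-PRNG idea; the real generator's
// state handling and treatment of special values differ.
func next(counter uint64) float64 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], counter)
	sum := sha256.Sum256(buf[:])
	bits := binary.BigEndian.Uint64(sum[:8])
	f := math.Float64frombits(bits)
	if math.IsNaN(f) || math.IsInf(f, 0) {
		// The real generator also has to skip or remap non-finite patterns.
		return 0
	}
	return f
}

func main() {
	for i := uint64(0); i < 3; i++ {
		fmt.Println(next(i))
	}
}
```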

Here are the SHA-256 hashes of various lengths of outputs:

| Entries | File size (bytes) | File SHA-256 hash |
|---------|-------------------|-------------------|
| 1k | 38054 | ace079ffc98dfc66de4a1ea503baa3fd21dcae86c7fc1a4b7470715c825737f0 |
| 10k | 401124 | e1aead772d79a53df95289caf42f04b0f4fe1cf70058040e27bbd8f03a78b11c |
| 100k | 4033821 | ad12990add6d0b303e356a7aef76c3249ed00ab870fd01ea5d3366630edb48ba |
| 1m | 40359517 | 2b567bd9e82257b5b4ed2bec3e0ecc910722a8566ef0538d0a348c89bf82b9f1 |
| 10m | 403632090 | e48ee378494fa771a9fa109b1b52825cf30bdf4e59601dfc8b4895322d805a8f |
| 100m | 4036328199 | bed4cf9a666be044bbbe243f3465b666d3b9e1def461932f451aad5ad8c07324 |

Using compress/gzip, the compressed file is 2081257067 bytes (1.94 GiB), which should be just below the 2 GiB limit for GitHub releases (to satisfy #15).
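A sketch of producing the compressed artifact by streaming the generated lines straight through compress/gzip, so the uncompressed ~4 GB file never needs to exist on disk; the output file name and line contents below are placeholders:

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Create("es6testfile100m.txt.gz") // output name is illustrative
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	zw := gzip.NewWriter(f)
	w := bufio.NewWriter(zw)
	for i := 0; i < 1000; i++ { // the real file has 100M entries
		fmt.Fprintf(w, "%d\n", i) // placeholder line; real lines come from the generator
	}
	if err := w.Flush(); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}
}
```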

dsnet commented 3 years ago

Here's a better generator: https://play.golang.org/p/yMVOf6kqS27

The main difference is that it incorporates all of the entries from Appendix B of the RFC.
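To picture what incorporating the Appendix B entries means, a fixed list of special IEEE-754 bit patterns can simply be emitted ahead of the pseudo-random stream. The two patterns below are illustrative samples only, not the appendix's full table:

```go
package main

import (
	"fmt"
	"math"
)

// specialBits holds hand-picked IEEE-754 bit patterns emitted first,
// before the pseudo-random portion of the sequence. Only two sample
// values are shown; the full Appendix B table is longer.
var specialBits = []uint64{
	0x0000000000000000, // +0
	0x3ff0000000000000, // 1
}

func main() {
	for _, bits := range specialBits {
		fmt.Println(math.Float64frombits(bits))
	}
	// ...followed by the pseudo-random numbers from the generator above.
}
```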

| Entries | File size (bytes) | File SHA-256 hash |
|---------|-------------------|-------------------|
| 1k | 38054 | be18b62b6f69cdab33a7e0dae0d9cfa869fda80ddc712221570f9f40a5878687 |
| 10k | 401124 | b9f7a8e75ef22a835685a52ccba7f7d6bdc99e34b010992cbc5864cd12be6892 |
| 100k | 4033821 | 22776e6d4b49fa294a0d0f349268e5c28808fe7e0cb2bcbe28f63894e494d4c7 |
| 1m | 40359517 | 49415fee2c56c77864931bd3624faad425c3c577d6d74e89a83bc725506dad16 |
| 10m | 403632090 | b9f8a44a91d46813b21b9602e72f112613c91408db0b8341fb94603d9db135e0 |
| 100m | 4036328199 | 0f7dda6b0837dde083c5d6b896f7d62340c8a2415b0c7121d83145e08a755272 |

EDIT: I ported the above Go program to node.js and verified that the results are identical on two different architectures (Intel i7 and Apple M1).

cyberphone commented 3 years ago

Thanx for your work!