BenWestgate opened 1 year ago
Okay, I solved a great problem:
password=master_seed
indeed has the flaw that zero padding it without changing the header won't change the derived key.
Which is a problem because it changes the master xprv (since bip32 has them reversed).
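The zero-padding flaw is easy to demonstrate: HMAC pads keys shorter than the hash block size with zero bytes, so a seed and the same seed with a trailing zero byte produce identical output. A minimal sketch (illustrative values only):

```python
import hmac
import hashlib

# HMAC-SHA512 zero-pads keys shorter than its 128-byte block size,
# so a key and the same key with a trailing zero byte collide:
seed = bytes(16)  # stand-in for a 16-byte master_seed
mac1 = hmac.new(seed, b"data", hashlib.sha512).digest()
mac2 = hmac.new(seed + b"\x00", b"data", hashlib.sha512).digest()
assert mac1 == mac2  # same derived key despite different "passwords"
```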
While len(master_seed) in the salt was my original quick fix, and perhaps it still belongs since it's data revealed in the shares, what would be even better is adding the master pubkey fingerprint after the codex32 secret header.
salt=header||fingerprint
A salt is supposed to include a pseudorandom component anyway. And this one is expensive.
Alternatively, salt=master_seed password=header avoids the flaw.
Salts aren't supposed to be reused (they would be here when re-sharing a secret).
Adding len(master_seed) to salt doesn't fix its reuse across different passwords (master_seeds).
Adding the fingerprint is better: the salt will always be unique per unique master_seed OR header.
Wow, this padding in HMAC is a huge footgun. Good catch. I agree with you then that we should only use fixed/non-secret data for the "key" and put our actual key data into "data".
It also sounds like you have good justification from RFCs for all of the choices you're making, which is great because this has gotten complicated enough that it's quite hard to vet.
This is generalizing the function from turning 1 codex32 string into a deterministic share set, to turning any
codex32_strings_provided <= k
into an auditable deterministically generated and shuffled share set.
Interesting. So, there is a danger here that the user can abuse this to create multiple sets that share some shares (but not others). I don't mind supporting this but we should have some pretty strong warnings saying not to do this, and if they want to have the same person participate in multiple sharings, they should give that person multiple shares.
There is also a danger that users would do this accidentally; they generate some shares from s as expected, then later they don't have s but do have a bunch of shares, so they try to regenerate the original shares from that. Or maybe they do have s, but also have a, and think "I've got this extra data, might as well enter it right?" and then accidentally generate new shares instead of regenerating old ones.
Wow, this padding in HMAC is a huge footgun. Good catch. I agree with you then that we should only use fixed/non-secret data for the "key" and put our actual key data into "data".
Putting len(master_seed) and a bip32 fingerprint into the salt also solves the password=master_seed 0 padding issue. And ticks more RFC compliance boxes (pseudorandom salt component, salt can be public, salt is never reused across different passwords, password remains secret.).
Interesting. So, there is a danger here that the user can abuse this to create multiple sets that share some shares (but not others).
Yeah I agree, changing the default identifier
could avoid this danger.
It's not the master_seed
that defines whether shares are compatible but rather the threshold independent codex32 strings used to interpolate the rest of the backup.
If only the codex32 secret is passed, identifier can default to use the bip32 fingerprint.
If shares are passed, a hash of k
shares data (ie data of share a, b and c if k = 3) is a more powerful ID for preventing improper share combinations.
This creates a circular reference if the identifier is in salt and determines the share data for a master_seed. But then the share data provided needs to determine the identifier.
I'll have to think about this more. But a hash of k shares data is definitely the safest default ID for backups made by specifying shares and their data.
I don't mind supporting this but we should have some pretty strong warnings saying not to do this, and if they want to have the same person participate in multiple sharings, they should give that person multiple shares.
Absolutely, and if the default output involves "re-labeling" the data provided, then the need to make a new share set will be obvious. It's the identifier (salt?) reuse that is dangerous.
There is also a danger that users would do this accidentally; they generate some shares from s as expected, then later they don't have s but do have a bunch of shares, so they try to regenerate the original shares from that.
If they have k shares they can already recover everything.
The function is called generate_NEW_shares(), it's not for recovering the original set, we already have that in Python and Rust with ms32_interpolate.
You shouldn't be able to generate a new share set without providing the codex32 secret. If you pass k shares you can calculate the secret but have no degrees of freedom to create any new data, so it'd just set the correct identifier and deterministic shuffle for displaying the dependent shares.
Or maybe they do have s, but also have a, and think "I've got this extra data, might as well enter it right?" and then accidentally generate new shares instead of regenerating old ones.
Titling the functions:
"Recover Old Shares" and "Generate New Backup"
Is very important to prevent these mistakes, but if you only pass codex32 string data from an existing backup it made, it ought to produce the exact same backup, as it's deterministic. Unless you override the identifier. Then it uses the provided data and fills in the rest. (That's what it's doing in the first case too but you provided identical data as it was going to generate had you not provided it.)
Only if you give it share data that's not from the default output of a given master_seed
and header
will it make something unique. (And relabel it)
This will make more sense if I try to code it first.
Here's my first stab at a reworked Generating Shares, For a fresh master seed:
def fresh_master_seed(bitcoin_core_entropy, user_entropy='', seed_length=16, k='2', identifier=''):
    """Derive a fresh master seed of seed_length bytes with optional user-provided entropy."""
    import hmac
    import hashlib
    share_list = []
    share_payload = []
    if seed_length < 16 or seed_length > 64:
        return None
    if int(k) < 2 or int(k) > 9:
        return None
    letters = sorted(filter(lambda c: c.isalpha(), CHARSET))
    derived_key = hashlib.scrypt(password=bytes(user_entropy + str(seed_length) + k + identifier, "utf"),
                                 salt=bytes(bitcoin_core_entropy, "utf"), n=2 ** 20, r=8, p=1,
                                 maxmem=1025 ** 3, dklen=64)
    relabel = not identifier
    identifier = identifier or 'temp'  # placeholder until we have better info
    for i in range(int(k)):
        # generate random initial shares
        share_index = letters[i]
        header = 'ms1' + k + identifier + share_index
        info = str(seed_length) + '-byte share with header: ' + header
        share_payload.append(hmac.new(key=derived_key, msg=bytes(info, "utf"),
                                      digestmod=hashlib.sha512).digest()[:seed_length])
        share_list += [encode('ms1', k, identifier, share_index, share_payload[i])]
    master_seed = recover_master_seed(share_list)
    if relabel:
        salt = b''.join(share_payload)
        identifier_data = hashlib.pbkdf2_hmac('sha512', password=master_seed, salt=salt,
                                              iterations=2048, dklen=3)
        new_identifier = ''.join([CHARSET[d] for d in convertbits(identifier_data, 8, 5)[:4]])
        share_list = relabel_shares('ms1', share_list, new_identifier)
    return master_seed, share_list
I've not implemented shuffle yet, so it returns master_seed and the k random shares generated at indices a, c, ... etc. This default identifier was chosen to give incompatible share sets unique IDs, even if they belong to the same master_seed.
I think the bip32 master pubkey fingerprint is more useful (for fresh master seeds, NOT re-shares); I just didn't know which bip32 library to use, or whether to attempt it myself with python's cryptography library.
This identifier is derived by pbkdf2 using BIP39's cost parameters so that valid shares can't be checked for correctness against the identifier faster than against an address. If you like this, I'll make it the default for re-shares when identifier is not specified, and suggest using the fingerprint for reshare=False (when the user has an existing bip32 master_seed or extended private master key and wants to encode it in a codex32 share set).
Identifier.
If you want to use a temporary identifier like this, I'd use 1111 rather than temp. The current code, I believe, will produce the same share data for a user-provided identifier of temp as it will for a non-user-provided identifier.
But instead I'd just split up these cases; encode one of:
- Length: + hex-encoded 2-digit length with 0x prefix + ID: + identifier + Threshold: + k
- Same as above, with "ID: + identifier" replaced with NO IDEN.
Then in both cases you have a string of the same length where all fields can be found at the same offsets and there is no way that one case can be interpreted as the other.
PBKDF.
I'd use the full shares as salt, not just the payload. In particular you want to make sure k is somehow in there so that the total length is unambiguous. (Currently I think you can get the same salt if you have k = 4 and a 128-bit secret as if k = 2 and a 256-bit secret. Or at least, you need to depend on the security of the HMAC to guarantee this doesn't happen, which is a more complicated argument than just having the length be unambiguous.)
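The ambiguity is concrete: joining the payloads alone cannot distinguish the two cases. A sketch with illustrative byte values:

```python
# k = 4 with 128-bit payloads vs k = 2 with 256-bit payloads:
# both concatenate to the same 64-byte salt if the bytes line up.
payloads_k4 = [bytes.fromhex("11" * 16)] * 4   # four 16-byte payloads
payloads_k2 = [bytes.fromhex("11" * 32)] * 2   # two 32-byte payloads
assert b"".join(payloads_k4) == b"".join(payloads_k2)
```

Binding k and the length into the salt text removes the collision.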
Other than that, this looks good to me!
Re moving the index from position 6 to position 1, @roconnor-blockstream argues on IRC that
Personally I think (2) is a big deal -- when doing the checksum worksheet by hand, it's refreshing to get a "head start" on the worksheet by having the first several lines all give the same characters. It's faster, it reassures you that you're doing the right thing (or at least, a consistent thing).
Also there is the argument that we already printed a bunch of codex32 books and can't change it :).
Fair enough, the constants autocomplete anyways on electronic recovery.
Identifier.
If you want to use a temporary identifier like this, I'd use 1111 rather than temp. The current code, I believe, will produce the same share data for a user-provided identifier of temp as it will for a non-user-provided identifier.
Decode would fail with identifier = 1111 due to:
if not all(x in CHARSET for x in bech[pos + 1:]):
    return (None, None, None, None, None)
If the user provides an identifier parameter of temp, that will be the final identifier because relabel is assigned False.
identifier = 'temp' if not identifier
relabel = True if not identifier else False
if relabel:
Should the functions below not validate the charset?
def ms32_decode(bech):
    """Validate a MS32 string, and determine HRP and data."""
    if ((any(ord(x) < 33 or ord(x) > 126 for x in bech)) or
            (bech.lower() != bech and bech.upper() != bech)):
        return (None, None, None, None, None)
    bech = bech.lower()
    pos = bech.rfind('1')
    if pos < 1 or pos + 46 > len(bech):
        return (None, None, None, None, None)
    if not all(x in CHARSET for x in bech[pos + 1:]):
        return (None, None, None, None, None)
    hrp = bech[:pos]
    k = bech[pos + 1]
    if k == "1" or not k.isdigit():
        return (None, None, None, None, None)
    identifier = bech[pos + 2:pos + 6]
    share_index = bech[pos + 6]
    if k == "0" and share_index != "s":
        return (None, None, None, None, None)
    data = [CHARSET.find(x) for x in bech[pos + 1:]]
    checksum_length = 13 if len(data) < 95 else 15
    if not ms32_verify_checksum(data):
        return (None, None, None, None, None)
    return (hrp, k, identifier, share_index, data[:-checksum_length])
def decode(hrp, codex_str):
    """Decode a codex32 string."""
    hrpgot, k, identifier, share_index, data = ms32_decode(codex_str)
    if hrpgot != hrp:
        return (None, None, None, None)
    decoded = convertbits(data[6:], 5, 8, False)
    if decoded is None or len(decoded) < 16 or len(decoded) > 64:
        return (None, None, None, None)
    if k == "1":
        return (None, None, None, None)
    return (k, identifier, share_index, decoded)
A way to recover master_seed without a temporary identifier is to directly ms32_recover() the share payloads converted to lists of base32 ints, rather than first create shares and then decode() those to recover. I can use ms32_payload_list += [convertbits(share_payload[i], 8, 5, False)] to convert bytes to lists of base 32 ints. Would ms32_recover(ms32_list) work without the identifier in the list? This may be more elegant, do you agree?
But an advantage to keeping it encode(payload[i]) is that if we make the padding a bch binary checksum, the new generated shares will get it, as padding is handled by encode().
If the user provides an identifier parameter of temp, that will be the final identifier because relabel is assigned False.
Yes, and despite having a different identifier, the share data will be the same.
Decode would fail with identifier = 1111 due to:
It should fail because what you're hashing here shouldn't be interpretable as valid share.
A way to recover master_seed without a temporary identifier is to directly ms32_recover() the share payloads converted to lists of base32 ints rather than first create shares and then decode() those to recover. This might be more elegant; I can use ms32_payload_list += [convertbits(share_payload[i], 8, 5, False)] to convert bytes to lists of base 32 ints.
Yes, I think that would be much easier to follow. As written, you create a temporary identifier, stick this into info, stick that into an HMAC, use that to get share payloads, then recover a master seed from them. It took me a while to confirm that the "temporary" identifier was indeed determining all of the data that was being produced, and doing so in a way that would collide with non-temporary identifiers.
But instead I'd just split up these cases; encode one of:
- Length: + hex-encoded 2-digit length with 0x prefix + ID: + identifier + Threshold: + k
- Same as above, with "ID: + identifier" replaced with NO IDEN.
Then in both cases you have a string of the same length where all fields can be found at the same offsets and there is no way that one case can be interpreted as the other.
This is referring to this line?
info = str(seed_length)+'-byte share with header: '+header
Can you give an example info= string for each case?
The problem isn't the HMAC, message= is fine with variable length; the problem is decode() needs a valid identifier or it fails. I call decode() in my brute force error correction so it should fail on this too.
Can you give an example info= string for each case?
Length:0x20ID:TEMPThreshold:2
Length:0x10NO IDENThreshold:3
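A sketch of a helper that could produce these strings (entropy_header is a hypothetical name; the field syntax is the one proposed above). Note that ID: plus a 4-character identifier and NO IDEN are both 7 characters, so every field sits at a fixed offset in both cases:

```python
def entropy_header(seed_length, k, ident=None):
    # 'ID:' + 4-char identifier and 'NO IDEN' are both 7 chars wide,
    # so the two cases can never be parsed as one another.
    id_field = ("ID:" + ident.upper()) if ident else "NO IDEN"
    return "Length:" + format(seed_length, "#04x") + id_field + "Threshold:" + str(k)
```

For example, entropy_header(32, 2, "temp") and entropy_header(16, 3) reproduce the two strings quoted above.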
The problem isn't the HMAC, message= is fine with variable length; the problem is decode() needs a valid identifier or it fails. I call decode() in my brute force error correction so it should fail on this too.
The code you posted doesn't call decode anywhere that I see.
If the user provides an identifier parameter of temp, that will be the final identifier because relabel is assigned False.
Yes, and despite having a different identifier, the share data will be the same.
They will be different because the temporary identifier isn't assigned until after derived_key; identifier is still '' if it's not provided, and only then is it assigned temp. Whereas if you passed temp, temp would be used in the scrypt KDF and give a different derived_key:
derived_key = hashlib.scrypt(password=bytes(user_entropy + str(seed_length+k+identifier), "utf"),
                             salt=bytes(bitcoin_core_entropy, "utf"), n=2 ** 20, r=8, p=1, maxmem=1025 ** 3, dklen=64)
identifier = 'temp' if not identifier # place holder until we have better info
relabel = True if not identifier else False
They will be different because the temporary identifier isn't assigned until after derived_key; identifier is still '' if it's not provided, and only then is it assigned temp. Whereas if you passed temp, temp would be used in the scrypt KDF and give a different derived_key.
Ah, yes, I see. I didn't really look at derived_key. But now that I do, I think we should also change it to include (a) user_entropy with a length prefix, (b) seed_length in a fixed-width format, and (c) identifier (or NO IDEN) in some sort of fixed-width format.
I still think we should avoid using an actual identifier for identifier if none is provided.
TBH I have no idea what Python will do with str(seed_length + k + identifier) if identifier is None or if seed_length is an integer. The text of a specification should clarify this.
Yes, I think that would be much easier to follow.
I will try master_seed = bytes(convertbits(ms32_recover(ms32_payload_list), 5, 8, False))
in a few hours.
I will also make a general function ms32_convert_bytes() that converts from bytes to ms32_payloads and adds the padding checksum. Then that can be used in encode() as well as in constructing the ms32_payload_list, and in converting bip32 fingerprints or the share payload KDF hash into bech32 identifiers.
I'll also push the entire library I have to reference/python-codex32/src/lib.py so you can look at it and get oriented or make progress while I'm afk.
PBKDF.
I'd use the full shares as salt, not just the payload. In particular you want to make sure k is somehow in there so that the total length is unambiguous.
We only have full shares with identifier = 'temp'
at this point. This indent block is for finding the unique ID of the payload set. So all compatible sets will always get the same id.
Are you suggesting using the placeholder?
ALSO, getting the default unique identifier for a distinct set of k shares should be in its own function, as this is not the only place we'll need it. (An existing master seed needs it as well if parameter reshare = True.)
(Currently I think you can get the same salt if you have k = 4 and a 128-bit secret as if k = 2 and a 256-bit secret. Or at least, you need to depend on the security of the HMAC to guarantee this doesn't happen, which is a more complicated argument than just having the length be unambiguous.)
That's a good point, those shouldn't give the same identifier because they're incompatible sets.
Here's what you define as a set that can be combined:
In order to recover a master seed, one needs a set of valid codex32 shares such that:
- All shares have the same threshold value, the same identifier, and the same length.
- All of the share index values are distinct.
- The number of codex32 shares is exactly equal to the (common) threshold value.
Here's what we have:
- A known threshold value, an unknown identifier, a known length.
- Known indexes; I made these always be the first letters, as they must be consistent to give the same identifier for a combinable set.
- A number of codex32 payloads exactly equal to the (common) threshold value.
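The quoted recovery conditions can be checked mechanically. A sketch (combinable is a hypothetical helper name; shares are (k, identifier, share_index, payload) tuples as decode() returns them):

```python
def combinable(shares):
    # shares: list of (k, identifier, share_index, payload) tuples
    thresholds = {s[0] for s in shares}
    identifiers = {s[1] for s in shares}
    indexes = [s[2] for s in shares]
    lengths = {len(s[3]) for s in shares}
    return (len(thresholds) == 1 and len(identifiers) == 1
            and len(lengths) == 1
            and len(set(indexes)) == len(indexes)          # all indexes distinct
            and len(shares) == int(next(iter(thresholds))))  # exactly k shares
```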
Let's write a function that takes k, seed_length, and a list of indexed payloads with len(indexed_payloads) == k, and returns a unique bech32 identifier that will always be identical if the indexed_payloads belong to the same order-k polynomial and seed_length, regardless of which indexed_payloads are passed. So it will first ms32_interpolate to get the ms32_indexed_payload for index a, c, d ... etc. until it has k many, then convert to bytes and PBKDF2 them along with k and seed_length:
For example:
salt=bytes('Threshold: '+k+', Length: '+str(seed_len)+'-bytes, Payloads: ', 'utf')+b''.join(fixed_index_payloads)
It may be better that the first known index used is 's' followed by 'a', 'c', 'd' ... etc., as we cannot set this identifier without knowing master_seed, because the data that goes into the hash defines a master_seed and a particular combinable indexed_payload set of it. The codex32 secret is a codex32 string like any other, so there's no reason this shouldn't also produce an ID for k = 0, a length, and the 's' indexed_payload master_seed.
What do you think?
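For concreteness, a sketch of such a function using the salt format above. The name default_identifier and the final bit-extraction are illustrative; a real implementation would reuse the library's CHARSET and convertbits:

```python
import hashlib

CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"

def default_identifier(master_seed, k, seed_len, indexed_payloads):
    # Bind threshold and length into the salt so incompatible sets
    # (e.g. k=4/128-bit vs k=2/256-bit) can never share an identifier.
    salt = bytes("Threshold: " + k + ", Length: " + str(seed_len)
                 + "-bytes, Payloads: ", "utf") + b"".join(indexed_payloads)
    digest = hashlib.pbkdf2_hmac("sha512", master_seed, salt, 2048, dklen=3)
    # Take the top 20 of the 24 derived bits as four bech32 characters.
    bits = int.from_bytes(digest, "big")
    return "".join(CHARSET[(bits >> shift) & 31] for shift in (19, 14, 9, 4))
```

The 2048 iterations match BIP39's PBKDF2 cost, per the earlier rationale that checking an identifier shouldn't be cheaper than checking an address.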
We only have full shares with identifier = 'temp' at this point. This indent block is for finding the unique ID of the payload set. So all compatible sets will always get the same id. Are you suggesting using the placeholder?
Ahh, right. I am tempted to use the placeholder, yeah.
In fact, I would define an "entropy header" up-front as the Length:0x20ID:TEMPThreshold:2
string (I'm not married to the exact syntax, but what I do want is (a) no ambiguity between the identifier/no identifier case, (b) all fixed width fields) and then use this repeatedly. Both to derive the share data and then again to salt the PBKDF.
Let's write a function that takes k, seed_length, and a list of indexed payloads of len(indexed_payloads) == k and returns a unique bech32 identifier that will Always be identical if the indexed_payloads belong to the same order k polynomial and seed_length.
Ah, yep, this sounds like a good abstraction. I think it should also take an optional identifier so that it can compute and use the "entropy header".
So it will first ms32_interpolate to get ms32_indexed_payload for index a, c, d ... etc until it has k many then convert to bytes to PBKDF2 them along with k and seed_length:
Nice. I was thinking "just use s" but you're correct that we need to encode k shares. I'm indifferent as to whether we should start with s or not. Probably that's a good idea since it emphasizes that knowing this data requires that you know s, and maybe encourages implementors to be a little more careful with it.
Can you give an example info= string for each case?
Length:0x20ID:TEMPThreshold:2
Length:0x10NO IDENThreshold:3
Needs share_index or all payloads will be identical.
On fixed length:
We learned the other day that it's password in KDF and key in HMAC that are zero padded (or pre-hashed) to the hash function block size. The salt is passed as the message and can be variable length with no issue.
Did you choose hex for the length because other languages are more likely to have the length available as hex? seed_length is an int in my function, and putting 'Length: 16', 'Length: 32', 'Length: 64' conveys the same info with fewer characters.
Here's a counter-proposal:
info = 'Share Length:'+str(seed_length)+'Threshold:'+k+'Index:'+share_index+'Indentifier:'+indentifier
The problem isn't the HMAC, message= is fine with variable length; the problem is decode() needs a valid identifier or it fails. I call decode() in my brute force error correction so it should fail on this too.
The code you posted doesn't call decode anywhere that I see.
encode() calls decode() to self-validate; sorry, I copied pieter's segwit_addr.py as closely as possible when I wrote this.
def encode(hrp, k, identifier, share_index, payload):
    """Encode a codex32 string"""
    if share_index.lower() == 's':  # add double sha256 hash byte to pad seeds
        checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    else:
        checksum = b''  # TODO: use a reed solomon or bch binary ECC code for padding.
    data = convertbits(payload + checksum, 8, 5, False)[:len(convertbits(payload, 8, 5))]
    ret = ms32_encode(hrp, [CHARSET.find(x.lower()) for x in k + identifier + share_index] + data)
    if decode(hrp, ret) == (None, None, None, None):
        return None
    return ret
Needs share_index or all payloads will be identical.
Ah, yep.
On fixed length: We learned the other day that it's password in KDF and key in HMAC that are zero padded (or pre-hashed) to the hash function block size. The salt is passed as the message and can be variable length with no issue.
Fixed length is not about padding. It's to ensure that the data fed into a hash function can be uniquely parsed, without any subtle or fragile analysis.
Did you choose Hex for the length because other languages are more likely to have the length available as hex? Seed_length is an int in my function and putting 'Length: 16', 'Length: 32', 'Length: 64' conveys the same info with less characters.
I chose hex because it's more familiar and easier to reason about than decimal. We could drop the 0x prefix. But I guess if our lengths are always between 16 and 64 inclusive, decimal is just as easy to use.
Here's a counter-proposal:
Sounds good! Except identifier is spelled wrong :).
encode() calls decode() to self-validate; sorry, I copied pieter's segwit_addr.py as closely as possible when I wrote this.
This makes sense -- but in general, if something is not intended to be a share, I really think that it should not parse as a share.
Ah, yes, I see. I didn't really look at derived_key. But now that I do, I think we should also change it to include (a) user_entropy with a length prefix, (b) seed_length in a fixed-width format, and (c) identifier (or NO IDEN) in some sort of fixed-width format.
If the user entropy is provided as a utf string, the length prefixing is unnecessary, 'password0' derives a different key from 'password00' and anything else they could type. If you think bytes should be able to be passed to this for some reason (what human speaks bytes?) then it may need a length prefix if it has the same padding/pre-hashing "feature/bug" as PBKDF2.
I'm quite fine with a long user_entropy getting pre-hashed (if scrypt indeed does this); in 99% of circumstances the bitcoin_core_entropy will be more random, unique and worth fully preserving, but in the worst case, the master xprv can be deserialized to swap password and salt without losing any information from either.
seed_length is already passed to scrypt, in that typo you found below:
password=bytes(user_entropy + str(seed_length) + k + identifier), "utf")
I still think we should avoid using an actual identifier for identifier if none is provided.
TBH I have no idea what Python will do with str(seed_length + k + identifier) if identifier is None or if seed_length is an integer. The text of a specification should clarify this.
If the user entropy is provided as a utf string, the length prefixing is unnecessary, 'password0' derives a different key from 'password00' and anything else they could type. If you think bytes should be able to be passed to this for some reason (what human speaks bytes?) then it may need a length prefix if it has the same padding/pre-hashing "feature/bug" as PBKDF2.
We should have a length prefix. Otherwise the parse-unambiguity of the data that we are parsing depends on the fixed-lengthedness of our other data, which is subtle and fragile. I also do not think the ambiguity should depend on any assumptions about user behavior.
We could also pre-hash the data, which would have the same effect.
We should have a length prefix. Otherwise the parse-unambiguity of the data that we are parsing depends on the fixed-lengthedness of our other data, which is subtle and fragile. I also do not think the ambiguity should depend on any assumptions about user behavior.
You would never "parse" password; it is simply the user's keyboard mashing or coinflips or dice rolls. It is forgotten after deriving derived_key, unless someone is auditing their wallet, in which case they'll need to re-enter it on a new wallet/device.
scrypt Parameters
The scrypt function takes several parameters. The passphrase P is typically a human-chosen password. The salt is normally uniquely and randomly generated [RFC4086].
If the parameters are unintuitive, it's harder to audit. Password is designed to accept human-chosen data.
We could also pre-hash the data, which would have the same effect.
https://datatracker.ietf.org/doc/html/rfc7914.html#section-6
scrypt behaves as pbkdf2 does: it will prehash passwords longer than the block length of HMAC-SHA256, which I think is 64 bytes, and it will pad shorter ones.
It's true you would never actually parse it. But if you're putting it into a hash function, it needs to be parseable. This is a general rule of thumb for designing cryptosystems, to ensure that it is impossible for two different things to be encoded the same way (and therefore result in the same hash).
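That rule of thumb is easy to demonstrate with a two-line counterexample; lp here is a hypothetical length-prefix encoder, not part of the proposal:

```python
def lp(field: bytes) -> bytes:
    # one-byte length prefix; enough for fields under 256 bytes
    assert len(field) < 256
    return bytes([len(field)]) + field

# bare concatenation is ambiguous: two different field pairs, one encoding
assert b"ab" + b"c" == b"a" + b"bc"
# length-prefixed concatenation parses uniquely
assert lp(b"ab") + lp(b"c") != lp(b"a") + lp(b"bc")
```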
In fact, I would define an "entropy header" up-front as the Length:0x20ID:TEMPThreshold:2 string (I'm not married to the exact syntax, but what I do want is (a) no ambiguity between the identifier/no identifier case, (b) all fixed width fields) and then use this repeatedly. Both to derive the share data and then again to salt the PBKDF.
Agreed, we should set this local variable up front and reuse it and/or pass it to the proposed get_identifier() function. Sure, it can be fixed length by filling in something of equal length when there is no identifier.
Nice. I was thinking "just use s" but you're correct that we need to encode k shares. I'm indifferent as to whether we should start with s or not. Probably that's a good idea since it emphasizes that knowing this data requires that you know s, and maybe encourages implementors to be a little more careful with it.
We're going to use 's' as the first share_index because it makes existing master seed easier. I'm also sick of parsing the sorted alphabet, so I'll be using CHARSET[i] to get the initial indexes for generating the set or computing the default identifier.
Sounds good! Except identifier is spelled wrong :).
I do this several times a day. Do you have a suggested alternative variable name, or should I just keep spell checking? (Implementers are going to typo it too if using a new language.) id or ident I've considered, but I always changed them back.
This makes sense -- but in general, if something is not intended to be a share, I really think that it should not parse as a share.
That's why we need a function that does general bytes to ms32 converting. I'll make it soon, as it gets used in at least 4 different places: fingerprint > bech32 id, unique combinable share set hash > bech32 id, indexed_payload > ms32_indexed_payload for ms32_recover(), and master_seed > ms32_indexed_payload for ms32_interpolate().
We're going to use 's' as the first share_index because it makes existing master seed easier. I'm also sick of parsing the sorted alphabet, so I'll be using CHARSET[i] to get the initial indexes for generating the set or computing the default identifier.
Sounds good to me.
I do this several times a day. Do you have a suggested alternative variable name or just keep spell checking. (Implementers are going to typo it too if using a new language.) id or ident i've considered but I always changed them back.
We could just misspell it, like referer
in HTTP ;).
I would suggest ident or id. Either is fine by me, and both are much easier to spell. If you can't decide, let's go with ident.
That's why we need a function that does general bytes to ms32 converting.
Yep, I think that's generally useful. rust-bech32 has such a function (well, an iterator adaptor). I'm not really sure what exists for Python.
I would suggest ident or id. Either is fine by me, and both are much easier to spell. If you can't decide, let's go with ident.
I will refactor to 'ident'. It saves characters; some of the lines exceed 100 characters for a single operation. I had not written those lines when I rejected it previously.
That's why we need a function that does general bytes to ms32 converting.
Yep, I think that's generally useful. rust-bech32 has such a function (well, an iterator adaptor). I'm not really sure what exists for Python.
convertbits(your_bytes_to_convert, 8, 5) does it. It doesn't care whether it's given bytes or lists of ints 0-255. I modified it from the segwit_addr.py version to accept any padding as valid rather than only 0s. But we need to take it to the next step by always applying ECC coding as padding if pad=True and tobits=5.
def convertbits(data, frombits, tobits, pad=True):
    """General power-of-2 base conversion."""
    acc = 0
    bits = 0
    ret = []
    maxv = (1 << tobits) - 1
    max_acc = (1 << (frombits + tobits - 1)) - 1
    for value in data:
        if value < 0 or (value >> frombits):
            return None
        acc = ((acc << frombits) | value) & max_acc
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            ret.append((acc >> bits) & maxv)
    if pad:
        if bits:
            ret.append((acc << (tobits - bits)) & maxv)
    elif bits >= frombits:
        return None
    return ret
Regarding the padding:
The pad_len = 1 bit case can truncate the 2-bit code.
I'd rather @apoelstra try to do this, otherwise I'm likely to miss quality 1, 2 or both. I think 1 has something to do with codex32 generator and padding generator not sharing coefficients? But as for 2, ms32_interpolate()ing the 2, 3 or 4 bit code and getting another valid code, I'm lost besides the phrase "linear codes" and would have to trial and error it.
The 1st and 2nd bit together should guarantee detecting 1 bit-error, 2 if sequential, in 128-bit data.
This is easy; you can, for example, just xor every bit together for bit 1, and xor every other bit for bit 2.
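A minimal sketch of that construction (`seed_bits` is a hypothetical list of 0/1 values, not a name from the thread):

```python
def two_bit_parity(seed_bits):
    """Bit 1 xors every bit together; bit 2 xors every other (even-indexed) bit."""
    bit1 = 0
    bit2 = 0
    for i, b in enumerate(seed_bits):
        bit1 ^= b
        if i % 2 == 0:
            bit2 ^= b
    return bit1, bit2

bits = [1, 0, 1, 1] * 32          # 128-bit toy payload
check = two_bit_parity(bits)

one_error = bits[:]
one_error[7] ^= 1                 # any single flip changes bit 1

burst = bits[:]
burst[10] ^= 1                    # two adjacent flips cancel in bit 1...
burst[11] ^= 1                    # ...but exactly one of them is even-indexed
```

So a single bit-error always flips bit 1, and a 2-bit burst always flips bit 2.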
The 3rd bit in combination with the first two should guarantee detecting 2 bit-errors, 3 if sequential and correcting 1 bit-error in 512-bit data.
I don't think this is possible because it would violate the Hamming bound (which is unlimited if you are trying to get distance 1 or 2, i.e. detect 1 error but correct none, but for distance >2 it implies that for length 2^n you need at least n parity bits ... so for length >512 you need 9 parity bits).
Similarly for your goals with the 4th bit.
I think 1 has something to do with codex32 generator and padding generator not sharing coefficients?
Well, the two codes are over different fields so there's no way they can "share coefficients" and really no way to algebraically relate them at all. By the Hamming bound again I definitely can't make the 2-bit code purely additive like you want. The best we can do is make it independent of the original code (which my simple parity code is) so that if you have a random error pattern that somehow tricks the real code, with probability 3/4 it won't trick the small code.
To confirm about correction:
3-bits cannot guarantee to correct a bit error in 512-bits?
4-bits cannot guarantee to correct a bit error in 256-bits?
If so then, what’s the most errors 3-bits can detect in 512 and 4-bits can detect in 256?
If there are design trade offs, I think we should optimize for burst error detection.
Detecting 2 random 1-bit errors is less useful than detecting 3 or 4 sequential ones. Since the codex32 error correction will make burst errors from the binary perspective when it overcorrects. Or the user will if they mark an erasure or substitute non-visually-similar characters.
I tried your formula and it looks like 4 bits should be able to correct a bit-error in 256-bit data.
The 3-bit code just needs to detect as many contiguous 3 bit-errors as possible.
By the Hamming bound again I definitely can't make the 2-bit code purely additive like you want.
meaning ms32_interpolate() will produce shares with invalid ecc padding?
Even the parity bit is not additive? In this way? (I'm going to start experimenting when I get home unless you're certain.)
I tried your formula and looks like 4-bits should be able to correct a bit-error in 256-bit data.
The actual formula is the Hamming bound. For binary codes q = 2, and n is the total length of your data including checksum (so 260). To correct a single error you need d = 3, so t = 1, and the formula simplifies to give a total code size of 2^260 / (1 + 260) = 2^251.97. Unfortunately not enough to fit 256 bits of data.
The approximate formula I gave was that for 256 = 2^8 you would need 8 bits to get distance 3, so even the approximate one suggests that this won't work.
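The arithmetic can be checked directly against the sphere-packing (Hamming) bound; this sketch evaluates it for 260 total bits and t = 1:

```python
import math

def hamming_bound_log2(n, t):
    """log2 of the max number of codewords for a binary code of length n
    correcting t errors: 2^n / sum_{i=0..t} C(n, i)."""
    volume = sum(math.comb(n, i) for i in range(t + 1))
    return n - math.log2(volume)

# 256 payload bits + 4 checksum bits = 260 total; correcting one error
# requires distance 3, i.e. t = 1
max_data_bits = hamming_bound_log2(260, 1)   # ~251.97, short of the 256 needed
```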
The 3-bit code just needs to detect as many contiguous 3 bit-errors as possible.
I think the simplest approach is:
This approach is guaranteed to detect 1 error (which is the best the Hamming bound will let us do, for low numbers of bits) and to detect n consecutive errors.
ChatGPT explained like I was five that not even the parity bit stays valid after ms32_interpolate().
The problem with adding a checksum that won't interpolate is attacker can guess which shares are derived and which were generated better than chance. That's why I added double_sha256 bits only to master_seed, not every encode()'ed index.
So replacing that on "s" with your parity suggestion is an improvement if what I want (interpolate into more valid binary codes) is impossible.
And in your opinion, would you apply the parity to all k generated shares, to k-1 generated shares and the codex32 secret, or only to the codex32 secret?
If shares, they only help the user if they have that subset, while every recovery will need the secret, so it's good to be certain the code is valid there.
How would they know they had a valid ECC share? It seems like the generated share indexes have to be fixed: a, c, d, up to k or k - 1 have valid codes, then the remaining indexes are shuffled (as those are the ones that actually reveal the size of the backup; you can infer a exists given b, from the threshold 2).
I'm not sure. I agree that if it's not preserved by ms32_interpolate, then this is much less useful and also a privacy loss. Will need to do some experiments since I'm not sure analytically how to approach this. Parity checks are definitely additively homomorphic (if you add two shares they're preserved) but I'm unsure how they behave under bech32 multiplication.
Unfortunately it looks like no, in general, this sort of parity check is not preserved by interpolation. Damn.
In fact -- even if you set these bits to fixed 0, after ms32_interpolate, they might not be 0!
I checked the BIP and it looks like we (correctly) say to choose these at random, but I had never considered this to be particularly important until now, and I think in my personal shares I may have set these bits to 0... (though for my personal shares I also generated them starting from S, A, C, D, in order, so I already don't have index privacy or privacy against initial/derived discriminators).
In fact -- even if you set these bits to fixed 0, after ms32_interpolate, they might not be 0!
There's one last thing to try, xor every 5th bit and see what happens. That should give the exact same result as adding every symbol and truncating so maybe IT will be ms32_interpolate()-able.
Or maybe better, xor the 4th bit of every symbol to pick the 1st bit of padding and the 5th bit of every symbol for the 2nd. 128-bit case. That way we're sampling in the same period of the higher field it might work. This way last partially filled symbol doesn't get included.
Zeros do enhance the error correction of the codex32 checksum, but only for the last character of generated strings. That's kind of flimsy when they could detect errors anywhere (and enhance the chances of error correction more often).
Leaves 9 choices:
Type of padding:
Where we should pad:
My original implementation generated len(master_seed)+1 bytes randomly for share payloads and then truncated. If generated shares are padded with coding they need fixed indexes so the user knows they have correction help, this has no impact on redundancy degree privacy. The derived shares still need deterministically random indexes.
If k-of-(k+1) or smaller backups are most common, k shares is best.
If n > k, then the best choice between s & k - 1 shares vs k shares can be calculated by number of sets in permutations(k,n) that have superior ecc. I don't think that's complex math so the function could choose based on k and n what to do. But choosing leaks a bit of redundancy privacy since the likelihood of finding early alphabet shares is k/n vs (k-1)/n and which of those it is suggests a lower or upper bound for n, since it was a function of k vs perfectly private odds if all shares are generated with random padding and indexes.
Basically I'm suggesting calculating the parity symbol in GF(32) excluding the padding symbol, then truncate the first bits to append it as padding and see if that will interpolate and remain a parity symbol for every share.
And then this will bring up question 1 again: is it orthogonal to the codex32 checksum?
Intuitively this should work, since you're just adding the symbols in gf32 and "truncating", which is actually taking the value mod x^2, which is linear ... except that these bits aren't sitting isolated outside of the string. They're embedded inside of the last character, where they algebraically interact with the other bits of that character. So I don't think it works.
That is, we're not saying (x1 + x2 + x3) mod x^2 = Y, we're saying (x1 + x2 + x3 mod x^2) + x4 = Y. The former would be compatible with interpolation, the latter won't be.
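The non-linearity can be seen concretely. Here is a small experiment (a sketch assuming GF(32) is built with the modulus x^5 + x^3 + 1, as in codex32's field; `gf32_mul` is a helper written for this example): multiplication, which interpolation performs via Lagrange coefficients, wraps a symbol's high bits into its low bits, so two symbols that agree on their bottom two padding bits can disagree after multiplication.

```python
def gf32_mul(a, b):
    """Multiply two GF(32) elements, reducing by x^5 + x^3 + 1 (0b101001)."""
    r = 0
    for i in range(5):          # schoolbook carry-less multiply
        if (b >> i) & 1:
            r ^= a << i
    for i in range(8, 4, -1):   # reduce degrees 8..5 back into the field
        if (r >> i) & 1:
            r ^= 0b101001 << (i - 5)
    return r

# 16 (= x^4) and 0 share the same low two bits (both 0)...
# ...but multiplying by x (= 2) wraps x^4 into x^3 + 1, changing the low bits
low_bits = (gf32_mul(16, 2) & 3, gf32_mul(0, 2) & 3)
```

So the padding bits of a share do not transform independently of the rest of the last symbol.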
When considering among these:
- k generated shares
- k-1 generated shares and the codex secret
- Only the codex32 secret
I realize from pure error detection ability, 1 & 2 are strictly superior, regardless which shares the user wants to recover with. Because they represent k * pad_len total checksum bits as opposed to 2-4. Here the non-linearity works to our advantage as threshold shares will always be able to interpolate back to index 16, 0, ... k - 1 and attempt corrections to make all k of these codes valid before checking an address.
Index 16 "s" is padded first so the unshared secret can also benefit from ecc padding.
Finally, by fixing the index of the ecc generated shares, that doesn't have to have any privacy impact at all (besides sometimes revealing this scheme is in use, which is fine if the standard) if the share set we output is randomly shuffled from the 31 total.
It's not necessary to grab the share with or without this padding in order to use it once you have k shares and can interpolate to the k padded strings.
These strings are all independent so still no error correction guarantees, but my instinct tells me triple the bits for k=3 gives 1/3 the chance of not detecting an error pattern.
The actual formula is the Hamming bound. For binary codes q = 2, and n is the total length of your data including checksum (so 260). To correct a single error you need d = 3, so t = 1, and the formula simplifies to give a total code size of 2^260 / (1 + 260) = 2^251.97. Unfortunately not enough to fit 256 bits of data.
By this formula k=4 and 128-bit shares and k=3 and 256-bit shares may be able to correct one bit error if the other k-1 strings are valid and correct (reasonable assumption if the codex32 checksum is valid).
Look at this absolute masterpiece:
```python
def fingerprint_identifier(payload_list):
    from electrum.bip32 import BIP32Node
    from electrum.crypto import hash_160
    pubkeys = b''
    for data in payload_list:
        root_node = BIP32Node.from_rootseed(data, xtype="standard")
        pubkeys += root_node.eckey.get_public_key_bytes(compressed=True)
    return ''.join([CHARSET[d] for d in convertbits(hash_160(pubkeys), 8, 5)])[:4]
```
Where payload list is each payload in bytes at index 16, then 0, ... k - 1.
Where payload_list is each payload in bytes at index 16, then 0, ... k - 1.
For k = 0 or reshare=False, [master_seed] is passed to payload_list and it returns the same fingerprint as prepended in its own descriptors. :) For reshare=True it forms a unique combinable share set ident regardless of seed_length and k thanks to the prehashing. And it's expensive and can help further error-detect the k generated shares from above.
Design decision fork:
Should re-shares require new app_entropy and user_entropy to create a fresh derived_key, or be determined by master_seed and header alone?
master_seed and header alone means:
- master_seed has just 2^23 possible corresponding share sets (all thresholds)
- share sets are derived from master_seed rather than being independent unrecoverable secrets of their own
- ident becomes a nonce to determine share payloads, vs generated payloads determining the ident as above
- error detection is master_seed + unique ident vs hash_160(pubkeys) on a best-effort basis
- a user could reuse an ident when re-sharing, breaking security
- collisions would have to be caught by ident rather than avoided by generating fresh entropy for the new shares

If we go with header and master_seed alone: probably rename ident to unique_id or unique_ident to emphasize the criticality that it never be reused for the same master_seed unless deliberately trying to add shares to an existing backup.
If we're okay with a risk of share set collisions when the user doesn't know all previously used identifiers to guarantee a unique ident, then fresh_master_seed() could simplify to this:
```python
def fresh_master_seed(bitcoin_core_entropy, user_entropy='', seed_length=16, k='2', ident=''):
    """Derive a fresh master seed of seed_length bytes with optional user-provided entropy."""
    import hashlib
    if not 16 <= seed_length <= 64:
        return None
    if not 2 <= int(k) <= 9:
        return None
    master_seed = hashlib.scrypt(password=bytes(user_entropy + str(seed_length) + k + ident, "utf"),
                                 salt=bytes(bitcoin_core_entropy, "utf"),
                                 n=2 ** 20, r=8, p=1, maxmem=1025 ** 3, dklen=seed_length)
    if not ident:
        ident = fingerprint_ident([master_seed])
    return encode('ms', k, ident, 's', master_seed)
```
Really, it's fresh_codex32_secret(), but it could also return (master_seed, ident) to keep the current name accurate.
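For illustration, the shape of that scrypt call in isolation (a sketch, not the real parameters: n is reduced from 2**20 to 2**14 so it runs quickly, the entropy strings are toy placeholders, and the encode()/fingerprint_ident() steps from the thread are omitted):

```python
import hashlib

user_entropy = 'dice rolls or any extra user entropy'   # placeholder
bitcoin_core_entropy = 'hex from getrandom'             # placeholder
seed_length, k, ident = 16, '2', ''

master_seed = hashlib.scrypt(
    password=bytes(user_entropy + str(seed_length) + k + ident, 'utf'),
    salt=bytes(bitcoin_core_entropy, 'utf'),
    n=2 ** 14, r=8, p=1, maxmem=2 ** 26, dklen=seed_length)
```

The derivation is deterministic: the same inputs always reproduce the same master_seed, which is what makes the share set auditable.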
It's not terrible if fresh entropy creates an ident that matches an existing one (1% chance after 146 sets) because they have independent payloads and an incorrect recovered secret should fail the hash_160(pubkeys) last-resort error detection. But it is bad (as discussed earlier) if an ident nonce is accidentally reused to re-share a master_seed, which has the same odds as above.
In the event a previously used ident is not known, the user has to generate fresh entropy anyway (to minimize collision odds). So using header as nonce is really only advantageous when all are known. Which is always true for the first re-share.
Because of the small identifier, which was not intended to be used as a nonce (or it'd be 32 or 64 bits), this cannot be reduced to insignificant odds. ≈ 2 * 10^-6 is the chance of 2 identifiers colliding when selected randomly. There's a reason why the bech32 checksum is 6 characters. If everyone on earth made a codex32 backup and re-shared an existing master_seed twice, all forgetting the original ident used (having a default may help extend this to a 3rd reshare since we can assume fresh_codex32_secret() used the bip32 fingerprint), 16,000 of them are going to accidentally reshare the same share set even though they did everything correctly (except not forgetting old idents). If the identifier were increased to 6 characters this could be 16 self-victims.
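These figures follow from the birthday approximation over the 20-bit identifier space; a quick check of the 1% and ~16,000 numbers:

```python
import math

IDENT_SPACE = 2 ** 20     # 4 bech32 characters = 20 bits

def collision_probability(n, space=IDENT_SPACE):
    """Birthday approximation: chance that n random identifiers collide."""
    return 1 - math.exp(-n * (n - 1) / (2 * space))

p_146 = collision_probability(146)     # about 1%
# two fresh reshare idents, each a 1-in-2^20 shot at hitting the forgotten one
per_user = 2 / IDENT_SPACE             # about 1.9e-6
victims = 8e9 * per_user               # on the order of the 16,000 estimate
```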
I'm also open to a combination approach where it uses master_seed and header if the user can affirm the new ident is absolutely unique, but requires fresh entropy and makes a new derived_key when they cannot give such assurance. This minimizes the risk of colliding share sets by having 2^((k - 1) * seed_length) possible re-share sets instead of just 2^20, while also minimizing labor and complexity when the user is absolutely sure they picked a unique id, which is true at least for the first re-share and sometimes the 2nd.
ChatGPT informs me appending a unix timestamp to the header
before generating new shares for an existing master_seed
will prevent share payload collisions without needing to ask the user for fresh entropy. The time used should be displayed unconditionally for auditing purposes though. Malicious implementations could reuse a previous timestamp so it's better to display a human readable date time. Precision down to the minute is enough as no one will write and confirm multiple re-shares per minute. Or maybe just the date is enough and has better odds of being recalled if there's a need to recreate an existing re-share set.
Make the call here and I'll code for_an_existing_master_seed()
.
By this formula k=4 and 128-bit shares and k=3 and 256-bit shares may be able to correct one bit error if the other k-1 strings are valid and correct (reasonable assumption if the codex32 checksum is valid).
This error correction code would only work for a single blessed share set, not arbitrary sets of k. And it would be visible to a privacy attacker which share set was blessed. And it's still not guaranteed that the error we can correct will be distinct from the errors that the full codex32 code can correct. TBH I think this line of reasoning is just too much complexity for too little benefit and we should just use the bits unused.
Look at this absolute masterpiece:
Heh, the overlap in the k=0 case with BIP32 is pretty nice. And it is easy to implement in terms of off-the-shelf BIP32 methods, even though the bare algorithm feels a little overcomplicated.
Should re-shares require new app_entropy and user_entropy to create a fresh derived_key or be determined by master_seed and header alone?
Great question. Thanks for the very thorough post. I am leaning toward the "stick an external counter/timestamp/weak randomness in there" approach. Requiring user_entropy
is too much burden on the user (as is requiring the user to keep track of uniqueness, if we were considering that) but I think it's fine to require the app to generate a unique token. I think our specific recommendation should be: either (a) a monotonic counter, or (b) at least 12 bytes (96 bits) of fresh randomness. The UNIX timestamp isn't a great choice because it only changes every second, and it won't be accessible to most HWWs.
Regarding error detection abilities: my feeling is that resharing will be fairly uncommon. So I think there's a reasonable amount of value to having error detection in the initial sharing but greatly diminished value in subsequent sharings. Remember that the checksum itself provides overwhelming protection against random errors; the fingerprint detection gives us the additional ability to detect malicious errors, which is itself a fringe/uncommon case. And of course, we always have this ability if we're willing to derive full addresses.
So basically, if we use the bip32 fingerprint for the initial sharing, this gives us a bunch of nice benefits (cheap/local error detection mainly) at basically no cost. But for re-sharing, the costs become nontrivial ... and especially, there is a risk of ident reuse leading to shareset reuse, which is probable enough that it'll happen to a nonzero number of users.
As a further point, even with the master_seed/header scheme, the error detection is not great. We no longer have the BIP32 fingerprint to compare against. Instead we have a weird hash of the share data itself, which is not something the wallet can produce on demand if the user has a single share they want to check on. Plus it's extra implementation complexity, reduced auditability, etc. So the benefit isn't even as big.
(I had a much longer reply suggesting the opposite, which took me half an hour to write but I decided that I disagreed with it :). This is quite a time-consuming exercise..)
This error correction code would only work for a single blessed share set, not arbitrary sets of k. And it would be visible to a privacy attacker which share set was blessed. And it's still not guaranteed that the error we can correct will be distinct from the errors that the full codex32 code can correct. TBH I think this line of reasoning is just too much complexity for too little benefit and we should just use the bits unused.
This will be my last question on the padding shares topic:
If a secret S and shares A, C have ecc padding, the threshold is 3, and the user finds D, E, F but F has uncorrectable errors, can he ms32_interpolate() each candidate over-correction of share F (together with D and E) to secret S, share A and share C, and then check whether their ECC padding is valid to save work?
Look at this absolute masterpiece:
Heh, the overlap in the k=0 case with BIP32 is pretty nice. And it is easy to implement in terms of off-the-shelf BIP32 methods, even though the bare algorithm feels a little overcomplicated.
I could've concatenated them as seeds to use def calc_fingerprint_of_this_node(), which is a method in every bip32 library, but that is ambiguous for different seed_length. It doesn't matter, as ident is meant to differentiate incombinable share sets and a length difference does that for free.
I have an even better idea for the re-share ident, however: encrypt the bip32 fingerprint by the birthdate/timestamp.
Should re-shares require new app_entropy and user_entropy to create a fresh derived_key or be determined by master_seed and header alone?
I am leaning toward the "stick an external counter/timestamp/weak randomness in there" approach.
I coded the unique salt for the derived_key as follows:
```python
# date used to create a unique nonce if user forgets any past identifiers
date = datetime.date.today().strftime("%Y%m%d") if forgot else '19700101'
# must be displayed unconditionally for auditing purposes if forgot=True
fingerprint = fingerprint_ident([master_seed])
salt = (codex32_secret[:9] + fingerprint + date).encode()
derived_key = hashlib.pbkdf2_hmac('sha512', password=master_seed, salt=salt,
                                  iterations=2048, dklen=64)
```
I used the 8-digit zero-padded year date to maximize the chances someone would be able to recover the salt. If someone forgets an identifier within 24 hours and reuses it, they don't deserve security.
However, I can imagine someone performing this operation more than once in a day, so it might need precision down to the hour or minute, although that makes it harder to regenerate this particular set in the future.
I'd appreciate your feedback on the precision of the timestamp (a trade-off between repeat-generation ease and avoiding nonce reuse) and whether the timestamp should just be used by default (adds extra info to remember/grind to regenerate and audit, but would cover the "forgot I forgot" worst-case scenario) and eliminate the forgot parameter.
I also changed fingerprint_ident() to not truncate, so it adds the whole hash_160 to the salt: a nice slowdown if master_seed is partially leaked and being brute-forced against known shares, for free.
Requiring user_entropy is too much burden on the user (as is requiring the user to keep track of uniqueness, if we were considering that) but I think it's fine to require the app to generate a unique token. I think our specific recommendation should be: either (a) a monotonic counter, or (b) at least 12 bytes (96 bits) of fresh randomness. The UNIX timestamp isn't a great choice because it only changes every second, and it won't be accessible to most HWWs.
Who can write and confirm 2 codex32 share backups in 1 second? If HWWs don't have a clock, why don't we just ask the user for the date (or another unique string they've never typed before, which avoids trusting the hardware not to leak data via a chosen nonce)? And if the hardware has a clock, display the date used unconditionally so it can be audited.
So I think there's a reasonable amount of value to having error detection in the initial sharing but greatly diminished value in subsequent sharings.
I think we can get it in both now with encryption, by simply remembering or recovering the nonce used on resharing.
I wrote existing_master_seed() so that master_seed provides all the entropy, even for re-shares. It's just simpler this way. Using app entropy + user entropy and deriving shares, index_seed and master_seed from that can never regenerate a share set without a threshold of that particular set's shares. By contrast, master_seed as the "seed" of the share set allows adding shares to other re-share sets without access to a threshold of the set: just knowing their threshold & identifier is enough, and possibly now, their birthdate (which we can brute force out of the new re-share identifier, with a known bip32 fingerprint, maybe, if you don't make it too precise!).
So basically, if we use the bip32 fingerprint for the initial sharing, this gives us a bunch of nice benefits (cheap/local error detection mainly) at basically no cost. But for re-sharing, the costs become nontrivial ... and especially, there is a risk of ident reuse leading to share set reuse, which is probable enough that it'll happen to a nonzero number of users.
As a further point, even with the master_seed/header scheme, the error detection is not great. We no longer have the BIP32 fingerprint to compare against. Instead we have a weird hash of the share data itself, which is not something the wallet can produce on demand if the user has a single share they want to check on. Plus it's extra implementation complexity, reduced auditability, etc. So the benefit isn't even as big.
Why don't we encrypt the bip32 fingerprint by the timestamp (unique new entropy used for the reshare). Two things the wallet/user can probably can produce on demand (or brute force).
That way we get a random ident and don't lose the error detection the fingerprint provides if we know the birth date.
I think this is a MUCH better idea than that convoluted concatenation of share payloads, and because this "encrypted" fingerprint feeds into the derived_key derivation, it still shuffles the indexes and generates new randomness for payloads just fine.
```python
date = datetime.date.today().strftime("%Y%m%d") if forgot else '19700101'
# must be displayed unconditionally for auditing purposes if forgot=True
fingerprint = fingerprint_ident([master_seed])
salt = (codex32_secret[:9] + fingerprint + date).encode()
derived_key = hashlib.pbkdf2_hmac('sha512', password=master_seed, salt=salt,
                                  iterations=2048, dklen=64)
if reshare:
    encryption_key = hashlib.pbkdf2_hmac('sha512', password=date.encode(), salt=salt,
                                         iterations=2048, dklen=3)
    # xor the 20-bit ident with the first 20 bits of the key, symbol by symbol
    key_symbols = convertbits(encryption_key, 8, 5)[:4]
    new_ident = ''.join(CHARSET[CHARSET.index(c) ^ k]
                        for c, k in zip(old_ident, key_symbols))
```
Optionally, the hash_160(master_pubkey) may be dropped from salt as the wallet might not have access to it. Then the identifier can be decrypted / error detected with the correct fingerprint, the header used to generate the reshare and the birthdate.
If a secret S, share A, C have ecc padding, the threshold is 3 and the user finds D, E, F but F has an uncorrectable errors, can he ms32_interpolate() using the candidate share F over-corrections to secret S, share A and share C and then check if their ECC padding is valid to save work?
I'm not sure what this means but I don't think so.
I used the 8 digit year date with zero padding to maximize the chances someone would be able to recover the salt. While if someone forgets an identifier within 24 hours and reuses it they don't deserve security.
They don't need to forget it to reuse it. They need to fail to actively override their wallet's default ident computation, because they are aware of this "24 hour" rule that doesn't exist in any other scheme. I think this is too much of a burden on users. And yes, users will want to do this more than once per day (likely more than once per second, if their hardware can compute fast enough).
You also still have the problem that HWWs don't know the time, even to 24-hour precision.
Who can write and confirm 2 codex32 share backups in 1 second?
You don't need to write and confirm them before you start generating the next set. You might generate all the sets you need and then write/confirm them all.
If HWWs don't have a clock, why don't we just ask the user for the date (or other unique string they've never typed before, avoids trusting the hardware to not leak data by chosen nonce.) And if the hardware has a clock, display the date used unconditionally so it can be audited.
Because this is extra work for the user and has a nonobvious application (we are asking for a current date, which seems like an unnecessary and privacy-reducing thing to answer honestly, but actually we just want a unique string). And we can accomplish it by just generating random bytes.
I really don't like any of these date-based ideas.
They need to fail to actively override their wallet's default ident computation...
A wallet implementation should store the identifiers and bip32 fingerprints, ideally persistently. So that it never reuses them. Perhaps incrementing ident LSB or throwing error messages on collisions.
You also still have the problem that HWWs don't know the time, even to 24-hour precision.
Looks like HWW need (optional?) user provided uniqueness during reshares if all past IDs aren't stored/remembered.
It's the only auditable way. But the odds they'll input a unique string are higher than directly choosing a new ident.
You don't need to write and confirm them before you start generating the next set. You might generate all the sets you need and then write/confirm them all.
Without a persistent counter, a complete record of previously used identifiers, or a unique input, ident uniqueness and determinism for auditability cannot be simultaneously guaranteed (to 1/2^10).
but actually we just want a unique string). And we can accomplish it by just generating random bytes.
Agree completely it should be "unique string" and this is a strange hassle to protect users from forgetfulness. But on the other hand, it's less work than for a fresh master seed. 20-bits of unique user entropy fully protects the reshare vs 128-bits for fresh seeds.
It's unsafe to let a wallet alone generate a random nonce for reshares, it could grind for ones that make your shares and/or the ident leak parts of the secret.
If the random bytes are generated by KDF/HMAC(app_entropy, user_unique_string) with app entropy unconditionally displayed before the user types
This is totally fine, the string "42" or "we" may be unique enough for the purpose.
Anything never typed before is unique to 1/2^10, and perfect uniqueness is only achieved by storing and/or remembering every ident.
I really don't like any of these date-based ideas.
Can date, all we need is a unique input.
A wallet implementation should store the identifiers and bip32 fingerprints, ideally persistently
Then these would need to be backed up along with the seed data?
If we just used 96 random bits there would be no need for anyone to store anything, persistently or otherwise.
Without a persistent counter, a complete record of previously used identifiers, or a unique input, ident uniqueness and determinism for auditability cannot be simultaneously guaranteed (to 1/2^10).
Right. We should just drop auditability for re-sharing. It's not a common case and all the proposed solutions seem to have serious usability problems.
If the random bytes are generated by HMAC(app_entropy, user_unique_string) with app entropy unconditionally displayed before the user types
This seems good to me. Though I do think we should query for "a unique string" rather than trying to give it any other semantic content. And maybe suggest the user roll dice ten times to get the unique string, or something. If it's too short/predictable the wallet could theoretically still grind.
Then these would need to be backed up along with the seed data?
In a digital backup, yes. On paper or metal, no.
The list helps prevent ident reuse but it's not essential data, treat it like wallet metadata (ex: prevents address reuse)
It's okay for idents to collide; they're likely to every 2^10 sets generated. It's only bad if payloads collide.
I edited that to mention the app_entropy and unique_user_string should be hardened by PBKDF2 to slow grinding.
How about encrypting (still using PBKDF2) the bip32 fingerprint by the unique string and using cipher text as new_ident?
Then we don't lose the error correction and other benefits as long as user remembers the unique string.
How about encrypting (still using PBKDF2) the bip32 fingerprint by the unique string and using cipher text as new_ident?
This is a neat idea. Yeah, let's do it. No need to use fancy encryption; just xor with the first 20 bits is fine.
This is a neat idea. Yeah, let's do it. No need to use fancy encryption; just xor with the first 20 bits is fine.
XOR is fine as the encryption algorithm, but the "unique string" shouldn't be recoverable by someone who knows the bip32 fingerprint and the ciphertext, so PBKDF2("unique string") is needed for key derivation.
Non-zero chance people would enter names, birthdates, street numbers, PINs as a "unique string" and give a thief extra help.
Share payloads, unlike new_ident, would incorporate both app_entropy and unique_string, so those will never collide (if the RNG is honest) even if the ident collides.
Yep, agreed -- we'd xor with the (truncated) output of the PBKDF2.
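Concretely, the xor-with-truncated-PBKDF2 idea might look like this (a sketch; the function name, salt, and iteration count are assumptions):

```python
import hashlib

def encrypt_ident(fingerprint_bits: int, unique_string: str,
                  salt: bytes = b"codex32 ident",
                  iterations: int = 100_000) -> int:
    """XOR the 20-bit bip32 fingerprint prefix with 20 bits of PBKDF2 output.

    Because PBKDF2 is one-way and slow, knowing the fingerprint and the
    ciphertext does not reveal the unique string.
    """
    key = hashlib.pbkdf2_hmac("sha256", unique_string.encode(), salt, iterations)
    pad = int.from_bytes(key[:3], "big") >> 4  # first 20 bits of the output
    return fingerprint_bits ^ pad
```

Since XOR is an involution, applying `encrypt_ident` again with the same unique string recovers the original fingerprint bits.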
While codex32 is intended for paper computers, some of its benefits can be maximized by electronic implementations. To that end, I propose recommendations for creating auditable codex32 share sets when using electronics.
Inspired by the concept of auditable bitcoin wallets, I have developed an implementation for a fresh master seed that ensures full determinism in generating the codex32 share set.
I believe that deterministically choosing every aspect of the codex32 share set protects against malicious tampering of shares, since any deviation from the determinism can be detected.
In issue #54, we discussed that the `hash160(public_masterkey)` from descriptors serves as useful deterministic data for the default `identifier`. However, we need to address another aspect that could potentially leak information: the share indices and payloads of the `k - 1` independent shares.

Currently, the BIP recommends selecting the next available letter from the bech32 alphabet, in alphabetical order, as share indices.
This approach may inadvertently reduce privacy: finding share `f`, `g`, or `h` would reveal a backup of at least 5, 6, or 7 shares, respectively. To improve this, I propose an update to the BIP where, during hand creation, dice are rolled to select each new share index. This is akin to how indices are created in software like `libgfshare-bin`.

For electronic implementations, share indices should be determined by the codex32 secret alone. Below is an example of how I chose the share indices in Python:
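(A sketch of this approach; the function name and exact message construction are my assumptions:)

```python
import hashlib
import hmac
import random

CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # the bech32 alphabet

def choose_share_indices(master_seed: bytes, threshold: int,
                         identifier: str, n: int) -> list:
    """Deterministically pick n share indices from the codex32 secret's variables.

    HMAC the secret's variables together, then use the digest to seed a
    deterministic sample of the bech32 alphabet ("s" is reserved for the
    secret itself, so it is excluded).
    """
    msg = bytes([threshold, len(master_seed)]) + identifier.encode()
    digest = hmac.new(master_seed, msg, hashlib.sha512).digest()
    rng = random.Random(digest)
    return rng.sample([c for c in CHARSET if c != "s"], n)
```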
The `master_seed`, `threshold`, `identifier`, and `seed_length` are the only variables in a codex32 secret, so I HMAC these together to seed the selection of share indices, although a hash of the codex32 secret string would also work.

I think this is the right track, but I'm looking to standardize recommendations for electronic implementations. If there's a better way, please say so.
Below is an example of how I generated the share payloads in Python:
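(A sketch under my own assumptions; the header construction, `n = 2**14` scrypt cost, and function name are illustrative, not the proposal's exact parameters:)

```python
import hashlib

def generate_payloads(master_seed: bytes, k: int, identifier: str,
                      n: int = 2**14) -> list:
    """Stretch master_seed with scrypt and split it into k - 1 payloads.

    The salt is built from non-secret header data. Each payload gets one
    extra byte to fill the padding bits of the final bech32 character.
    """
    header = ("ms1" + str(k) + identifier).encode()
    payload_len = len(master_seed) + 1
    stretched = hashlib.scrypt(master_seed, salt=header,
                               n=n, r=8, p=1,
                               dklen=(k - 1) * payload_len)
    return [stretched[i * payload_len:(i + 1) * payload_len]
            for i in range(k - 1)]
```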
While this is a start, it may not be suitable as a standard recommendation because the scrypt parameters won't run on hardware wallets. To address this, I am open to exploring alternative variable-length secure hash functions. Your input on this matter would be appreciated.
Next, this stretched entropy is divided evenly into `k - 1` share payloads. Additionally, an extra byte per payload is allocated for filling the padding bits of the 26th or 52nd character. Subsequently, the remaining `n + 1 - k` shares are derived according to the BIP.

Lastly, if a new set of shares is created for an existing `master_seed` with the same `threshold`, I propose incrementing the last character of the `identifier` by 1, which causes a completely new set of indices and share payloads to be chosen.

I welcome any feedback or suggestions for improving this approach. Let's establish standardized and auditable recommendations for electronic codex32 implementations. Thank you for your collaboration.
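For example, the increment could wrap around within the bech32 alphabet (a sketch; `next_identifier` is an assumed name):

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # the bech32 alphabet

def next_identifier(identifier: str) -> str:
    """Increment the identifier's last character, wrapping within bech32."""
    i = (CHARSET.index(identifier[-1]) + 1) % len(CHARSET)
    return identifier[:-1] + CHARSET[i]
```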