BlockstreamResearch / codex32

A paper computer for Shamir's Secret Sharing over the Bech32 alphabet.

Recommendations for Auditable Electronic Codex32 Implementations #57

Open BenWestgate opened 1 year ago

BenWestgate commented 1 year ago

While codex32 is designed for paper computers, some of its benefits can be maximized by electronic implementations. To that end, I propose recommendations for creating auditable codex32 share sets when using electronics.

Inspired by the concept of auditable bitcoin wallets, I have developed an implementation of the BIP's "For a fresh master seed" procedure which makes generation of the codex32 share set fully deterministic.

I believe that deterministically choosing every aspect of the codex32 share set protects against malicious tampering with shares, since any deviation from the deterministic derivation can be detected.

In issue #54, we discussed that the hash160(public_masterkey) from descriptors serves as useful deterministic data for the default identifier. However, we need to address another aspect that could potentially leak information: the share indices and payloads of the k - 1 independent shares.

Currently, the BIP recommends selecting the next available letter from the bech32 alphabet, in alphabetical order, as share indices:

Take the next available letter from the bech32 alphabet, in alphabetical order, as a, c, d, ..., to be the share index

This approach may inadvertently reduce privacy, as finding share f, g, or h would reveal a backup of at least 5, 6, or 7 shares, respectively. To improve this, I propose an update to the BIP where, during hand creation, dice are rolled to select each new share index. This method is akin to how indexes are created in software like libgfshare-bin.

For electronic implementations, share indices should be determined by the codex32 secret alone. Below is an example of how I chose the share indices in Python:

import random
import hashlib
import hmac

# Seed the RNG with the only variable data in a codex32 secret so that
# index selection is fully deterministic and auditable.
salt = (threshold + identifier + str(len(master_seed))).encode()
random.seed(a=hmac.digest(master_seed, salt, 'sha512'))
share_index = random.sample(indices_free, n)  # indices_free: unused bech32 indices

The master_seed, threshold, identifier, and seed_length are the only variables in a codex32 secret, so I HMAC these together to seed the selection of share indices, although a hash of the codex32 secret string would also work.

I think this is the right track but I'm looking to standardize recommendations for electronic implementations. If there's a better way please say so.

Below is an example of how I generated the share payloads in Python:

share_entropy = hashlib.scrypt(password=master_seed, salt=salt, n=2**20, r=8, p=1, maxmem=1025**3, dklen=(int(k)-1)*(seed_length+1))

While this is a start, it may not be suitable as a standard recommendation because these scrypt parameters won't run on hardware wallets. To address this, I am open to exploring alternative variable-length secure hash functions. Your input on this matter would be appreciated.

Next, this stretched entropy is divided evenly into k - 1 share payloads. Additionally, an extra byte per payload is allocated for filling the padding bits of the 26th or 52nd character (a 16-byte payload spans 26 characters with 2 padding bits; a 32-byte payload spans 52 characters with 4 padding bits).

Subsequently, the remaining n + 1 - k shares are derived according to the BIP:

    for x in range(int(k) - 1):
        # slice the stretched entropy into the k - 1 independent share payloads
        data = list(share_entropy[x * (seed_length + 1):(x + 1) * (seed_length + 1)])
        share_list += [encode("ms", seed_length, k, identifier, share_index[x], data)]
    for x in range(int(k) - 1, n):
        # interpolate the remaining shares from the independent shares and secret
        derived_share_list += [derive_new_share(share_list, share_index[x])]

Lastly, if a new set of shares is created for an existing master_seed with the same threshold, I propose incrementing the last character of the identifier by 1, which will cause a completely new set of indices and share payloads to be chosen.

I welcome any feedback or suggestions for improving the approach. Let's establish standardized and auditable recommendations for electronic codex32 implementations. Thank you for your collaboration.

BenWestgate commented 1 year ago

https://github.com/BenWestgate/codex32/tree/master/reference/python-codex32/src

BenWestgate commented 1 year ago

Okay, I solved a great problem:

password=master_seed indeed has the flaw that zero padding it without changing the header won't change the derived key.

This is a problem because that padding does change the master xprv (BIP32 has the roles reversed: there, the seed is the HMAC message rather than the key).

Including len(master_seed) in the salt was my original quick fix, and perhaps it still belongs since it's data revealed in the shares. What would be even better is adding the master pubkey fingerprint after the codex32 secret header.

salt=header||fingerprint

A salt is supposed to include a pseudorandom component anyway, and this one is expensive to compute.

Alternatively, salt=master_seed password=header avoids the flaw.

Salts aren't supposed to be reused (they would be here when re-sharing a secret).

Adding len(master_seed) to salt doesn't fix its reuse across different passwords (master_seeds).

Adding the fingerprint is better: the salt will always be unique per unique master_seed OR header.

apoelstra commented 1 year ago

Wow, this padding in HMAC is a huge footgun. Good catch. I agree with you then that we should only use fixed/non-secret data for the "key" and put our actual key data into "data".

It also sounds like you have good justification from RFCs for all of the choices you're making, which is great because this has gotten complicated enough that it's quite hard to vet.

This is generalizing the function from turning 1 codex32 string into a deterministic share set, to turning any codex32_strings_provided <= k into an auditable deterministically generated and shuffled share set.

Interesting. So, there is a danger here that the user can abuse this to create multiple sets that share some shares (but not others). I don't mind supporting this but we should have some pretty strong warnings saying not to do this, and if they want to have the same person participate in multiple sharings, they should give that person multiple shares.

There is also a danger that users would do this accidentally; they generate some shares from s as expected, then later they don't have s but do have a bunch of shares, so they try to regenerate the original shares from that. Or maybe they do have s, but also have a, and think "I've got this extra data, might as well enter it right?" and then accidentally generate new shares instead of regenerating old ones.

BenWestgate commented 1 year ago

Wow, this padding in HMAC is a huge footgun. Good catch. I agree with you then that we should only use fixed/non-secret data for the "key" and put our actual key data into "data".

Putting len(master_seed) and a bip32 fingerprint into the salt also solves the password=master_seed zero-padding issue, and ticks more RFC compliance boxes (pseudorandom salt component, salt can be public, salt is never reused across different passwords, password remains secret).

Interesting. So, there is a danger here that the user can abuse this to create multiple sets that share some shares (but not others).

Yeah I agree, changing the default identifier could avoid this danger.

It's not the master_seed that defines whether shares are compatible, but rather the threshold number of independent codex32 strings used to interpolate the rest of the backup.

If only the codex32 secret is passed, identifier can default to use the bip32 fingerprint.

If shares are passed, a hash of the data of k shares (i.e. the data of shares a, c and d if k = 3) is a more powerful ID for preventing improper share combinations.

This creates a circular reference if the identifier is in salt and determines the share data for a master_seed. But then the share data provided needs to determine the identifier.

I'll have to think about this more. But a hash of k shares data is definitely the safest default ID for backups made by specifying shares and their data.

I don't mind supporting this but we should have some pretty strong warnings saying not to do this, and if they want to have the same person participate in multiple sharings, they should give that person multiple shares.

Absolutely, and if the default output involves "re-labeling" the data provided, then the need to make a new share set will be obvious. It's the identifier (salt?) reuse that is dangerous.

There is also a danger that users would do this accidentally; they generate some shares from s as expected, then later they don't have s but do have a bunch of shares, so they try to regenerate the original shares from that.

If they have k shares they can already recover everything.

The function is called generate_NEW_shares(); it's not for recovering the original set, which we already have in Python and Rust with ms32_interpolate.

You shouldn't be able to generate a new share set without providing the codex32 secret. If you pass k shares you can calculate the secret but have no degrees of freedom to create any new data, so it'd just set the correct identifier and deterministic shuffle for displaying the dependent shares.

Or maybe they do have s, but also have a, and think "I've got this extra data, might as well enter it right?" and then accidentally generate new shares instead of regenerating old ones.

Titling the functions:

  • Recover Old Shares
  • Generate New Backup

is very important to prevent these mistakes. But if you only pass it codex32 string data from an existing backup it made, it ought to produce the exact same backup, as it's deterministic, unless you override the identifier; then it uses the provided data and fills in the rest. (That's what it's doing in the first case too, except you provided identical data to what it would have generated had you not provided it.)

Only if you give it share data that's not from the default output of a given master_seed and header will it make something unique (and relabel it).

This will make more sense if I try to code it first.

BenWestgate commented 1 year ago

Here's my first stab at a reworked "Generating Shares: For a fresh master seed":

def fresh_master_seed(bitcoin_core_entropy, user_entropy = '', seed_length = 16, k = '2', identifier = ''):
    """Derive a fresh master seed of seed_length bytes with optional user-provided entropy."""
    import hmac
    import hashlib
    share_list = []
    share_payload = []
    if seed_length < 16 or seed_length > 64:
        return None
    if int(k) < 2 or int(k) > 9:
        return None
    letters = sorted(filter(lambda c: c.isalpha(), CHARSET))
    derived_key = hashlib.scrypt(password=bytes(user_entropy + str(seed_length+k+identifier), "utf"),
                                 salt=bytes(bitcoin_core_entropy, "utf"), n=2 ** 20, r=8, p=1, maxmem=1025 ** 3, dklen=64)
    relabel = not identifier
    identifier = identifier or 'temp'  # placeholder until we have better info

    for i in range(int(k)):
        # generate random initial shares
        share_index = letters[i]
        header = 'ms1' + k + identifier + share_index
        info = str(seed_length) + '-byte share with header: ' + header
        share_payload += [hmac.new(derived_key, info.encode(), hashlib.sha512).digest()[:seed_length]]
        share_list += [encode('ms', k, identifier, share_index, share_payload[i])]

    master_seed = recover_master_seed(share_list)
    if relabel:
        salt = b''.join(share_payload)
        identifier_data = hashlib.pbkdf2_hmac('sha512', password = master_seed, salt = salt,
                                              iterations = 2048, dklen = 3)
        new_identifier = ''.join([CHARSET[d] for d in convertbits(identifier_data, 8, 5)[:4]])
        share_list = relabel_shares('ms', share_list, new_identifier)
    return master_seed, share_list

I've not implemented shuffle yet so it returns master_seed and the k random shares generated at indices a, c, ... etc.

This default identifier was chosen to give incompatible share sets unique IDs, even if they belong to the same master_seed. I think the bip32 master pubkey fingerprint is more useful (for fresh master seeds, NOT re-shares); I just didn't know which bip32 library to use, or whether to attempt it myself with Python's cryptography library.

This identifier is derived with PBKDF2 using BIP39's cost parameters, so that valid shares can't be checked for correctness against the identifier faster than against an address.

If you like this, I'll make it the default for re-shares when the identifier is not specified, and suggest using the fingerprint for reshare=False (when the user has an existing bip32 master_seed or extended private master key and wants to encode it in a codex32 share set).

apoelstra commented 1 year ago

Identifier.

If you want to use a temporary identifier like this, I'd use 1111 rather than temp. The current code, I believe, will produce the same share data for a user-provided identifier of temp as it will for a non-user-provided identifier.

But instead I'd just split up these cases, and encode one of:

  • Length: + hex-encoded 2-digit length with 0x prefix + ID: + identifier + Threshold: + k
  • Same as above, with ID: + identifier replaced with NO IDEN.

Then in both cases you have a string of the same length where all fields can be found at the same offsets and there is no way that one case can be interpreted as the other.

PBKDF.

I'd use the full shares as salt, not just the payload. In particular you want to make sure k is somehow in there so that the total length is unambiguous. (Currently I think you can get the same salt if you have k = 4 and a 128-bit secret as if k = 2 and a 256-bit secret. Or at least, you need to depend on the security of the HMAC to guarantee this doesn't happen, which is a more complicated argument than just having the length be unambiguous.)

Other than that, this looks good to me!

apoelstra commented 1 year ago

Re moving the index from position 6 to position 1, @roconnor-blockstream argues on IRC that

  1. I'm going to push back and disagree the index is the most important. If you are assembling shares, and the first 5 data characters don't agree (8 characters if you include the prefix) then you are doing something wrong. It is helpful to have all the constant characters all together and at the front.
  2. Because all the constant characters are at the front, it means the first few columns of your checksum worksheet are also going to be constant. That can save a tiny amount of work.

Personally I think (2) is a big deal -- when doing the checksum worksheet by hand, it's refreshing to get a "head start" on the worksheet by having the first several lines all give the same characters. It's faster, it reassures you that you're doing the right thing (or at least, a consistent thing).

Also there is the argument that we already printed a bunch of codex32 books and can't change it :).

BenWestgate commented 1 year ago

Also there is the argument that we already printed a bunch of codex32 books and can't change it :).

Fair enough, the constants autocomplete anyways on electronic recovery.

BenWestgate commented 1 year ago

Identifier.

If you want to use a temporary identifier like this, I'd use 1111 rather than temp. The current code, I believe, will produce the same share data for a user-provided identifier of temp as it will for a non-user-provided identifier.

Decode would fail with identifier = 1111 due to:

    if not all(x in CHARSET for x in bech[pos + 1:]):
        return (None, None, None, None, None)

If the user provides an identifier parameter of temp that will be the final identifier because relabel is assigned False.

    relabel = not identifier
    identifier = identifier or 'temp'

    if relabel:

Should the functions below not validate the charset?

def ms32_decode(bech):
    """Validate a MS32 string, and determine HRP and data."""
    if ((any(ord(x) < 33 or ord(x) > 126 for x in bech)) or
            (bech.lower() != bech and bech.upper() != bech)):
        return (None, None, None, None, None)
    bech = bech.lower()
    pos = bech.rfind('1')
    if pos < 1 or pos + 46 > len(bech):
        return (None, None, None, None, None)
    if not all(x in CHARSET for x in bech[pos + 1:]):
        return (None, None, None, None, None)
    hrp = bech[:pos]
    k = bech[pos + 1]
    if k == "1" or not k.isdigit():
        return (None, None, None, None, None)
    identifier = bech[pos + 2:pos + 6]
    share_index = bech[pos + 6]
    if k == "0" and share_index != "s":
        return (None, None, None, None, None)
    data = [CHARSET.find(x) for x in bech[pos + 1:]]
    checksum_length = 13 if len(data) < 95 else 15
    if not ms32_verify_checksum(data):
        return (None, None, None, None, None)
    return (hrp, k, identifier, share_index, data[:-checksum_length])

def decode(hrp, codex_str):
    """Decode a codex32 string."""
    hrpgot, k, identifier, share_index, data = ms32_decode(codex_str)
    if hrpgot != hrp:
        return (None, None, None, None)
    decoded = convertbits(data[6:], 5, 8, False)
    if decoded is None or len(decoded) < 16 or len(decoded) > 64:
        return (None, None, None, None)
    if k == "1":
        return (None, None, None, None)
    return (k, identifier, share_index, decoded)

A way to recover master_seed without a temporary identifier is to directly ms32_recover() the share payloads converted to lists of base-32 ints, rather than first creating shares and then decode()ing them to recover. This might be more elegant: I can use ms32_payload_list += [convertbits(share_payload[i], 8, 5, False)] to convert bytes to lists of base-32 ints.

Would ms32_recover(ms32_list) work without the identifier in the list? This may be more elegant; do you agree? But an advantage of keeping it encode(payload[i]) is that if we make the padding a BCH binary checksum, the newly generated shares will get it, since padding is handled by encode().

apoelstra commented 1 year ago

If the user provides an identifier parameter of temp that will be the final identifier because relabel is assigned False.

Yes, and despite having a different identifier, the share data will be the same.

Decode would fail with identifier = 1111 due to:

It should fail, because what you're hashing here shouldn't be interpretable as a valid share.

A way to recover master_seed without a temporary identifier is to directly ms32_recover() the share payloads converted to lists of base32 ints rather than first create shares and then decode() those to recover. This might be more elegant, I can use ms32_payload_list += [convertbits(share_payload[i], 8, 5, False)] to convert bytes to lists of base 32 ints.

Yes, I think that would be much easier to follow. As written, you create a temporary identifier, stick this into info, stick that into an HMAC, use that to get share payloads, then recover a master seed from them. It took me a while to confirm that the "temporary" identifier was indeed determining all of the data that was being produced, and doing so in a way that would collide with non-temporary identifiers.

BenWestgate commented 1 year ago

But instead I'd just split up these cases, encode one of

  • Length: + hex-encoded 2-digit length with 0x prefix + ID: + identifier + Threshold: + k
  • Same as above, with ID: + identifier replaced with NO IDEN.

Then in both cases you have a string of the same length where all fields can be found at the same offsets and there is no way that one case can be interpreted as the other.

This is referring to this line? info = str(seed_length)+'-byte share with header: '+header

Can you give an example info= string for each case?

The problem isn't the HMAC; message= is fine with variable length. The problem is that decode() needs a valid identifier or it fails. I call decode() in my brute-force error correction, so it should fail on this too.

apoelstra commented 1 year ago

Can you give an example info= string for each case?

The problem isn't the HMAC; message= is fine with variable length. The problem is that decode() needs a valid identifier or it fails. I call decode() in my brute-force error correction, so it should fail on this too.

The code you posted doesn't call decode anywhere that I see.

BenWestgate commented 1 year ago

If the user provides an identifier parameter of temp that will be the final identifier because relabel is assigned False.

Yes, and despite having a different identifier, the share data will be the same.

They will be different, because the temporary identifier isn't assigned until after derived_key: identifier is still '' if not provided and only assigned 'temp' afterward, whereas if you passed 'temp' it would be used in the scrypt KDF and give a different derived_key.

    derived_key = hashlib.scrypt(password=bytes(user_entropy + str(seed_length+k+identifier), "utf"),
                                 salt=bytes(bitcoin_core_entropy, "utf"), n=2 ** 20, r=8, p=1, maxmem=1025 ** 3, dklen=64)
    relabel = not identifier
    identifier = identifier or 'temp'  # placeholder until we have better info

apoelstra commented 1 year ago

They will be different, because the temporary identifier isn't assigned until after derived_key: identifier is still '' if not provided and only assigned 'temp' afterward, whereas if you passed 'temp' it would be used in the scrypt KDF and give a different derived_key.

Ah, yes, I see. I didn't really look at derived_key. But now that I do, I think we should also change it to include (a) user_entropy with a length prefix, (b) seed_length in a fixed-width format, and (c) identifier (or NO IDEN) in some sort of fixed-width format.

I still think we should avoid using an actual identifier for identifier if none is provided.

TBH I have no idea what Python will do with str(seed_length + k + identifier) if identifier is None or if seed_length is an integer. The text of a specification should clarify this.

BenWestgate commented 1 year ago

Yes, I think that would be much easier to follow.

I will try master_seed = bytes(convertbits(ms32_recover(ms32_payload_list), 5, 8, False)) in a few hours.

I will also make a general function ms32_convert_bytes() that converts from bytes to ms32 payloads and adds the padding checksum. Then it can be used in encode() as well as for constructing the ms32_payload_list and for converting bip32 fingerprints or the share-payload KDF hash into bech32 identifiers.

I'll also push the entire library I have to reference/python-codex32/src/lib.py so you can look at it and get oriented or make progress while I'm afk.

BenWestgate commented 1 year ago

PBKDF.

I'd use the full shares as salt, not just the payload. In particular you want to make sure k is somehow in there so that the total length is unambiguous.

We only have full shares with identifier = 'temp' at this point. This indent block is for finding the unique ID of the payload set. So all compatible sets will always get the same id. Are you suggesting using the placeholder?

ALSO, getting the default unique identifier for a distinct set of k shares should be in its own function, as this is not the only place we'll need it. (The "for an existing master seed" flow needs it as well if parameter reshare = True.)

(Currently I think you can get the same salt if you have k = 4 and a 128-bit secret as if k = 2 and a 256-bit secret. Or at least, you need to depend on the security of the HMAC to guarantee this doesn't happen, which is a more complicated argument than just having the length be unambiguous.)

That's a good point, those shouldn't give the same identifier because they're incompatible sets.

Here's what you define as a set that can be combined:

In order to recover a master seed, one needs a set of valid codex32 shares such that:

  • All shares have the same threshold value, the same identifier, and the same length.
  • All of the share index values are distinct.
  • The number of codex32 shares is exactly equal to the (common) threshold value.

Here's what we have:

  • A known threshold value, an unknown identifier, a known length.
  • Known indexes; I made these always be the first letters, as they must be consistent to give the same identifier for a combinable set.
  • A number of codex32 payloads exactly equal to the (common) threshold value.

Let's write a function that takes k, seed_length, and a list of indexed payloads with len(indexed_payloads) == k, and returns a unique bech32 identifier that will always be identical if the indexed_payloads belong to the same order-k polynomial and seed_length, regardless of which indexed_payloads are passed. It will first ms32_interpolate to get the ms32_indexed_payload for index a, c, d, ... etc. until it has k many, then convert them to bytes and PBKDF2 them along with k and seed_length:

For example:

salt=bytes('Threshold: '+k+', Length: '+str(seed_len)+'-bytes, Payloads: ', 'utf')+b''.join(fixed_index_payloads)

It may be better that the first known index used is 's', followed by 'a', 'c', 'd', ... etc., as we cannot set this identifier without knowing master_seed, because the data that goes into the hash defines a master_seed and a particular combinable indexed_payload set of it. The codex32 secret is a codex32 string like any other, so there's no reason this shouldn't also produce an ID for k = 0, a length, and the 's' indexed_payload master_seed.

What do you think?

apoelstra commented 1 year ago

We only have full shares with identifier = 'temp' at this point. This indent block is for finding the unique ID of the payload set. So all compatible sets will always get the same id. Are you suggesting using the placeholder?

Ahh, right. I am tempted to use the placeholder, yeah.

In fact, I would define an "entropy header" up-front as the Length:0x20ID:TEMPThreshold:2 string (I'm not married to the exact syntax, but what I do want is (a) no ambiguity between the identifier/no identifier case, (b) all fixed width fields) and then use this repeatedly. Both to derive the share data and then again to salt the PBKDF.

Let's write a function that takes k, seed_length, and a list of indexed payloads with len(indexed_payloads) == k, and returns a unique bech32 identifier that will always be identical if the indexed_payloads belong to the same order-k polynomial and seed_length.

Ah, yep, this sounds like a good abstraction. I think it should also take an optional identifier so that it can compute and use the "entropy header".

It will first ms32_interpolate to get the ms32_indexed_payload for index a, c, d, ... etc. until it has k many, then convert them to bytes and PBKDF2 them along with k and seed_length:

Nice. I was thinking "just use s" but you're correct that we need to encode k shares. I'm indifferent as to whether we should start with s or not. Probably that's a good idea since it emphasizes that knowing this data requires that you know s, and maybe encourages implementors to be a little more careful with it.

BenWestgate commented 1 year ago

Can you give an example info= string for each case?

  • Length:0x20ID:TEMPThreshold:2
  • Length:0x10NO IDENThreshold:3

Needs share_index or all payloads will be identical.

On fixed length: we learned the other day that it's the password in the KDF and the key in HMAC that are zero-padded (or pre-hashed) to the hash function's block size. The salt is passed as the message and can be variable length with no issue.

Did you choose hex for the length because other languages are more likely to have the length available as hex? seed_length is an int in my function, and putting 'Length: 16', 'Length: 32' or 'Length: 64' conveys the same info with fewer characters.

Here's a counter-proposal:

info = 'Share Length:'+str(seed_length)+'Threshold:'+k+'Index:'+share_index+'Indentifier:'+indentifier

The problem isn't the HMAC, message= is fine with variable length, the problem is decode() needs a valid identifier or it fails. I call decode() in my brute force error correction so it should fail on this too.

The code you posted doesn't call decode anywhere that I see.

encode() calls decode() to self-validate, sorry, I copied pieter's segwit_addr.py as closely as possible when I wrote this.

def encode(hrp, k, identifier, share_index, payload):
    """Encode a codex32 string"""
    if share_index.lower() == 's':  # add double sha256 hash byte to pad seeds
        checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    else:
        checksum = b''  # TODO: use a Reed-Solomon or BCH binary ECC code for padding.
    data = convertbits(payload + checksum, 8, 5, False)[:len(convertbits(payload, 8, 5))]
    ret = ms32_encode(hrp, [CHARSET.find(x.lower()) for x in k + identifier + share_index] + data)
    if decode(hrp, ret) == (None, None, None, None):
        return None
    return ret

apoelstra commented 1 year ago

Needs share_index or all payloads will be identical.

Ah, yep.

On fixed length: we learned the other day that it's the password in the KDF and the key in HMAC that are zero-padded (or pre-hashed) to the hash function's block size. The salt is passed as the message and can be variable length with no issue.

Fixed length is not about padding. It's to ensure that the data fed into a hash function can be uniquely parsed, without any subtle or fragile analysis.

Did you choose hex for the length because other languages are more likely to have the length available as hex? seed_length is an int in my function, and putting 'Length: 16', 'Length: 32' or 'Length: 64' conveys the same info with fewer characters.

I chose hex because it's more familiar and easier to reason about than decimal. We could drop the 0x prefix. But I guess if our lengths are always between 16 and 64 inclusive, decimal is just as easy to use.

Here's a counter-proposal:

Sounds good! Except identifier is spelled wrong :).

encode() calls decode() to self-validate, sorry, I copied pieter's segwit_addr.py as closely as possible when I wrote this.

This makes sense -- but in general, if something is not intended to be a share, I really think that it should not parse as a share.

BenWestgate commented 1 year ago

Ah, yes, I see. I didn't really look at derived_key. But now that I do, I think we should also change it to include (a) user_entropy with a length prefix, seed_length in a fixed-width format, and identifier (or NO IDEN) in some sort of fixed-width format.

If the user entropy is provided as a utf string, the length prefixing is unnecessary, 'password0' derives a different key from 'password00' and anything else they could type. If you think bytes should be able to be passed to this for some reason (what human speaks bytes?) then it may need a length prefix if it has the same padding/pre-hashing "feature/bug" as PBKDF2.

I'm quite fine with a long user_entropy getting pre-hashed (if scrypt indeed does this), in 99% of circumstances the bitcoin_core_entropy will be more random, unique and worth fully preserving, but in the worst case, the master xprv can be deserialized to swap password and salt without losing any information from either.

seed_length is already passed to scrypt, in that typo you found below:

password=bytes(user_entropy + str(seed_length) + k + identifier, "utf")

I still think we should avoid using an actual identifier for identifier if none is provided.

TBH I have no idea what Python will do with str(seed_length + k + identifier) if identifier is None or if seed_length is an integer. The text of a specification should clarify this.

apoelstra commented 1 year ago

If the user entropy is provided as a utf string, the length prefixing is unnecessary, 'password0' derives a different key from 'password00' and anything else they could type. If you think bytes should be able to be passed to this for some reason (what human speaks bytes?) then it may need a length prefix if it has the same padding/pre-hashing "feature/bug" as PBKDF2.

We should have a length prefix. Otherwise the parse-unambiguity of the data that we are parsing depends on the fixed-lengthedness of our other data, which is subtle and fragile. I also do not think the ambiguity should depend on any assumptions about user behavior.

We could also pre-hash the data, which would have the same effect.

BenWestgate commented 1 year ago

We should have a length prefix. Otherwise the parse-unambiguity of the data that we are parsing depends on the fixed-lengthedness of our other data, which is subtle and fragile. I also do not think the ambiguity should depend on any assumptions about user behavior.

You would never "parse" password; it is simply the user's keyboard mashing or coin flips or dice rolls. It is forgotten after deriving derived_key; only someone auditing their wallet would need to re-enter it on a new wallet/device.

scrypt Parameters

The scrypt function takes several parameters. The passphrase P is typically a human-chosen password. The salt is normally uniquely and randomly generated [RFC4086].

If the parameters are unintuitive, it's harder to audit. Password is designed to accept human-chosen data.

We could also pre-hash the data, which would have the same effect.

https://datatracker.ietf.org/doc/html/rfc7914.html#section-6

scrypt behaves as PBKDF2 does: it will pre-hash passwords longer than the block length of HMAC-SHA256, which is 64 bytes, and it will zero-pad shorter ones.

apoelstra commented 1 year ago

It's true you would never actually parse it. But if you're putting it into a hash function, it needs to be parseable. This is a general rule of thumb for designing cryptosystems, to ensure that it is impossible for two different things to be encoded the same way (and therefore result in the same hash).

BenWestgate commented 1 year ago

In fact, I would define an "entropy header" up-front as the Length:0x20ID:TEMPThreshold:2 string (I'm not married to the exact syntax, but what I do want is (a) no ambiguity between the identifier/no identifier case, (b) all fixed width fields) and then use this repeatedly. Both to derive the share data and then again to salt the PBKDF.

Agreed, we should set this local variable up front, reuse it, and/or pass it to the proposed get_identifier() function. Sure, it can be fixed length by filling the no-identifier case with something of equal length.

Nice. I was thinking "just use s" but you're correct that we need to encode k shares. I'm indifferent as to whether we should start with s or not. Probably that's a good idea since it emphasizes that knowing this data requires that you know s, and maybe encourages implementors to be a little more careful with it.

We're going to use 's' as the first share_index because it makes existing master seed easier. I'm also sick of parsing the sorted alphabet, so I'll be using CHARSET[i] to get the initial indexes for generating the set or computing the default identifier.

Sounds good! Except identifier is spelled wrong :).

I do this several times a day. Do you have a suggested alternative variable name, or should I just keep spell-checking? (Implementers are going to typo it too if using a new language.) I've considered id and ident, but I always changed them back.

This makes sense -- but in general, if something is not intended to be a share, I really think that it should not parse as a share.

That's why we need a function that does general bytes to ms32 converting. I'll make it soon as it gets used at least 4 different places: fingerprint>bech32 id, unique combinable share set hash > bech32 id, indexed_payload > ms32_indexed_payload for ms32_recover(), and master_seed > ms32_indexed_payload for ms32_interpolate()

apoelstra commented 1 year ago

We're going to use 's' as the first share_index because it makes existing master seed easier. I'm also sick of parsing the sorted alphabet, so I'll be using CHARSET[i] to get the initial indexes for generating the set or computing the default identifier.

Sounds good to me.

I do this several times a day. Do you have a suggested alternative variable name, or should I just keep spell-checking? (Implementers are going to typo it too if using a new language.) I've considered id and ident, but I always changed them back.

We could just misspell it, like referer in HTTP ;).

I would suggest ident or id. Either is fine by me, and both are much easier to spell. If you can't decide, let's go with ident.

That's why we need a function that does general bytes to ms32 converting.

Yep, I think that's generally useful. rust-bech32 has such a function (well, an iterator adaptor). I'm not really sure what exists for Python.

BenWestgate commented 1 year ago

I would suggest ident or id. Either is fine by me, and both are much easier to spell. If you can't decide, let's go with ident.

I will refactor to ident. It saves characters; some of the lines exceed 100 for a single operation. I had not written those lines when I rejected it previously.

That's why we need a function that does general bytes to ms32 converting.

Yep, I think that's generally useful. rust-bech32 has such a function (well, an iterator adaptor). I'm not really sure what exists for Python.

convertbits(your_bytes_to_convert, 8, 5) does it. It doesn't care whether it's given bytes or lists of ints 0-255. I modified it from the segwit_addr.py version to accept any padding as valid, rather than only zeros.

But we need to take it to the next step by always applying ECC coding as padding if pad=True and tobits=5.

def convertbits(data, frombits, tobits, pad=True):
    """General power-of-2 base conversion."""
    acc = 0
    bits = 0
    ret = []
    maxv = (1 << tobits) - 1
    max_acc = (1 << (frombits + tobits - 1)) - 1
    for value in data:
        if value < 0 or (value >> frombits):
            return None
        acc = ((acc << frombits) | value) & max_acc
        bits += frombits
        while bits >= tobits:
            bits -= tobits
            ret.append((acc >> bits) & maxv)
    if pad:
        if bits:
            ret.append((acc << (tobits - bits)) & maxv)
    elif bits >= frombits:
        return None
    return ret

Regarding the padding:

  1. All of these should be independent of the codex32 checksum to add to its correction capacity.
  2. All of these should ms32_interpolate() and remain valid checksums for the interpolated data. Linear codes???

The pad_len = 1 bit case can truncate the 2-bit code.

I'd rather @apoelstra try to do this; otherwise I'm likely to miss quality 1, 2, or both. I think 1 has something to do with the codex32 generator and the padding generator not sharing coefficients? But as for 2, ms32_interpolate()ing the 2-, 3- or 4-bit code and getting another valid code, I'm lost beyond the phrase "linear codes" and would have to trial-and-error it.

apoelstra commented 1 year ago

The 1st and 2nd bit together should guarantee detecting 1 bit-error, 2 if sequential, in 128-bit data.

This is easy; you can, for example, just xor every bit together for bit 1, and xor every other bit for bit 2.

The 3rd bit in combination with the first two should guarantee detecting 2 bit-errors, 3 if sequential and correcting 1 bit-error in 512-bit data.

I don't think this is possible because it would violate the Hamming bound (which is unlimited if you are trying to get distance 1 or 2, i.e. detect 1 error but correct none, but for distance >2 implies that for length 2^(2^n) you need at least n parity bits ... so for length >512 you need 9 parity bits).

Similarly for your goals with the 4th bit.

I think 1 has something to do with codex32 generator and padding generator not sharing coefficients?

Well, the two codes are over different fields so there's no way they can "share coefficients" and really no way to algebraically relate them at all. By the Hamming bound again I definitely can't make the 2-bit code purely additive like you want. The best we can do is make it independent of the original code (which my simple parity code is) so that if you have a random error pattern that somehow tricks the real code, with probability 3/4 it won't trick the small code.

BenWestgate commented 1 year ago

To confirm about correction:

3-bits cannot guarantee to correct a bit error in 512-bits?

4-bits cannot guarantee to correct a bit error in 256-bits?

If so, then what's the most errors 3 bits can detect in 512 bits, and 4 bits in 256?

If there are design trade offs, I think we should optimize for burst error detection.

Detecting 2 random 1-bit errors is less useful than detecting 3 or 4 sequential ones, since the codex32 error correction will create burst errors from the binary perspective when it over-corrects, or the user will if they mark an erasure or substitute non-visually-similar characters.


BenWestgate commented 1 year ago

I tried your formula, and it looks like 4 bits should be able to correct a bit error in 256-bit data.

The 3-bit code just needs to detect as many contiguous 3 bit-errors as possible.

By the Hamming bound again I definitely can't make the 2-bit code purely additive like you want.

Meaning ms32_interpolate() will produce shares with invalid ECC padding?

Is even the parity bit not additive, in this way? (I'm going to start experimenting when I get home unless you're certain.)

apoelstra commented 1 year ago

I tried your formula, and it looks like 4 bits should be able to correct a bit error in 256-bit data.

The actual formula is the Hamming bound. For binary codes q = 2, and n is the total length of your data including checksum (so 260). To correct a single error you need d = 3, so t = 1, and the formula simplifies to give a total code size of 2^260 / (1 + 260) = 2^251.97. Unfortunately not enough to fit 256 bits of data.

The approximate formula I gave was that for 256 = 2^(2^8) you would need 8 bits to get distance 3, so even the approximate one suggests that this won't work.

The 3-bit code just needs to detect as many contiguous 3 bit-errors as possible.

I think the simplest approach is:

This approach is guaranteed to detect 1 error (which is the best the Hamming bound will let us do, for low numbers of bits) and to detect n consecutive errors.

BenWestgate commented 1 year ago

ChatGPT explained like I was five that not even the parity bit stays valid after ms32_interpolate().

The problem with adding a checksum that won't interpolate is that an attacker can then guess which shares are derived and which were generated better than chance. That's why I added double_sha256 bits only to master_seed, not to every encode()'d index.

So replacing that on "s" with your parity suggestion is an improvement if what I want (interpolate into more valid binary codes) is impossible.

And in your opinion, would you apply the parity to all k generated shares; to the k - 1 generated shares and the codex32 secret; or only to the codex32 secret?

If shares, the parity only helps the user if they hold that subset, while every recovery will need the secret, so it's good to be certain the code is valid there.

How would they know they had a valid ECC share? It seems the generated share indexes have to be fixed: a, c, d, up to k or k - 1 have valid codes, and then the remaining indexes are shuffled (as those are the ones that actually reveal the size of the backup; you can infer a exists given b, from the threshold 2).

apoelstra commented 1 year ago

I'm not sure. I agree that if it's not preserved by ms32_interpolate, then this is much less useful and also a privacy loss. Will need to do some experiments since I'm not sure analytically how to approach this. Parity checks are definitely additively homomorphic (if you add two shares they're preserved) but I'm unsure how they behave under bech32 multiplication.

apoelstra commented 1 year ago

Unfortunately it looks like no, in general, this sort of parity check is not preserved by interpolation. Damn.

apoelstra commented 1 year ago

In fact -- even if you set these bits to fixed 0, after ms32_interpolate, they might not be 0!

I checked the BIP and it looks like we (correctly) say to choose these at random, but I had never considered this to be particularly important until now, and I think in my personal shares I may have set these bits to 0... (though for my personal shares I also generated them starting from S, A, C, D, in order, so I already don't have index privacy or privacy against initial/derived discriminators).

BenWestgate commented 1 year ago

In fact -- even if you set these bits to fixed 0, after ms32_interpolate, they might not be 0!

There's one last thing to try: xor every 5th bit and see what happens. That should give the exact same result as adding every symbol and truncating, so maybe IT will be ms32_interpolate()-able.

Or maybe better: xor the 4th bit of every symbol to pick the 1st bit of padding, and the 5th bit of every symbol for the 2nd, in the 128-bit case. That way we're sampling in the same period of the higher field, so it might work; this way the last, partially filled symbol doesn't get included.

Zeros do enhance the error correction of the codex32 checksum, but only for the last character of generated strings. That's kind of flimsy when the bits could instead detect errors anywhere (and enhance the chances of error correction more often).

That leaves 9 choices:

Type of padding:

  1. Zeros
  2. double_sha256
  3. parity bits

Where we should pad:

  1. k generated shares
  2. k-1 generated shares and the codex secret
  3. Only the codex32 secret

My original implementation generated len(master_seed)+1 bytes randomly for share payloads and then truncated. If generated shares are padded with coding, they need fixed indexes so the user knows they have correction help; this has no impact on redundancy-degree privacy. The derived shares still need deterministically random indexes.

If k-of-(k+1) or smaller backups are most common, k shares is best.

If n > k, then the best choice between s plus k - 1 shares vs k shares can be calculated by the number of sets in permutations(k, n) that have superior ECC. I don't think that's complex math, so the function could choose what to do based on k and n. But choosing leaks a bit of redundancy privacy, since the likelihood of finding early-alphabet shares is k/n vs (k-1)/n, and which of those it is suggests a lower or upper bound for n, since it was a function of k, versus perfectly private odds if all shares are generated with random padding and indexes.

BenWestgate commented 1 year ago

Basically I'm suggesting calculating the parity symbol in GF(32) excluding the padding symbol, then truncating its first bits to append as padding, and seeing whether that will interpolate and remain a parity symbol for every share.

And then this brings up question 1 again: is it orthogonal to the codex32 checksum?

apoelstra commented 1 year ago

Intuitively this should work, since you're just adding the symbols in gf32 and "truncating" which is actually taking the value mod x^2 which is linear ... except that these bits aren't sitting isolated outside of the string. They're embedded inside of the last character, where they algebraically interact with the other bits of that character. So I don't think it works.

That is, we're not saying (x1 + x2 + x3) mod x^2 = Y, we're saying (x1 + x2 + x3 mod x^2) + x4 = Y. The former would be compatible with interpolation, the latter won't be.

BenWestgate commented 1 year ago

When considering among these:

  1. k generated shares
  2. k-1 generated shares and the codex secret
  3. Only the codex32 secret

I realize that from pure error-detection ability, 1 and 2 are strictly superior, regardless of which shares the user wants to recover with, because they represent k * pad_len total checksum bits as opposed to 2-4. Here the non-linearity works to our advantage, as threshold shares will always be able to interpolate back to indices 16, 0, ... k - 1 and attempt corrections to make all k of these codes valid before checking an address.

Index 16 "s" is padded first so the unshared secret can also benefit from ecc padding.

Finally, fixing the index of the ECC-padded generated shares doesn't have to have any privacy impact at all (besides sometimes revealing this scheme is in use, which is fine if it's the standard) if the share set we output is randomly shuffled from the 31 total.

It's not necessary to hold a share with or without this padding in order to use it: once you have k shares, you can interpolate to the k padded strings.

These strings are all independent, so there are still no error-correction guarantees, but my instinct says tripling the bits for k = 3 gives 1/3 the chance of not detecting an error pattern.

The actual formula is the Hamming bound. For binary codes q = 2, and n is the total length of your data including checksum (so 260). To correct a single error you need d = 3, so t = 1, and the formula simplifies to give a total code size of 2^260 / (1 + 260) = 2^251.97. Unfortunately not enough to fit 256 bits of data.

By this formula, k = 4 with 128-bit shares and k = 3 with 256-bit shares may be able to correct one bit error, if the other k - 1 strings are valid and correct (a reasonable assumption if the codex32 checksum is valid).

BenWestgate commented 1 year ago

Look at this absolute masterpiece:

def fingerprint_identifier(payload_list):
    from electrum.bip32 import BIP32Node
    from electrum.crypto import hash_160
    pubkeys = b''
    for data in payload_list:
        root_node = BIP32Node.from_rootseed(data, xtype="standard")
        pubkeys += root_node.eckey.get_public_key_bytes(compressed=True)

    return ''.join([CHARSET[d] for d in convertbits(hash_160(pubkeys), 8, 5)])[:4]

Where payload_list is each payload in bytes at index 16, then 0, ... k - 1.

For k = 0 or reshare=False, [master_seed] is passed as payload_list and it returns the same fingerprint that is prepended in its own descriptors. :) For reshare=True it forms a unique combinable-share-set ident regardless of seed_length and k, thanks to the pre-hashing. And it's expensive, and can further help error-detect the k generated shares from above.

BenWestgate commented 1 year ago

Design decision fork:

Should re-shares require new app_entropy and user_entropy to create a fresh derived_key or be determined by master_seed and header alone?

master_seed and header alone means:

  1. a given master_seed has just 2^23 possible corresponding share sets (all thresholds)
  2. shares can be re-derived from master_seed rather than being independent unrecoverable secrets of their own
  3. requires hardening share payloads to be costlier to re-derive than addresses
  4. excellent error detection
  5. re-shares require no new entropy besides selecting a unique ident
  6. ident becomes a nonce to determine share payloads, vs generated payloads determine the ident above
  7. guarantees no collisions of incombinable sets for re-shared master_seed + unique ident vs hash_160(pubkeys) is best effort basis
  8. runs risk of accidentally re-using an ident when re-sharing, breaking security
  9. birthday problem is over 2^20 ident rather than avoided by generating fresh entropy for the new shares
  10. Easier to audit
  11. Less steps required to re-share

If we go with header and master_seed alone: probably rename ident to unique_id or unique_ident, to emphasize how critical it is that it never be reused for the same master_seed unless deliberately trying to add shares to an existing backup.

If we're okay with a risk of share-set collisions when the user doesn't know all previously used identifiers needed to guarantee a unique ident, then fresh_master_seed() could simplify to this:

def fresh_master_seed(bitcoin_core_entropy, user_entropy = '', seed_length = 16, k = '2', ident = ''):
    """Derive a fresh master seed of seed_length bytes with optional user-provided entropy."""
    import hashlib
    if seed_length < 16 or seed_length > 64:
        return None
    if int(k) < 2 or int(k) > 9:
        return None
    master_seed = hashlib.scrypt(password=bytes(user_entropy + str(seed_length) + k + ident, "utf"),
                                 salt=bytes(bitcoin_core_entropy, "utf"), n=2 ** 20, r=8, p=1, maxmem=1025 ** 3, dklen=seed_length)
    if not ident:
        ident = fingerprint_identifier([master_seed])
    return encode('ms', k, ident, 's', master_seed)

Really, it's fresh_codex32_secret() but it could also return (master_seed, ident) to keep the current name accurate.

It's not terrible if fresh entropy creates an ident that matches an existing one (1% chance after 146 sets), because the sets have independent payloads and an incorrectly recovered secret should fail the hash_160(pubkeys) last-resort error detection. But it is bad (as discussed earlier) if an ident nonce is accidentally reused to re-share a master_seed, which has the same odds as above.

In the event a previously used ident is not known, the user has to generate fresh entropy anyway (to minimize collision odds). So using the header as a nonce is really only advantageous when all previous idents are known, which is always true for the first re-share.

Because the small identifier was not intended to be used as a nonce (or it'd be 32 or 64 bits), this cannot be reduced to insignificant odds: ≈ 2 * 10^-6 is the chance of two identifiers colliding when selected randomly. There's a reason the bech32 checksum is 6 characters. If everyone on earth made a codex32 backup and re-shared an existing master_seed twice, all forgetting the original ident used (having a default may help extend this to a 3rd re-share, since we can assume fresh_codex32_secret() used the bip32 fingerprint), 16,000 of them are going to accidentally re-share the same share set even though they did everything correctly (except forgetting the old ident). If the identifier were increased to 6 characters, this could be 16 self-victims.

I'm also open to a combination approach: use master_seed and header when the user can affirm the new ident is absolutely unique, but require fresh entropy and make a new derived_key when they cannot give such assurance. This minimizes the risk of colliding share sets by having 2^((k - 1) * seed_length) possible re-share sets instead of just 2^20, while also minimizing labor and complexity when the user is absolutely sure they picked a unique id, which is true for at least the first re-share and sometimes the 2nd.

ChatGPT informs me that appending a unix timestamp to the header before generating new shares for an existing master_seed will prevent share-payload collisions without needing to ask the user for fresh entropy. The time used should be displayed unconditionally for auditing purposes, though. Malicious implementations could reuse a previous timestamp, so it's better to display a human-readable date-time. Precision down to the minute is enough, as no one will write and confirm multiple re-shares per minute. Or maybe just the date is enough, which has better odds of being recalled if there's a need to recreate an existing re-share set.

Make the call here and I'll code for_an_existing_master_seed().

apoelstra commented 1 year ago

By this formula, k = 4 with 128-bit shares and k = 3 with 256-bit shares may be able to correct one bit error, if the other k - 1 strings are valid and correct (a reasonable assumption if the codex32 checksum is valid).

This error correction code would only work for a single blessed share set, not arbitrary sets of k. And it would be visible to a privacy attacker which share set was blessed. And it's still not guaranteed that the error we can correct will be distinct from the errors that the full codex32 code can correct. TBH I think this line of reasoning is just too much complexity for too little benefit and we should just use the bits unused.

Look at this absolute masterpiece:

Heh, the overlap in the k=0 case with BIP32 is pretty nice. And it is easy to implement in terms of off-the-shelf BIP32 methods, even though the bare algorithm feels a little overcomplicated.

Should re-shares require new app_entropy and user_entropy to create a fresh derived_key or be determined by master_seed and header alone?

Great question. Thanks for the very thorough post. I am leaning toward the "stick an external counter/timestamp/weak randomness in there" approach. Requiring user_entropy is too much burden on the user (as is requiring the user to keep track of uniqueness, if we were considering that) but I think it's fine to require the app to generate a unique token. I think our specific recommendation should be: either (a) a monotonic counter, or (b) at least 12 bytes (96 bits) of fresh randomness. The UNIX timestamp isn't a great choice because it only changes every second, and it won't be accessible to most HWWs.
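For concreteness, a minimal sketch of the two options (the 12-byte figure comes from the recommendation above; the counter helpers are hypothetical):

import secrets

# (b) at least 12 bytes (96 bits) of fresh randomness per re-share
reshare_token = secrets.token_bytes(12)

# (a) alternatively, a monotonic counter persisted by the wallet:
# counter = load_counter() + 1; save_counter(counter)  # hypothetical helpers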

Regarding error detection abilities: my feeling is that resharing will be fairly uncommon. So I think there's a reasonable amount of value to having error detection in the initial sharing but greatly diminished value in subsequent sharings. Remember that the checksum itself provides overwhelming protection against random errors; the fingerprint detection gives us the additional ability to detect malicious errors, which is itself a fringe/uncommon case. And of course, we always have this ability if we're willing to derive full addresses.

So basically, if we use the bip32 fingerprint for the initial sharing, this gives us a bunch of nice benefits (cheap/local error detection mainly) at basically no cost. But for re-sharing, the costs become nontrivial ... and especially, there is a risk of ident reuse leading to share set reuse, which is probable enough that it'll happen to a nonzero number of users.

As a further point, even with the master_seed/header scheme, the error detection is not great. We no longer have the BIP32 fingerprint to compare against. Instead we have a weird hash of the share data itself, which is not something the wallet can produce on demand if the user has a single share they want to check on. Plus it's extra implementation complexity, reduced auditability, etc. So the benefit isn't even as big.

(I had a much longer reply suggesting the opposite, which took me half an hour to write but I decided that I disagreed with it :). This is quite a time-consuming exercise..)

BenWestgate commented 1 year ago

This error correction code would only work for a single blessed share set, not arbitrary sets of k. And it would be visible to a privacy attacker which share set was blessed. And it's still not guaranteed that the error we can correct will be distinct from the errors that the full codex32 code can correct. TBH I think this line of reasoning is just too much complexity for too little benefit and we should just leave the bits unused.

This will be my last question on the padding shares topic:

If secret S and shares A and C have ECC padding, the threshold is 3, and the user finds D, E, and F but F has an uncorrectable error, can he ms32_interpolate() each candidate over-correction of share F together with D and E to produce secret S, share A, and share C, and then check whether their ECC padding is valid, to save work?

Look at this absolute masterpiece:

Heh, the overlap in the k=0 case with BIP32 is pretty nice. And it is easy to implement in terms of off-the-shelf BIP32 methods, even though the bare algorithm feels a little overcomplicated.

I could have concatenated them as seeds and used calc_fingerprint_of_this_node(), which is a method in every bip32 library, but that is ambiguous across different seed_lengths. It doesn't matter, as the ident exists to differentiate incombinable share sets, and a length difference does that for free.

I have an even better idea for the re-share ident, however: encrypt the bip32 fingerprint with the birthdate/timestamp.

Should re-shares require new app_entropy and user_entropy to create a fresh derived_key or be determined by master_seed and header alone?

I am leaning toward the "stick an external counter/timestamp/weak randomness in there" approach.

I coded the unique salt for the derived_key as follows:

import datetime
import hashlib

# date used to create a unique nonce if the user forgets any past identifiers
date = datetime.date.today().strftime("%Y%m%d") if forgot else '19700101'
# must be displayed unconditionally for auditing purposes if forgot=True
fingerprint = fingerprint_ident([master_seed])
# pbkdf2_hmac needs bytes, so encode the string salt components
salt = (codex32_secret[:9] + fingerprint + date).encode()
derived_key = hashlib.pbkdf2_hmac('sha512', password=master_seed, salt=salt,
                                  iterations=2048, dklen=64)

I used the 8-digit year date with zero padding to maximize the chances someone would be able to recover the salt; if someone forgets an identifier within 24 hours and reuses it, they don't deserve security.

However, I can imagine someone performing this operation more than once in a day, so it might need precision down to the hour or minute, although that makes it harder to regenerate this particular set in the future.

I'd appreciate your feedback on the precision of the timestamp (a trade-off between ease of repeat generation and avoiding nonce reuse) and on whether the timestamp should just be used by default (it adds extra info to remember/grind to regenerate and audit, but would cover the "forgot I forgot" worst-case scenario), eliminating the forgot parameter.

I also changed fingerprint_ident() to not truncate, so it adds the whole hash_160 to the salt: a free slowdown if master_seed is partially leaked and being brute forced against known shares.

Requiring user_entropy is too much burden on the user (as is requiring the user to keep track of uniqueness, if we were considering that) but I think it's fine to require the app to generate a unique token. I think our specific recommendation should be: either (a) a monotonic counter, or (b) at least 12 bytes (96 bits) of fresh randomness. The UNIX timestamp isn't a great choice because it only changes every second, and it won't be accessible to most HWWs.

Who can write and confirm 2 codex32 share backups in 1 second? If HWWs don't have a clock, why don't we just ask the user for the date (or another unique string they've never typed before, which avoids trusting the hardware not to leak data via a chosen nonce)? And if the hardware has a clock, display the date used unconditionally so it can be audited.

So I think there's a reasonable amount of value to having error detection in the initial sharing but greatly diminished value in subsequent sharings.

I think we can now get it in both, with encryption, by simply remembering or recovering the nonce used when re-sharing.

I wrote existing_master_seed() so the master_seed provides all the entropy, even for re-shares. It's just simpler this way. Using app entropy + user entropy, and deriving the shares, index_seed, and master_seed from that, can never regenerate a share set without a threshold of that particular set's shares. By contrast, with master_seed as the "seed" of the share set, shares can be added to other re-share sets without access to a threshold of the set: knowing their threshold and identifier is enough, and possibly now, their birthdate.

(which we can brute force out of the new re-share identifier, given a known bip32 fingerprint, if you don't make it too precise!)

So basically, if we use the bip32 fingerprint for the initial sharing, this gives us a bunch of nice benefits (cheap/local error detection mainly) at basically no cost. But for re-sharing, the costs become nontrivial ... and especially, there is a risk of ident reuse leading to share set reuse, which is probable enough that it'll happen to a nonzero number of users.

As a further point, even with the master_seed/header scheme, the error detection is not great. We no longer have the BIP32 fingerprint to compare against. Instead we have a weird hash of the share data itself, which is not something the wallet can produce on demand if the user has a single share they want to check on. Plus it's extra implementation complexity, reduced auditability, etc. So the benefit isn't even as big.

Why don't we encrypt the bip32 fingerprint with the timestamp (the unique new entropy used for the re-share)? Those are two things the wallet/user can probably produce on demand (or brute force).

That way we get a random ident and don't lose the error detection the fingerprint provides if we know the birth date.

I think this is a MUCH better idea than that convoluted concatenation of share payloads, and because this "encrypted" fingerprint feeds into the derived_key derivation, it still shuffles the indexes and generates new randomness for the payloads just fine.

BenWestgate commented 1 year ago
    import datetime
    import hashlib

    date = datetime.date.today().strftime("%Y%m%d") if forgot else '19700101'
    # must be displayed unconditionally for auditing purposes if forgot=True
    fingerprint = fingerprint_ident([master_seed])
    # pbkdf2_hmac needs bytes, so encode the string salt components
    salt = (codex32_secret[:9] + fingerprint + date).encode()
    derived_key = hashlib.pbkdf2_hmac('sha512', password=master_seed, salt=salt,
                                      iterations=2048, dklen=64)
    if reshare:
        encryption_key = hashlib.pbkdf2_hmac('sha512', password=date.encode(),
                                             salt=salt, iterations=2048, dklen=3)
        # xor the 20-bit ident (as an integer) with the first 20 bits of the key
        new_ident = old_ident ^ (int.from_bytes(encryption_key, 'big') >> 4)

Optionally, the hash_160(master_pubkey) may be dropped from the salt, as the wallet might not have access to it. The identifier can then be decrypted / error-detected given the correct fingerprint, the header used to generate the re-share, and the birthdate.

apoelstra commented 1 year ago

If secret S and shares A and C have ECC padding, the threshold is 3, and the user finds D, E, and F but F has an uncorrectable error, can he ms32_interpolate() each candidate over-correction of share F together with D and E to produce secret S, share A, and share C, and then check whether their ECC padding is valid, to save work?

I'm not sure what this means but I don't think so.

I used the 8-digit year date with zero padding to maximize the chances someone would be able to recover the salt; if someone forgets an identifier within 24 hours and reuses it, they don't deserve security.

They don't need to forget it to reuse it. They need to fail to actively override their wallet's default ident computation; overriding requires being aware of this "24 hour" rule that doesn't exist in any other scheme. I think this is too much of a burden on users. And yes, users will want to do this more than once per day (likely more than once per second, if their hardware can compute fast enough).

You also still have the problem that HWWs don't know the time, even to 24-hour precision.

Who can write and confirm 2 codex32 share backups in 1 second?

You don't need to write and confirm them before you start generating the next set. You might generate all the sets you need and then write/confirm them all.

If HWWs don't have a clock, why don't we just ask the user for the date (or another unique string they've never typed before, which avoids trusting the hardware not to leak data via a chosen nonce)? And if the hardware has a clock, display the date used unconditionally so it can be audited.

Because this is extra work for the user and has a nonobvious purpose (we are asking for the current date, which seems like an unnecessary and privacy-reducing thing to answer honestly, when actually we just want a unique string). And we can accomplish it by just generating random bytes.

I really don't like any of these date-based ideas.

BenWestgate commented 1 year ago

They need to fail to actively override their wallet's default ident computation...

A wallet implementation should store the identifiers and bip32 fingerprints, ideally persistently, so that it never reuses them: perhaps incrementing the ident's LSB or throwing an error message on collisions.
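A sketch of what I mean (the persistence helper is hypothetical):

def register_ident(ident: str, used_idents: set) -> None:
    """Refuse to silently reuse an identifier; used_idents is wallet metadata."""
    if ident in used_idents:
        raise ValueError(f"identifier {ident!r} already used, pick a fresh one")
    used_idents.add(ident)
    save_wallet_metadata("used_idents", used_idents)  # hypothetical persistence helper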

You also still have the problem that HWWs don't know the time, even to 24-hour precision.

Looks like HWWs need (optional?) user-provided uniqueness during re-shares if all past IDs aren't stored/remembered.

It's the only auditable way. But the odds that they'll input a unique string are higher than if they directly chose a new ident.

You don't need to write and confirm them before you start generating the next set. You might generate all the sets you need and then write/confirm them all.

Without a persistent counter, a complete record of previously used identifiers, or a unique input, ident uniqueness and determinism for auditability cannot be simultaneously guaranteed (to 1/2^10).

but actually we just want a unique string). And we can accomplish it by just generating random bytes.

Agree completely it should be a "unique string"; this is a strange hassle to protect users from forgetfulness. But on the other hand, it's less work than for a fresh master seed: 20 bits of unique user entropy fully protects the re-share, vs 128 bits for fresh seeds.

It's unsafe to let a wallet alone generate a random nonce for re-shares; it could grind for ones that make your shares and/or the ident leak parts of the secret.

If the random bytes are generated by KDF/HMAC(app_entropy, user_unique_string), with the app entropy unconditionally displayed before the user types:

This is totally fine; the string "42" or "we" may be unique enough for the purpose.
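A sketch of that commit-then-type flow (names and lengths are illustrative; the PBKDF2 hardening matches the edit mentioned below):

import hashlib
import secrets

# the app commits to its entropy by displaying it before the user types
app_entropy = secrets.token_bytes(16)
print("App entropy (record this first):", app_entropy.hex())
user_unique_string = input("Enter a string you have never typed before: ")
# PBKDF2 hardening slows a malicious wallet grinding over nonces
nonce = hashlib.pbkdf2_hmac('sha512', user_unique_string.encode(),
                            app_entropy, iterations=2048, dklen=12)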

Anything never typed before is unique to within 1/2^10; perfect uniqueness is only achieved by storing and/or remembering every ident.

I really don't like any of these date-based ideas.

We can ditch the date; all we need is a unique input.

apoelstra commented 1 year ago

A wallet implementation should store the identifiers and bip32 fingerprints, ideally persistently

Then these would need to be backed up along with the seed data?

If we just used 96 random bits there would be no need for anyone to store anything, persistently or otherwise.

Without a persistent counter, a complete record of previously used identifiers, or a unique input, ident uniqueness and determinism for auditability cannot be simultaneously guaranteed (to 1/2^10).

Right. We should just drop auditability for re-sharing. It's not a common case and all the proposed solutions seem to have serious usability problems.

If the random bytes are generated by HMAC(app_entropy, user_unique_string) with app entropy unconditionally displayed before the user types

This seems good to me. Though I do think we should query for "a unique string" rather than trying to give it any other semantic content. And maybe suggest the user roll dice ten times to get the unique string, or something. If it's too short/predictable the wallet could theoretically still grind.
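For reference, a quick check of what ten dice rolls buy, assuming fair six-sided dice:

import math
print(10 * math.log2(6))  # ~25.85 bits, comfortably above the 20-bit ident space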

BenWestgate commented 1 year ago

Then these would need to be backed up along with the seed data?

In a digital backup, yes. On paper or metal, no.

The list helps prevent ident reuse, but it's not essential data; treat it like wallet metadata (e.g., data that prevents address reuse).

It's okay for idents to collide; they're likely to about every 2^10 sets generated. It's only bad if payloads collide.

I edited that to mention the app_entropy and unique_user_string should be hardened by PBKDF2 to slow grinding.

How about encrypting (still using PBKDF2) the bip32 fingerprint by the unique string and using cipher text as new_ident?

Then we don't lose the error correction and other benefits, as long as the user remembers the unique string.

apoelstra commented 1 year ago

How about encrypting (still using PBKDF2) the bip32 fingerprint by the unique string and using cipher text as new_ident?

This is a neat idea. Yeah, let's do it. No need to use fancy encryption; just xor with the first 20 bits is fine.

BenWestgate commented 1 year ago

This is a neat idea. Yeah, let's do it. No need to use fancy encryption; just xor with the first 20 bits is fine.

XOR is fine as the encryption algorithm, but the "unique string" shouldn't be recoverable by someone who knows the bip32 fingerprint and the ciphertext, so PBKDF2("unique string") is needed for key derivation.

There's a non-zero chance people would enter names, birthdates, street numbers, or PINs as a "unique string" and give a thief extra help.

Share payloads, unlike new_ident, would incorporate both the app entropy and the unique_string, so those will never collide (if the RNG is honest) even if the ident collides.

apoelstra commented 1 year ago

Yep, agreed -- we'd xor with the (truncated) output of the PBKDF2.
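A minimal sketch of the scheme as agreed, assuming a 4-character bech32 ident and illustrative salt/iteration choices:

import hashlib

CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"  # bech32 alphabet

def reshare_ident(fingerprint_ident: str, unique_string: str, salt: bytes) -> str:
    """XOR the 20-bit fingerprint ident with the first 20 bits of PBKDF2 output."""
    key = hashlib.pbkdf2_hmac('sha512', unique_string.encode(), salt,
                              iterations=2048, dklen=3)
    mask = int.from_bytes(key, 'big') >> 4   # keep the first 20 of 24 bits
    ident = 0
    for char in fingerprint_ident:           # 4 bech32 chars -> 20-bit integer
        ident = (ident << 5) | CHARSET.index(char)
    new_ident = ident ^ mask
    return ''.join(CHARSET[(new_ident >> shift) & 31] for shift in (15, 10, 5, 0))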