Account ID structure updates

bobbinth commented 1 year ago

This issues combines several proposed changes to the account ID structure as they are all likely to be implemented together. These changes are:

Reduce account ID size to 60 bits (originally described in #104). This could be done by requiring that during the grinding process the 4 least significant bits of the account ID are set to all 0s.
Expand account storage mode options (originally described in #139). This would give special meaning to the 4th most significant bit of the account ID.
Change how account ID seed is computed (originally described in #87). Specifically, we'd compute the value of the seed as seed = hash(code_root, storage_root, nonce) and then seed[3] becomes the account ID.

Given the above changes, we should probably update the grinding conditions to work as follows:

For regular accounts 16 least significant bits of seed[0] must be zeros.
For faucet accounts 24 least significant bits of seed[0] must be zeros.

This is because we now require grinding on 8 bits within the account ID itself (4 most significant bits for type/storage mode, and 4 least significant bits for reducing ID length to 60 bits). Thus, the total still remains 24 and 32 bits of grinding for regular and faucet accounts respectively.

bobbinth commented 4 months ago

Points 2 and 3 listed above are actually done now (for point 2, we did add an extra bit but for now left it unused). So, the only thing remaining here is reducing the ID size to 60 bits and adjusting PoW threshold accordingly.

bobbinth commented 4 months ago

One of the negative aspects of the current account ID generation scheme is that we kind of "waste" the work that goes into generating the IDs (it is not technically wasted as we do get more security from the operators trying to highjack an account ID, but that's just "one-time" effect). There are some ways in which we could make this work useful - for example:

We could reduce the size of the ID to make it easier to remember.
In case of faucet accounts, we could encode the ticker symbol directly into the ID (as proposed here).

Below is a proposal on how we could alter the account ID encoding scheme to make the PoW more useful.

First, we still maintain the basics of the current scheme in that to derive an account ID we start with account_seed, code_root, and storage_root and compute the ID as follows:

let digest = hash(account_seed || code_root || storage_root);
let account_id = digest[0];

Then, we'd impose the following rules:

For non-faucet accounts, the last 16 bits of the iD must be zeros. The first 2 bits would still define the storage mode of the account (i.e., public, private), and the 3rd bit would define whether the account code is mutable or not.
- This effectively means that the user needs to grind about $2^{19}$ seeds to generate an ID, which on modern smartphone/laptop CPUs should be between 1 - 3 seconds.
For faucet accounts, the last 32 bits of the ID must match the last 32 bits of account_seed[0]. The first 2 bits of the ID would also define account storage mode, and the 3rd bit would specify whether the faucet is for fungible or non-fungible assets. The lower 32 bits would be subject to additional rules described below.
- This effectively means that the user needs to grind about $2^{35}$ seeds to generate a faucet ID, which on modern server CPUs (e.g., 64 cores) should take about 30 mins.

The additional rules for the lower 32 bits of a faucet ID are as follows:

The 4 least significant bits must be 0s.
The remaining bits encode a ticker symbol which can be up to 6 letters long. The alphabet for the ticker consists of 26 letters + 0 symbol (i.e., 27 characters total).
The first character in the ticker cannot be 0. This guarantees that the lower 16 bits of faucet ID will be non-zero.

Some benefits of the above scheme:

Inside the VM, we can pretty easily determine whether an account is a faucet or not - if the lower 16 bits are zeros, it is a non-faucet account, otherwise it is a faucet account.
For non-faucet accounts, the IDs are pretty short 8 - 12 characters depending on encoding (this is within range of memorization for most people). a. For faucet accounts they are also not that long: 10 - 15 chars depending on encoding. And because faucet and non-faucet IDs have different lengths, it should be relatively easy to tel them apart visually.
For faucet accounts, ticker symbol is encoded in the ID and so no lookups are needed to determine the ticker. This also extends to assets (since faucet ID is a part of each asset). That is, by looking at the asset one can immediately determine its ticker symbol.

Potential downsides:

Ticker encoding scheme is too rigid and is the same for fungible and non-fungible assets.
Encoding ticker into the faucet ID / asset may not be a good idea since we can't guarantee that a ticker is globally unique (i.e., two faucets may have the same ticker). The counter-argument to this is that this is no different from storing ticker symbol in storage slot 1 (as we do now).

hackaugusto commented 4 months ago

it is not technically wasted as we do get more security from the operators trying to highjack an account ID, but that's just "one-time" effect

As a side note, I always get confused by this. The account id adds to the security of the hash. That is to say the attacker needs to both, find a result with enough PoW, and which also matches the target account id, for a total of 80bits (~64bits for account id + 16 of Pow). Which is a great place to be in, the user can use a cellphone to generate a valid account in seconds, but an attacker would be in a bad spot to find a match with the same account id.

With that said, we probably should still make this configurable, to maintain the cost in the infeasible area as hash power increases (80bits should be plenty, but it is in the low end). Ideally this should be done in such a way that a user can change the setting, and the protocol is agnostic to it. This has the benefit of giving control to the user, and removing the need for the protocol to estimate or agree on hash power.

We can pack the PoW requirements in the account id. Since we use one of the Felts in the Digest to encode the AccountId, we have a max of 3 Felts to do PoW for a total of 192bits, meaning we need a max of 8bits to encode the PoW requirements, that leaves about 53bits for the account itself, that is about $10^{16}$ accounts (the 8bits would also be part of the PoW, so the maximum PoW would be 200bits)

Edit: This could also be a nice way of getting rid of the patching for the tests. Meaning we can just configure the test to have a very low PoW requirement, and assume users will actually use a high value. The downside of this is that we don't guarantee security, only allow for it, and a improperly configured client could result into problems (not sure how bad this is, it is basically always the case)

hackaugusto commented 4 months ago

For non-faucet accounts, the last 16 bits of the iD must be zeros

We could also pack a protocol version in these bits. Instead of using all zeros, we could use something like chain_id. The benefit here is one of sanity check, I'm not sure if replay attacks would be a thing, because I assume the user would still control the account anyways. But it feels this hasn't a significant downside.

hackaugusto commented 4 months ago

For faucet accounts, the last 32 bits of the ID must match the last 32 bits of account_seed[0]

Maybe lets make this a first class concept and add a metadata field to the hash? so it would be hash(account_seed || code_root || storage_root || metadata) instead. We can make the input a multiple of the hasher rate by changing the size of account_seed. (Edit: And I think the seed size should be at a minimum a word)

Ticker encoding scheme is too rigid and is the same for fungible and non-fungible assets.

A few notes about the ticker:

If we are going to add it as part of the protocol, it may makes sense to make it unique, to prevent spoofing attacks, for example, we should probably have single ETH account id.
We don't allow code changes to faucets, but that doesn't mean a project won't want to upgrade. If we have the above to prevent spoofing, we should probably add a version along side the ticker, and a way to authenticate that a new account is indeed an upgrade of an older version.
The spoofing prevention would also mean there is a race, and the first to deploy the ETH would control that, which can be a huge risk. I'm not sure how to secure against this. We could have a blessed bridge, and then we would have the ETH from the bridge and possible the native ETH, and we would then rely on UI to make it clear to the user which is which. We could try to actively preregister all the known names, which would cover most of the existing cases, but leave the future ones open.
- Perhaps we should add extra metadata, things like denomination names, longer names, project URL, and some primitive to authenticate the account (instead of the uniqueness requirement mentioned above). The data doesn't need to be necessarily stored in the rollup, but maybe its hash would be useful.

bobbinth commented 4 months ago

With that said, we probably should still make this configurable, to maintain the cost in the infeasible area as hash power increases (80bits should be plenty, but it is in the low end). Ideally this should be done in such a way that a user can change the setting, and the protocol is agnostic to it. This has the benefit of giving control to the user, and removing the need for the protocol to estimate or agree on hash power.

We can pack the PoW requirements in the account id. Since we use one of the Felts in the Digest to encode the AccountId, we have a max of 3 Felts to do PoW for a total of 192bits, meaning we need a max of 8bits to encode the PoW requirements, that leaves about 53bits for the account itself, that is about 1016 accounts (the 8bits would also be part of the PoW, so the maximum PoW would be 200bits)

I like it! I don't think we need 8 bits though - 6 bits should be enough. With 6 bits we would get up to ~128 bits PoW (64 bits from the account ID itself and up to 64 bits from the extra PoW specified via these 6 bits.

We could also pack a protocol version in these bits. Instead of using all zeros, we could use something like chain_id. The benefit here is one of sanity check, I'm not sure if replay attacks would be a thing, because I assume the user would still control the account anyways. But it feels this hasn't a significant downside.

This would be a bit more tricky as we probably can't "allocate" too many bits for this (e.g., probably not much more than ~8). I'm also not sure this is needed: if I can create an account ID for a given combination of storage and code on one chain, it may actually be convenient to create the same account ID for the same combination on a different chain.

Maybe lets make this a first class concept and add a metadata field to the hash? so it would be hash(account_seed || code_root || storage_root || metadata) instead. We can make the input a multiple of the hasher rate by changing the size of account_seed. (Edit: And I think the seed size should be at a minimum a word)

Yep - agreed. I'd probably call it something like account_config rather than metadata. Also, I'm not sure if having account seed be a word is needed. I think 128 bits (2 field elements) may be sufficient for the seed. @Al-Kindi-0 - what do you think?

If we are going to add it as part of the protocol, it may makes sense to make it unique, to prevent spoofing attacks, for example, we should probably have single ETH account id.

We don't allow code changes to faucets, but that doesn't mean a project won't want to upgrade. If we have the above to prevent spoofing, we should probably add a version along side the ticker, and a way to authenticate that a new account is indeed an upgrade of an older version.

The spoofing prevention would also mean there is a race, and the first to deploy the ETH would control that, which can be a huge risk. I'm not sure how to secure against this. We could have a blessed bridge, and then we would have the ETH from the bridge and possible the native ETH, and we would then rely on UI to make it clear to the user which is which. We could try to actively preregister all the known names, which would cover most of the existing cases, but leave the future ones open.

I thought about this, but think it would be too risky specifically because of your last point: we don't want to mediate who gets ETH, BTC etc. tickers and we don't want to leave it on the "first-come-first-served" basis.

Perhaps we should add extra metadata, things like denomination names, longer names, project URL, and some primitive to authenticate the account (instead of the uniqueness requirement mentioned above). The data doesn't need to be necessarily stored in the rollup, but maybe its hash would be useful.

We have some of this already. For the sample faucet contract, some of this info is stored in slot 1 of the account storage (this currently also includes the ticker symbol).

Edit: This could also be a nice way of getting rid of the patching for the tests. Meaning we can just configure the test to have a very low PoW requirement, and assume users will actually use a high value. The downside of this is that we don't guarantee security, only allow for it, and a improperly configured client could result into problems (not sure how bad this is, it is basically always the case)

This could work for regular accounts but if we do want to encode extra data into faucet IDs, we would still need to have special treatment for them.

bobbinth commented 4 months ago

As a side note, I always get confused by this. The account id adds to the security of the hash. That is to say the attacker needs to both, find a result with enough PoW, and which also matches the target account id, for a total of 80bits (~64bits for account id + 16 of Pow). Which is a great place to be in, the user can use a cellphone to generate a valid account in seconds, but an attacker would be in a bad spot to find a match with the same account id.

Also, I just realized that the scheme I proposed for regular account IDs actually doesn't give us any extra security beyond 64 bits. It does create extra work for the original account creator to find an ID with 16 trailing zeros, but for the attacker, they jus need to find a 64-bit pre-image and if they use random search, it doesn't matter whether the last 16 bits must be 0 or not.

Same thing actually applies to encoding ticker symbols. Basically, encoding extra info into the ID itself, does require extra work for the user, but I don't think it provides any additional security against a potential attacker. Not sure if there is a way around this yet. @Al-Kindi-0 - curious what you think.

hackaugusto commented 4 months ago

I think 128 bits (2 field elements) may be sufficient for the seed. @Al-Kindi-0 - what do you think?

To expand on my comment, hash is a one-to-one mapping, imagine the worst hash function possible, something like the id function, changing only 2 field elements of the domain changes only 2 field elements in the image, always leaving the top bits of the digest set to 0, which is to say, from the possible $2^{256}$ output values it covers only $2^{128}$, this would be true for a CHF too. IOW, if we only have 2 field elements for the seed, we are guaranteed to not cover the codomain, having the seed size being at least the digest size would make it possible to cover the codomain (but AFAIK there is not guarantee). I have no idea how bad that is, specially because the property that we want is not to cover the codomain, but to have an image that satisfy our criteria, it would likely not be a real issue in practice, but I don't think there are downsides to having the seed being bigger (?)

Al-Kindi-0 commented 4 months ago

Also, I just realized that the scheme I proposed for regular account IDs actually doesn't give us any extra security beyond 64 bits. It does create extra work for the original account creator to find an ID with 16 trailing zeros, but for the attacker, they jus need to find a 64-bit pre-image and if they use random search, it doesn't matter whether the last 16 bits must be 0 or not.

But wouldn't the pre-image found by the attacker go through the "the last 16 bits must be 0"-check at some point for this to be useful for the attacker? Or does any pre-image do?

Al-Kindi-0 commented 4 months ago

Yep - agreed. I'd probably call it something like account_config rather than metadata. Also, I'm not sure if having account seed be a word is needed. I think 128 bits (2 field elements) may be sufficient for the seed. @Al-Kindi-0 - what do you think?

I think this boils down to whether we are concerned about collisions or only pre-image resistance. I think in the current context collisions are not a problem and thus 128 bits for the seed should be fine.

Al-Kindi-0 commented 4 months ago

I think 128 bits (2 field elements) may be sufficient for the seed. @Al-Kindi-0 - what do you think?

To expand on my comment, hash is a one-to-one mapping, imagine the worst hash function possible, something like the id function, changing only 2 field elements of the domain changes only 2 field elements in the image, always leaving the top bits of the digest set to 0, which is to say, from the possible 2256 output values it covers only 2128, this would be true for a CHF too. IOW, if we only have 2 field elements for the seed, we are guaranteed to not cover the codomain, having the seed size being at least the digest size would make it possible to cover the codomain (but AFAIK there is not guarantee). I have no idea how bad that is, specially because the property that we want is not to cover the codomain, but to have an image that satisfy our criteria, it would likely not be a real issue in practice, but I don't think there are downsides to having the seed being bigger (?)

Usually, a hash function compresses the input in some way and the simplest such case would be a 2-to-1 compression. This is clearly not one-to-one and hence I am probably missing something in your comment.

bobbinth commented 4 months ago

But wouldn't the pre-image found by the attacker go through the "the last 16 bits must be 0"-check at some point for this to be useful for the attacker? Or does any pre-image do?

The way I think about it is this:

For the user, the task is to find $x = hash(seed_u || code_u || storage_u)[0]$ such that $x$ has the last significant 16 bits set to 0 (and some other properties which I'm skipping here for brevity) by trying different values of $seed_u$. The work involved in this is $2^{16}$.

For the attacker, the task is to find $x = hash(seed_a || code_a || storage_a)[0]$ where $x$ is exactly the same as the $x$ found by the user (also by trying different values of $seed_a$). Since $x$ is 64 bits (i.e., the first element of the word), the work involved in this is $2^{64}$. This work does not depend on the number of trailing zeros in $x$. For example, if we require the user to find $x$ with 32 trailing zeros, the attacker would still need to expand $2^{64}$ PoW.

Al-Kindi-0 commented 4 months ago

Exactly, the hardness of finding a preimage is dependent only on the size of the image space (i.e., where x lives) for a cryptographic hash function and not on specific properties of the pre-image itself.

bobbinth commented 4 months ago

So, this seems to suggest that the amount of data we can encode in the ID is pretty limited because otherwise we'd impose too big of a computational burden on the user. Basically, we have a trade-off:

Impose PoW against the non-ID portion of the hash (as we do now) to increase the security against ID highjacking attacks.
Encode useful info into the ID.

The PoW requirement for the user would be the sum of PoW for both items. For the attacker, the security in bits would be 64 + PoW from item 1 above.

So, for example, if we want to have 80 bits of security against ID highjacking and 16 trailing zeros in the ID, we'd need 32 bits of PoW for the user - which I think is too much.

I think we do want to have at least 80 bits of security against ID hijacking - which means that we are probably limited to:

4 - 8 bits of useful info in the ID of regular accounts (of which we already use 4). This way, total PoW does not exceed 24 bits.
~20 bits of useful info in the ID of faucets (of which we also already use 4). This way total PoW does not exceed ~36 bits.

hackaugusto commented 4 months ago

Usually, a hash function compresses the input in some way and the simplest such case would be a 2-to-1 compression. This is clearly not one-to-one and hence I am probably missing something in your comment.

My bad, I was trying to paint a picture using the id function and made invalid statements. Basically I was trying to say for n different inputs we will have at most n distinct outputs, we have 256bits space for the output but limit ourselves to 128bits in the input, so we can't cover the whole output space. Not sure how bad that is.

bobbinth commented 4 months ago

for n different inputs we will have at most n distinct outputs, we have 256bits space for the output but limit ourselves to 128bits in the input, so we can't cover the whole output space. Not sure how bad that is.

I don't think this should cause any real problems and I think this happens quite a bit. For example, I believe seed phrases for Bitcoin/Ethereum wallets are frequently 128 bits but the space of potential public-private key pairs is much larger (usually close to 256 bits).

bobbinth commented 4 months ago

Thinking over the above discussion, I think we probably can't encode ticker symbols into faucet IDs. But I think we can still do a couple of things:

Encode PoW requirements into the ID - this could take up 6 bits in the ID.
Reduce the ID size a little by mandating some number of trailing zeros (maybe up to 4 zeros for regular accounts and 16 zeros for faucets).

We could also try to do both.

bobbinth commented 4 months ago

One thing I realized is that we may actually not need to spend PoW on encoding data into the ID. The scheme can be as follows:

let digest = hash(account_seed || account_config || code_root || storage_root);
let account_id = build_id(digest[0], account_config);
assert!(check_pow(digest, account_config));

In the above, build_id() could manipulate the digest[0] element directly to encode info into it. The only restriction is that the result of this manipulation must always be a valid field element and this is not too difficult to enforce.

With this, encoding ticker into the ID is still on the table.

For example, we could say that account_config is a 64 bit value where:

The first 4 bits of account_config define account flags (storage mode, account type etc.).
The next 6 bits specify extra PoW.
For faucet accounts, the last 26 bits define the ticker symbol.

Then, in build_id() we'd do something like this:

Replace top 4 bits of digest[0] with top 4 bits of account_config.
Replace bits 32 - 37 of digest[0] with the next 6 bits of account_config.
For faucet accounts, replace bits 38 - 63 of digest[0] with 26 least significant bits of account_config.
Set bit 31 of digest[0] to 0 (this is to ensure we always get a valid field element).
Use digest[0] as account ID.

This is just an example and we could come up with better ways of doing this.

bobbinth commented 4 months ago

I think one of the questions we need to answer is whether we want to allow users to generate IDs without PoW (i.e., allow them to choose 0 for extra PoW in configurable PoW case). The main benefit of this is that we don't have to patch the code for testing purposes as we can use accounts w/o PoW requirements.

The negatives are:

If we encode ticker symbols into the ID, we probably should require pretty significant PoW to make it more difficult for users to claim thousands of accounts for some existing or future ticker symbol. The counter-argument for this is whether spending extra $1 - $2 USD per ID is really going to deter anyone.
Without PoW, possibility of some DoS attacks may be higher. For example, it would be pretty easy to generate a bunch of IDs all sharing the same 32-bit prefix (which may affect storage performance). The work required for this would be $n \cdot 2^{32}$ where $n$ is the number of IDs. So, for a few thousand dollars one would be able to generate thousands of IDs. Now, if we require even 20 bits of PoW per ID, the cost would be $n \cdot 2^{52}$ - which would be many million dollars even for a handful of such colliding IDs.

Based on the above, my current thinking is that we should probably require at least 20 bits PoW for ID - and potentially more.

One other interesting though: in case we do want to encode ticker symbols into IDs, we could make PoW requirements "dynamic" based on how many IDs with the same ticker symbol have been already generated. The scheme could be:

The structure of the faucet ID would be as follows:

The first 3 bits encode storage mode and whether the faucet is fungible or non-fungible.
The following 29 bits encode a 6-character ticker symbol.
The following 16 bits encode a counter.

The PoW requirement is based on the counter. For counter=0, the PoW requirement is set at some baseline level (say 32 bits), with each increment of the counter, the PoW requirement doubles. So, if someone wanted to generate another ID with the same ticker, the would need to use a different counter (so, would increment it by 1), and would need to perform 33 bits of PoW etc.

Not sure this is a good idea yet - but wanted to share it.

bobbinth commented 4 months ago

Trying to put together the thoughts from the above comments, here is a concrete proposal on how we can refactor account ID structure:

Faucet ID

Structure:

Length: 40 bits (10 chars in hex, 8 chars in base32).
First 2 bits encode storage mode (same as now).
3rd bit is a flag set to 0 for fungible and to 1 for non-fungible assets.
The next 29 bits encode a 6-char ticker symbol using a 27 char alphabet (letters + 0).
The remaining 8 bits encode a counter, which also specifies PoW requirements for generating the ID.

Generation: The ID is computed as follows: the user sets the first 32 bits of account_config (as single field element) the first 32 bits of the desired ID, and the last 8 bits of account_config to the counter:

let digest = hash(account_config || account_seed || code_root || storage_root);
// append the last 8 bits of account_config to the first 32 bits of digest[0]
let account_id = (digest[0] & 0xFFFFFFFF00000000) & ((account_config & 0x00000000000000FF) << 24);

The ID is valid if:

The first 2 bits digest[0] encode a valid storage mode.
First 32 bits digest[0] are the same as first 32 bits of account_config (I believe this implied 32 bits PoW).
- Also, the first 32 bits of digest[0] are not all ones.
digest[3] has as many leading zeros as defined by the value in the last 8 bits of account_config + 4 (i.e., the counter). That is, the minimum PoW is 36 bits (32 + 4), and it grows by 1 bit with every increment of the counter.

Potential downsides:

The number of faucets is probably limited to 10 - 20 billion, but I'm not sure if that's an issue.

Regular account ID

Structure:

Length: 60 bits (15 chars in hex, 12 chars in base32).
First 2 bits encode storage mode (same as now).
3rd bit is a flag set to 0 for mutable and to 1 for immutable accounts.
The next 49 bits is random data.
The remaining 8 bits encode PoW requirement.

Generation: The ID is computed as follows: the user sets the first 3 bits of account_config to the desired properties of the account, and the last 8 bits to the PoW requirement.

let digest = hash(account_config || account_seed || code_root || storage_root);
// append the last 8 bits of account_config to the first 52 bits of digest[0]
let account_id = (digest[0] & 0xFFFFFFFFFFFFF000) | ((account_config & 0x00000000000000FF) << 4);

The ID is valid if:

The first 2 bits digest[0] encode a valid storage mode.
The first 3 bits of digest[0] are the same as the first 3 bits of account_config).
- Also, the first 32 bits of digest[0] are not all ones.
The last 8 bits of account_config contain a value of 20 or greater.
digest[3] has as many leading zeros as defined by the value in the last 8 bits of account_config. Together with 3 bits of PoW required for the first 3 bits, we get minimum of 23 bits PoW for regular accounts.

Summary

We get relatively short IDs - 60 bits for regular accounts and 40 bits for faucets. a. This makes IDs relatively easy to tell apart visually. b. But also makes them easy to tell apart in MASM - i.e., if the 16 least significant bits are all zeros it is a faucet; if not, it is a regular account.
The PoW requirements can be scaled dynamically from some pre-defined minimum (36 bits for faucets and 23 bits for regular accounts).
Faucet IDs encode a ticker symbol - which may help with more user-friendly way of displaying assets (though, we still need to have additional mechanisms for this as we cannot guarantee ticker uniqueness).

hackaugusto commented 3 months ago

One thing that I just realized. On-chain accounts have the same issue as on-chain notes, meaning we need a mechanism to ensure the complete data account is made public, and that needs to be verified by the kernel somehow.

bobbinth commented 3 months ago

One thing that I just realized. On-chain accounts have the same issue as on-chain notes, meaning we need a mechanism to ensure the complete data account is made public, and that needs to be verified by the kernel somehow.

Yep - but in case of accounts we can delay solving this problem a bit more. Basically, the operator won't accept a public account update unless the correct delta is available. This works in the short term. In the longer term, we'll need to validate the correctness of the delta in the kernel. We wanted to create an issue for this, but not sure if we have.

hackaugusto commented 3 months ago

b. But also makes them easy to tell apart in MASM - i.e., if the 16 least significant bits are all zeros it is a faucet; if not, it is a regular account.

I don't think this is a good check. For a regular account, we have the following:

0bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxpppppppp0000

Where x is a digest value, p is the PoW, and 0 is defined as that.

A malicious user could set the PoW to 0, so it becomes:

0bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx000000000000

And then do a bit of PoW to get 4 additional zeros:

0bxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx0000000000000000

And now they have an id that is both a faucet and a normal account. So we should change the lower 0b0000 to something else.

Edit: The position of the bit doesn't matter, we just need a bit that is not under the control of the malicious user. The bit could be one of the lowest 4 bits as I mentioned above or anywhere else, that was just the first thing that came to mind. We can also move one of these zeros from the bottom to the top:

0bxx0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx000000000000000

And have it set to 1 when the account is a faucet. That is to say, one approach is to just keep what we already have.

hackaugusto commented 3 months ago

I suppose the idea to position the PoW config at different places was to allow the above check to be performed. If there is agreement on keeping the existing approach, I would instead keep the PoW for both accounts in the same place. That is less confusing, and allows for the same code to be used in the kernel to load the PoW. So instead of:

(account_config & 0x00000000000000FF) << 24 # PoW position for faucets
(account_config & 0x00000000000000FF) << 4  # PoW position for normal accounts

There would be only:

(account_config & 0x00000000000000FF) << 4  # PoW position for all accounts

I also think we should pack the account configuration together. So instead of having these masks for the configuration:

0b1111000000000000000000000000000000000000000000000000111111110000

It would be simply:

0b0000000000000000000000000000000000000000000000000000111111111111

This makes the MASM code simpler, since we can use a u32clz to check the zeros on the high half, and extract the config values from the low half with u32and and u32shr

Edit: Having the zeros on the highbits also makes the final value smaller, which is easier on the eyes :)

hackaugusto commented 3 months ago

The next 29 bits encode a 6-char ticker symbol using a 27 char alphabet (letters + 0).

I don't understand this. At first I thought a code point would be 5bits, but that would mean we need either 30bits or 25bits, the former with 6 chars, the later with 5. This would have additional problems:

5bits can encode 32values, but we define the meaning of only 27values
If the last 5 values are invalid:
- We would need to write complicated MASM code to validated it
- It reduces the total number of representable values

So maybe the encoding is done with something like this?

fn chars(mut v: u64) -> [u8; 6] {
    let mut pos = 6;
    let mut res = [0; 6];
    while pos > 0 {
        let char = v % 27;
        pos -= 1;
        res[pos] = char as u8;
        v /= 27;
    }
    res
}

But encoding a value [26, 26, 26, 26, 26, 26] gives me:

0b1001101111011111000101001010110010

Which is 34bits and not 29bits. What is exactly the encoding you had in mind?

Edit: I had a mistake on the previous comment, I had an off-by-one because I forgot to count 0, the previous code was using %28 instead of %27. The question is the same though.

Edit2: The simplest solution IMO is to just use 5bits per char, set 30bits for the tick, and define the values from 27 to 32 to anything, e.g. []()|

Edit3: Humm ... "Which is 34bits and not 29bits", that can't be right, can it? How would this need more bits than the version that uses 5bits per code point?

Edit4: Yeah, no, something is broken on my code above. (27 ** 6) - 1 is 0b10111000101111001000101001000, which fits the 29bits mentioned before.

hackaugusto commented 3 months ago

Actually, why is the account_id generation overwritting values?

let account_id = (digest[0] & 0xFFFFFFFFFFFFF000) | ((account_config & 0x00000000000000FF) << 4);

Instead, if we use the packed account_config like it is suggested above:

0b0000000000000000000000000000000000000000000000000000111111111111

We can require the digest to match the config during validation:

digest[0] & CONFIG_MASK == account_config

That makes the code simpler (no patching of the digest). Analysis is also simpler, since we don't have to keep remembering to count bits that are discarded because they are overwritten. It also increases the PoW, meaning we can require fewer zeros and have a bigger account id space.

hackaugusto commented 3 months ago

Latest design:

Remove tickers
Both the faucet and regular accounts have size 60bits
The metadata of 4bits stays in the high bits, the metadata is overwritten to the account id
The PoW is an absolute value with a minimum per account type (24bits for regular/ 35bits for faucet). Its encoding is lowered from 8bits to 6bits, and it counts only against the digest[3]

Rationale:

Each non random bit in the account id reduces the advantage the user has against an attacker.
- because:
  - With an overwrite mode, let account_id = (digest[0] & 0xFFFFFFFF00000000) & ((account_config & 0x00000000000000FF) << 24);, the bits excluded by the mask 0xFFFFFFFF00000000 don't have to be matched by the attacker, so they no longer require PoW, reducing the advantage
  - With the validation mode (for account config and tickers), both the user and the attacker would need to do the PoW, also reducing the advantage
- as a consequence:
  - the number of zeros in the account ids have been reduce to a minimum.
  - all checked bits in the account id have been remove in favor of overwrite
  - the tickers have been removed
We can't move the metadata because things like the tag schema depends on it
We shouldn't move the account config together with the metadata, because the high bits are used to key values in the sparse merkle trees, which would increase collisions
The minimum PoW is not to protect the users, but the network. It is spam protection
The maximum PoW for the user is 64bits, for the attacker is 124bits. The minimum PoW for the user is 24bits for the attacker is 88bits. 60bits is plenty for, so 88bits is on the safe side, and 128bits will be safe for a while

hackaugusto commented 3 months ago

Note: The scheme above has a problem. Because the metadata bits are overwritten, it is possible for a user to grind to find ones in the high random bits, and choose a metadata to set the other high too, the two values individually are valid felts, but the result of merging the two may not.

In theory, it would allow a user to set all high bits and wrap the account id. I can't think of an actual attack coming out of this, but it is super odd behavior. (Currently this is not possible because we don't have the 0b11 storage flag.)

One solution to the issue above is to require the user to grind the metadata and pow bits, since that will force the result to be a valid felt.

Edit: I'm changing from overwrite mode to check mode. Even though we don't get any advantage from the grinding in this case, the MASM code turns out simpler and the problem above goes away.

hackaugusto commented 3 months ago

Note: I think we should forbid P2ID for accounts that have not been created yet. My rationale is the following:

For the P2ID to be created two things are needed 1. Alice needs to create the account id, and provide it to Bob to be used 2. Bob needs to create a note with the P2ID to Alice's new account id, and share it with her somehow
Sharing the id is required for both actions above, which requires thought to prevent leaking the ids. Otherwise the assumption we have regarding the time an attacker have to grind the id breaks

If instead of P2ID we incentivize something like pay-to-hash, then we get a few benefits:

The secret doesn't have to be shared by Alice to Bob, only the hash.
We consider breaking the hash infeasible, so we don't have to worry about timing
We don't lose any functionality

Unfortunately, even if the P2ID script requires the account to exist, that doesn't mean it wouldn't work if people really wanted to use that (by, for example, creating the id, going over the steps above, and then when consuming the note one would send two transactions, one that creates the account followed by one that consumes it). So this is mostly a best effort kind of thing, but IMO better than the alternative.

bobbinth commented 3 months ago

Sharing the id is required for both actions above, which requires thought to prevent leaking the ids. Otherwise the assumption we have regarding the time an attacker have to grind the id breaks

I actually don't think that's a problem: the assumption we make is more about cost rather than time. That is: it is going to cost the attacker more than $100B USD to to generate exactly the same ID for different code/storage root. We can then take a pretty conservative safety margin (e.g., 1000x) and say that as long as my account doesn't have more than $10M USD sent to it, nobody is going to bother attacking it.

So, basically, the condition becomes: I need to record my account on chain before people send me $10M USD. Up until that time, it is safe to give my account ID out without recording it.

hackaugusto commented 3 months ago

We need a mechanism to protect against rainbow table attacks, see @bobbinth 's comment

hackaugusto commented 3 months ago

Some notes:

Currently there is a ZERO word in the digest computation used as padding. We can increase the seed from 1 to 2 words without incurring any cost in the VM, while adding 8bytes to each entry of the table, increasing it by 33% for the existing format.
- Or even better, we use this word to store some time reference, used to invalid the hashes in the table after some time.
- One possibility is the rollup's epoch, an attacker would have to recompute the complete table each epoch.
- Another better possibility is a block header, and we would require the block header to be no older than say 100 blocks. [Edit: well, 100 is way too low, because PoW for faucet can take a bit of time, lets say something like 3 days worth of blocks]

Some requests for @bobbinth :

I think we should write our cost model down. In this particular case these are the things I have in mind:
1. Cost of acquisition. For example 300 USD a piece for a GPU.
2. Hash power. For example 1M hashes/sec per GPU
3. Cost per million hashes. For example 0.2 USD (basically energy cost)
4. The hash function this is based on
The above should provide the following:
- A common reference to write estimations, and to extrapolate on
- A hardware baseline, because the hash power changes drastically from cpu/gpu/asic
- Document all these assumptions and make it readily available
What are the requirements
- Minimum cost to attack a specific account (i.e. hash collision cost)
- Minimum cost to attack any account (i.e. rainbow cost)
- Minimum cost to produce a normal and faucet account (i.e. spam protection cost)
- Minimum account id space (i.e. how many accounts)
- Expected lifetime, 10/20/50/100 years (i.e. how long in the future should we target, before assuming the protocol would be updated, to set the time frame to be used for our cost extrapolations)
- What are the actors we are protecting against, and some rough budget ideas (i.e. a hacker with 5k hardware, or a state actor with some budget, 1Mi/1Bi, etc) [Well, I suppose this alone would implicitly define the points above, so this is probably enough. Maybe we should have a distinction of cost per account vs cost for the rollup]

I would use the above to:

Assume the Moore's law is not dead, and given the time frame, extrapolate the updated cost per million hashes
Estimate the cost for each actor, and then deduce the required params from that

hackaugusto commented 3 months ago

For the hash power, one thing that we could do is to use the bitcoin network.

Highest difficulty was on the week 25/Apr - 8/May this year, I picked a random block 842000 from the 4/May, the bits is set to 0x170331db which corresponds to a little more than $2^{68}$ hashes/sec, computed with:

import math
bits = 0x1703629a
exp = (bits >> 24) - 3
mantissa = bits % 2**24
target = mantissa * 256 ** exp
per_block = 256 - len("{:0256b}".format(target).lstrip('0'))
secs_per_block = 10 * 60
hashes_per_sec = (2 ** per_block) / secs_per_block
math.log(hashes_per_sec, 2)

Looking at some online graphs 1 2 3, they all seem to agree that the hash power is around $650e18$ or $2^{69}$ H/s.

Assuming the network is giving a profit to the miners, the current reward is at 3.125 BTC, and the price is at 60K/EUR at the time, that gives a cost per trillion hashes of 0.0002 EUR , computed with:

btc_price = 60_000
trillion = 10 ** 12
hash_power = 2 ** 68
btc_price / (hash_power / trillion)

or about 1EUR for $2^{53}$, computed with:

btc_price = 60_000
hash_power = 2**69
hashes = 2 ** 53
btc_price / (hash_power / hashes)

(I think the above is a bit on the cheap side, because the price is on a all time high, and it assumes the whole fee amount is used to pay for variable costs, ignoring profit)

To estimate the cost of acquisition for this hashing power, it seems machines are at 350TH/s for about 10k USD, so the above hashpower should have an initial cost of about 8Bi, computed with:

hash_power = 2**68
teha = 10 ** 12
units = hash_power / (350 * teha)
cost_per_unit = 10_000
units * cost_per_unit

I'm not sure if the above figure makes sense, but I don't have a better estimation.

bobbinth commented 3 months ago

Assuming the network is giving a profit to the miners, the current reward is at 3.125 BTC, and the price is at 60K/EUR at the time, that gives a cost per trillion hashes of 0.0002 EUR , computed with:

I think comparing with BTC is on the conservative side (which may not be a bad thing) as SHA256 is a way more performant hash function as compared to RPO and also many billions of dollars were spent on building ASICs for it.

If we take some current CPU performance, it will look something like this:

Hashes per second per CPU core: ~500K
Hours to compute $10^{12}$ hashes: ~550
Cost of CPU core per hour: ~$0.01 USD
Cost to compute $10^{12}$ hashes: ~$6 USD

With GPUs, we can probably get about 20x reduction in cost, so, we'd get cost per trillion hashes around $0.3 USD. Or about $300 for $2^{50}$ hashes.

To get to about 1 EUR for $2^{53}$ hashes, we'd need another 2000x improvement, which, I guess with enough investment and time is realistic.

To summarize:

We can estimate current cost of computing $2^{60}$ hashes at around $300K USD.
Over a long enough timeframe and with many billions of dollars of investment, this can probably be reduced to about $100 USD.

bobbinth commented 3 months ago

Another better possibility is a block header, and we would require the block header to be no older than say 100 blocks. [Edit: well, 100 is way too low, because PoW for faucet can take a bit of time, lets say something like 3 days worth of blocks]

This is interesting - though, one downside is that the user would have a limited time to record their account on chain and 3 days (or something similar) is probably too short.

One potential mitigation for this is to "anchor" each ID to a specific epoch. This could be done by setting some 16 bits of an account ID to 16 bits upper bits of the block number. Then, block hash of that epoch would need to be used in computing account ID. This would basically force the attacker to recompute the rainbow table every $2^{16}$ blocks (e.g., every 3 - 7 days) and if the cost of this is signifiant enough, this may be an acceptable solution.

bobbinth commented 3 months ago

I guess one question we should answer is whether we should just go to something like 120-bit IDs and then all problems would go away. The main downside of 120-bit IDs is that we'd need to change how our assets are structured - but if we can find a good solution to this, it may be the best approach.

hackaugusto commented 3 months ago

The main downside of 120-bit IDs is that we'd need to change how our assets are structured - but if we can find a good solution to this, it may be the best approach

At first I thought you meant because the account id is encoded into the assets. But we have enough room to store the amount (which is a felt), and have 2 felts left to encode the non-fungible tokens. So that shouldn't be the problem.

Is the issue you have in mind the size of the leaves in the SMT?

bobbinth commented 3 months ago

At first I thought you meant because the account id is encoded into the assets. But we have enough room to store the amount (which is a felt), and have 2 felts left to encode the non-fungible tokens. So that shouldn't be the problem.

This is true for fungible assets, but not for non-fungible assets. For non-fungible assets we also put account ID into the asset. This is OK if ID size is just 1 field element, but if it is 2 then the collision resistance of non-fungible assets goes down to 64 bits, which is not great.

hackaugusto commented 3 months ago

This is OK if ID size is just 1 field element, but if it is 2 then the collision resistance of non-fungible assets goes down to 64 bits, which is not great.

The account id should be random, no? So I suppose you're saying that for non-fungible tokens, there is only 1felt of random data to randomly assign it to a bucket in the SMT? And that would be a problem if an account has many non fungible assets from the same account?

bobbinth commented 3 months ago

So I suppose you're saying that for non-fungible tokens, there is only 1felt of random data to randomly assign it to a bucket in the SMT? And that would be a problem if an account has many non fungible assets from the same account?

Yeah - it is mainly an issue if we try to issue many non-fungible assets from the same account and some of them end up with the same 64-bit prefixes (it is actually only 32 bits of work to find 2 assets with the same 64-bit prefix). We could try to re-arrange the non-fungible asset somehow to make it so that the first 2 elements are unique for each asset issued by the same faucet, but that only gives us 64 bits of collision resistance. May not be too big of a problem though (if handled appropriately).

hackaugusto commented 3 months ago

Yeah - it is mainly an issue if we try to issue many non-fungible assets from the same account and some of them end up with the same 64-bit prefixes

But that is already the case, no? The vault key is the whole word, but the SMT uses only most significant felt/64bits as a key.

(it is actually only 32 bits of work to find 2 assets with the same 64-bit prefix).

I don't understand this, could you expand a bit?

bobbinth commented 3 months ago

(it is actually only 32 bits of work to find 2 assets with the same 64-bit prefix).

I don't understand this, could you expand a bit?

Due to birthday paradox it takes the complexity of finding collisions is square root of the search space. So, for a 64-bit search space, we need to spend 32 bits of work to find two collisions.

0xPolygonMiden / miden-base