MetaMask / metamask-extension

:globe_with_meridians: :electric_plug: The MetaMask browser extension enables browsing Ethereum blockchain enabled websites
https://metamask.io
Other
11.57k stars 4.73k forks source link

Nonce calculation is broken for private networks #1999

Closed iwooden closed 3 years ago

iwooden commented 6 years ago

Currently, nonce calculation is broken for private networks. The main problem is that transaction history is retained for an account even after switching the RPC endpoint to a new network. This can cause the nonce on a new network to be too high, resulting in Metamask transactions never going through.

To reproduce:

  1. Get access to two different private network RPC endpoints.
  2. Create a new account/address in Metamask, with no previous transaction history.
  3. Connect Metamask to the first private endpoint (select "Custom RPC" from the network selection dropdown, enter the RPC URL).
  4. Submit and sign an Ethereum transaction. Note the transaction entry in the transaction history section.
  5. Connect Metamask to the second private endpoint. Note that the transaction entry is still in the transaction history section.
  6. Submit and sign an Ethereum transaction. Note that the transaction hangs in a "pending" state in the Metamask UI.
  7. If you have access to the geth console for one of the nodes servicing the second RPC endpoint, check the txpool. Note that the transaction submitted was submitted with a nonce of 0x1, when the transaction count for the address in question is 0. The transaction will never go through.

I can see in app/scripts/lib/nonce-tracker.js that you're taking the max value between the locally calculated nonce and the transaction count for the address in question. This works for public/test networks where the expected nonce for an account can never go down, but that assumption doesn't hold for private networks when you can switch from a chain where an account's transaction count is 8 to one where the transaction count is 0.

One possible fix is clearing the "Private Network" transaction history when switching RPC endpoints. This isn't totally ideal, but does result in the correct nonce being calculated, and doesn't require you to redo any of the current nonce calculation functions.

szerintedmi commented 6 years ago

I'm struggling with the same issue while testing. Have you found a workaround? Even restarting the browser doesn't help sometimes.

danfinlay commented 6 years ago

Sorry for neglecting this over the long weekend, I'll be working to fix this soon.

iwooden commented 6 years ago

No problem Dan, thanks for taking the time. I investigated a little further, and it looks like this issue has to do with network ID specification for private networks.

Account history is tied to network ID. So, when you switch to a private network with a different network ID, you will get a fresh account with the correct nonce. However, if you switch to a private network with the same network ID, the account history is the same and the nonce mismatch will occur.

So one workaround for right now is to ensure that the various private networks you interact with have different network IDs. However, it would still be really nice to have the option to delete the account's history from the UI, forcing the correct nonce calculation.

danfinlay commented 6 years ago

Oh, I may have misunderstood the nature of this problem. I think private networks should be responsible that they have unique IDs, and we shouldn't work too hard to ensure identically-identified networks both work with MetaMask.

szerintedmi commented 6 years ago

My guess in my case that the problem is not with the network id but the fact that I use the same accounts for different networks while testing.

  1. I execute a tx with account 0x1.. on privatechain with networkid 999 - works fine
  2. I stop private chain and launch testrpc with networkid 888
  3. execution of tx with same account 0x1.. fails on testrpc: invalid nounce. Usual metamask vodoo (change network back and forth) doesn't help, neither browser restart.
danfinlay commented 6 years ago

My guess in my case that the problem is not with the network id but the fact that I use the same accounts for different networks while testing.

Lots of people do that, so it seems unlikely that if this were the bug, it would affect so few people.

I followed your reproduction steps (albeit with testrpc on both instances), and has no problem.

To clarify: You need to specify a distinct networkId and distinct chainId for MetaMask to identify a distinct network. Could you try that and let me know if it works, @szerintedmi?

szerintedmi commented 6 years ago

I retried and I still had the issue with one of my accounts. All other accs work. (on MetaMask v3.9.13)

I played a bit and I have a workaround.

That's what I receive:

popup.js:83532 [ethjs-rpc] rpc error with payload {"id":9271578721185,"jsonrpc":"2.0","params":["0xf893198504a817c800831e848094f99564a5786fedef72ad45a7c85c3e7c9394522a880b1eb7ea25a00000a41509c8cf00000000000000000000000000000000000000000000000000000000000000008207f2a0666e7d3e499ee4a7cec79b1d112bcb6ded924a2dc2b35d246275f50030b3c9cda0445ebcf4e72e1f37280d3521cc4b22c1c93dcdcb9562afdcbde6bc38b8a45b57"],"method":"eth_sendRawTransaction"} Error: Error: the tx doesn't have the correct nonce. account has nonce of: 24 tx has nonce of: 25
    at runCall (/usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:69351:10)
    at /usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:11327:24
    at replenish (/usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:8420:17)
    at iterateeCallback (/usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:8405:17)
    at /usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:8380:16
    at /usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:11332:13
    at /usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:64434:16
    at replenish (/usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:64381:25)
    at /usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:64390:9
    at eachLimit (/usr/local/lib/node_modules/ethereumjs-testrpc/build/cli.node.js:64314:36)
(anonymous) @ popup.js:83532

When I send a tx without MetaMask it works.

It seems the nonce increases with every tx I send from the non MetaMask window:

Error: the tx doesn't have the correct nonce. account has nonce of: 24 tx has nonce of: 29
Error: the tx doesn't have the correct nonce. account has nonce of: 26 tx has nonce of: 29

Once the nonce reaches the MetaMask nonce I can send tx again through MetaMask.

danfinlay commented 6 years ago

I'm very tempted to think that the one affected account is having problems because it specifically had transactions sent on a private chain with identical IDs, just because that's the simplest explanation I can think of.

If that's not it, I'd love to cook up an edge case for our nonce-tracker that it's failing. You may have guessed that when we compose a transaction, we try to account for locally pending transactions, and so it's possible that on a new private chain with the same ID as another, we respect the highest nonce between the network-provided one and the locally-pending-tx derived one, and increment from there.

Let me know if you come up with any theories on this, I'm chipping away on another issue that I understand at the moment, but would be happy to fix any incorrect behavior here if we could identify it.

atomical commented 6 years ago

This is happening to me too. What's the alternative for using Truffle?

okwme commented 6 years ago

happens to me from time to time. no idea why. i uninstall and reinstall metamask to fix it.

aktary commented 6 years ago

@danfinlay you said

To clarify: You need to specify a distinct networkId and distinct chainId for MetaMask to identify a distinct network.

How does one set a new chain id?

I'm having this problem too. Using the same snapshot and account locally and on a dev server, the nonces get out of sync and I can't execute transactions on the local testrpc now.

danfinlay commented 6 years ago

How does one set a new chain id?

It is dependent on the client you are using. I believe the flag is --chainId on geth, although they haven't documented this feature.

onetom commented 6 years ago

I just got the same error message:

Error: the tx doesn't have the correct nonce. account has nonce of: 0 tx has nonce of: 1
   at runCall (/usr/local/lib/node_modules/truffle/build/chain.bundled.js:60148:10)
   at /usr/local/lib/node_modules/truffle/build/chain.bundled.js:12311:24
   at replenish (/usr/local/lib/node_modules/truffle/build/chain.bundled.js:9404:17)

I'm using

If I understand correctly making a couple of transactions against an ephemeral chain, then restarting such chain will lead to this situation.

Reinstalling the MetaMask extension solved the issue "of course".

danfinlay commented 6 years ago

There are two different issues people are reporting here:

  1. They are using private blockchains, and need to use the chainId parameter, so that MetaMask's EIP 155 compatibility works with it correctly.

  2. People are developing against a local blockchain, and they reset the blockchain while MetaMask is pointing at it. In this case, you can work around the problem by switching to another network and back again, no reinstallation needed.

rheaplex commented 6 years ago

To move this off of Twitter. :-) Switching between networks isn't working for me with Truffle 4's truffle develop (which uses the same mnemonic each time) and latest Metamask in either Chromium 62 or Firefox 57 on Debian Stretch. I may be doing it wrong but I can't work out how.

An example of a failed transaction in Firefox 57 with MetaMask 3.12.0 is:

Error: [ethjs-rpc] rpc error with payload {"id":9794961516034,"jsonrpc":"2.0","params":["0xf8ac088504a817c80083012c7e94345ca3e014aaf5dca488057592ee47305d9b3e1080b844a9059cbb000000000000000000000000f17f52151ebef6c7334fad080c5704d77216b73200000000000000000000000000000000000000000000000000000000000001f48222e2a06dee046890a2f3476238691be9bced035939f1c2f3e9d71dc585719412818d08a05a3c71c9227723b4321ac44e3a013a3d6a6907712e63dfa81d98739bf604a145"],"method":"eth_sendRawTransaction"} Error: Error: the tx doesn't have the correct nonce. account has nonce of: 4 tx has nonce of: 8 at runCall (/usr/lib/node_modules/truffle/build/chain.bundled.js:60148:10) at /usr/lib/node_modules/truffle/build/chain.bundled.js:12311:24 at replenish (/usr/lib/node_modules/truffle/build/chain.bundled.js:9404:17) at iterateeCallback (/usr/lib/node_modules/truffle/build/chain.bundled.js:9389:17) at /usr/lib/node_modules/truffle/build/chain.bundled.js:9364:16 at /usr/lib/node_modules/truffle/build/chain.bundled.js:12316:13 at /usr/lib/node_modules/truffle/build/chain.bundled.js:55231:16 at replenish (/usr/lib/node_modules/truffle/build/chain.bundled.js:55178:25) at /usr/lib/node_modules/truffle/build/chain.bundled.js:55187:9 at eachLimit (/usr/lib/node_modules/truffle/build/chain.bundled.js:55111:36)
danfinlay commented 6 years ago

Reviewing the comments in here, this issue is actually a little bigger than what I was suggesting for switching networks and back again. Sorry about the skim.

The problem here is that MetaMask calculates nonces locally on a per-network basis. That means if you connect to it to "the same network" (by ID) on two different endpoints, MetaMask will currently assume these are the same network, use its same history of successful transactions, check the nonce everywhere it can, assume the current node is behind (since MetaMask is aware of a newer tx), and it'll use the latest tx it knows of to calculate nonce.

Current Causes

What we need here is a way of detecting a new network, even when all the signs indicate it's the same network. Some of the signs that prevent MetaMask from noticing this today include:

  1. When the new network is added to the same address without notice.
  2. When the two networks share the same network ID.

Possible Solutions

The second one should only be used if no good solution can be found for the first one.

Some detection strategies:

Periodically check network and chain IDs

This can only partially alleviate cause #1, because if the networks share IDs, it would go undetected.

Detect when checking nonce

Since people usually experience this when sending a TX, nonce calculation time would seem like a good time, but I'm scratching for a good, definite set of indications that the chain is different.

Tracking known blocks

A method for identifying a new network on demand could be fairly reliably implemented (as long as new chains were not identical to previous ones):

Since this method requires re-checking known blocks on the node, the question rises of when we do this. Nonce calculation is a nice failsafe, since this is when the problem first burns people, but it would be better if we could detect immediately, so we don't show wrong balance, show incorrect tx history, etc.

Re-Checking known successful transactions

There's a tricky bit around "confirmed transactions". Part of why we use locally confirmed transactions, is because MetaMask is sometimes pointed at RPC endpoints that might not be fully synchronized, and in these cases, we still want to generate valid nonces.

We could re-check our oldest known successful transaction by hash, and if it's unknown, that could also be used to signal a remote network change.

Times we could perform this procedure

Conclusion

I think I have a fairly actionable solution here, please add any improvements anyone can think of. It seems like we've had a big bump in developer experimentation recently, because this has definitely been behavior for the lifetime, but we've had a huge spike in actual complaints, so it seems like this feature is fairly important to some segment of users.

danfinlay commented 6 years ago

When detecting a new network with an identical ID, some special things will need to be done:

Dealing with TX history

Either:

Short term easy solution is to trash that chain's tx history.

danfinlay commented 6 years ago

Ok, so I've done a lot of this work now on this branch, but it doesn't work with TestRPC yet because of a testrpc bug where it throws an error when queried for an unknown transaction. Without a proper null response, MetaMask would have to parse the error on a per-client basis to account for non-conformity, and I'd really rather not go down that right now.

Once that issue is closed, this branch should be ready for PR & QA.

danfinlay commented 6 years ago

Currently blocked by testrpc fixing their getTransactionByHash method so we can detect a new network.

CharlesFr commented 6 years ago

Any ideas when this will be fixed? It's constantly happening despite changing network_id: "6" to a random number.

danfinlay commented 6 years ago

@CharlesFr Could be fixed very soon. I'm not sure what problem you're having, though. It could be the ol' geth users' EIP155 issue, ie "set chainId to the same as networkId".

tcoulter commented 6 years ago

Hey @danfinlay. @benjamincburns Suggested to me that the getTransactionByHash issue you were seeing above was due to an old TestRPC version. Is that still a blocker?

danfinlay commented 6 years ago

We'll be re-QA'ing today, forgot about that, thanks @tcoulter!

robertsdotpm commented 6 years ago

A work around for truffle develop is simply to patch it to use a random network ID. Open up /usr/lib/node_modules/truffle/build/cli.bundled.js

Cnt + F for var testrpcOptions = { host: "localhost", port: 9545,

Look for network_id and replace it with network_id: Math.floor(Math.random() * 20000) + 2028 (If you want to add mining here too you can add blocktime: 1 to the params)

Then look down slightly for the logging code:

config.logger.log(Truffle Develop started at ${url});

And add a line to show the new network ID:

config.logger.log("Network id = " + testrpcOptions["network_id"]);

It's not perfect but you'll at least be able to get some work done. Sounds like MetaMask is close to being fixed now anyway though.

adrianmcli commented 6 years ago

Is this issue still being looked at? I am experiencing this issue almost a month after the QA for the fix was supposed to happen.

danfinlay commented 6 years ago

Sorry, that fix was put aside pending a more long term fix. The simple work around for now is to start your testrpc with a new network/chain ID each time.

robertsdotpm commented 6 years ago

In the end, it was too painful for me doing any kind of workaround. I now don't use metamask for the testing of live apps. For that, I use geth running on a private chain with web3 setup to connect to it from the app. I still use truffle tests for unit testing though.

If anyone wants to switch to using geth like this there's docs on the truffle website.

altaudio commented 6 years ago

I managed to get the work around working, but my file path was ‘/Users/alexsmith/.nvm/versions/node/v8.4.0/lib/node_modules/truffle/build/cli.bundled.js’ and is given in terminal after running ‘npm install -g truffle’

adrianmcli commented 6 years ago

In the end, it was too painful for me doing any kind of workaround. I now don't use metamask for the testing of live apps.

@robertsdotpm this is what I've found as well. I've given up on MetaMask until I am ready to deploy onto a testnet. This sucked up way too much time for it to have been productive.

wanderingstan commented 6 years ago

Building on the fix from @robertsdotpm, the following sed command will patch truffle to use a random network and display the network on start:

sed -i .bak 's/network_id: 4447/network_id: Math.floor(Math.random() * 20000) + 2028/; s/config.logger.log(`Truffle Develop started/config.logger.log(`Truffle Develop started with network id ${testrpcOptions.network_id}/' $(npm root -g)/truffle/build/cli.bundled.js

Sample output:

$ truffle develop
Truffle Develop started with network id 10277 at http://localhost:9545/

It will create a backup of the file named cli.bundled.js.bak. Of course, this patch will be overwritten if you update or reinstall truffle.

Note that truffle closed their related issue, assuming this would be fixed in metamask.

danfinlay commented 6 years ago

@wanderingstan Let me clarify:

We have not indefinitely postponed working on this fix. Please stop spreading that rumor.

The initial patch we had was found to not be satisfactory, was a bit of a hack, and we were working with the Truffle team on making a more sustainable long term solution. The solution we began to favor was to add websocket support to both Truffle and MetaMask, because then we could actually detect a disconnect, and reload our network state. There's not a great way to detect a remote server restart over the RPC polling API.

MetaMask has an open issue for completing that websocket support, and it has a bounty. https://github.com/MetaMask/metamask-extension/issues/2350

The MetaMask team is mostly currently taking a very much needed break for the holiday, we'll be resuming this work early in the new year.

wanderingstan commented 6 years ago

Thanks for the clarification and good news. No harm was intended. From the outside, "...put aside pending a more long term fix" sounded a lot like "indefinitely postponing."

Glad Metamask working with the Truffle folks. I convinced our team to move to truffle develop from testrpc, and this issue has caused a lot of confusion and lost time.

Do get a lot rest. All signs point to 2018 being a busy year for all of us in this space!

elie222 commented 6 years ago

This has been really painful for development. Any ideas when this will be fixed? Sounded like we were almost there a month or two ago.

danfinlay commented 6 years ago

Just over a month ago we had a hacky workable solution, and then crypto-kitties happened, and our end of year meetings and workshops happened, and then the holidays happened, and now we're ramping back up.

Please be patient as our team scales to meet the demands of this rapidly growing ecosystem.

SpasiboKojima commented 6 years ago

How about now? I still have this issue, and it's really annoying. Seriously, guys, I don't want to be rude or put rush, but two months already

benjamincburns commented 6 years ago

@danfinlay if you guys still don't have capacity to handle this as of the week after next, perhaps I could pick it up, provided you'd have time to hop on a call for an hour or so to take me through your plan of attack.

danfinlay commented 6 years ago

We talked to the truffle/ganache team, the workaround they favored was adding websocket support to MetaMask, so we can more easily identify the disconnect & reconnect, and it is coming along well:

https://github.com/MetaMask/metamask-extension/issues/2350

danfinlay commented 6 years ago

Oh hey @benjamincburns I missed it was you. Yeah, we should coordinate on the attack, at the very least we should be on the same page, will probably involve some changes on both sides of our codes.

benjamincburns commented 6 years ago

the workaround they favored was adding websocket support to MetaMask

I don't remember being part of that conversation (though I very well could have been).

Let's circle back and discuss more synchronously sometime soon (next couple of weeks). My thinking was that since your connections use http keepalive (you definitely do, because until recently Ganache & co wouldn't exit until MetaMask disconnected, and the block tracker polling means the keepalive timeout is never hit) there is hopefully a simple way that you can already detect disconnect.

wbt commented 6 years ago

Could there be a UI to remove locally stored information about a particular account on a particular network, and/or recomputation of the correct nonce when the user enters the password to unlock Metamask?
I've found that the workarounds of switching networks or disabling/reenabling metamask aren't successful in addressing the issue, and other workarounds make development harder than it should be.

danfinlay commented 6 years ago

Could there be a UI to remove locally stored information about a particular account on a particular network, and/or recomputation of the correct nonce when the user enters the password to unlock Metamask?

Yeah, actually that wouldn't be hard as a shorter-term workaround. Maybe we could tuck it under settings or something? Probably wouldn't want users messing up their accounts by putting it in the main menu.

wbt commented 6 years ago

I've also got a less painful short-term Truffle-developer workaround: It's a little migration file that requires a hard-coded input of the transaction number MetaMask is sending, and runs otherwise-useless transactions from the source account to make up the difference between current and target transaction numbers. It's not 100% scalable but can dramatically increase the time between metamask reinstalls, probably beyond the time it'll take to fix this bug!

wbt commented 6 years ago

Also, this probably isn't the only issue in which Metamask's cache could cause problems. Having a way to clear out any locally cached information that can then be re-read from the blockchain may be helpful for other yet-unreported issues.

mryellow commented 6 years ago

UI to remove locally stored information about a particular account on a particular network

Could kill two birds with one stone and implement a method for removing accounts. Then you simply remove and re-add.

Now instead of having a bunch of orphan test accounts I can simply delete them. #2638 #2028 #1658

wbt commented 6 years ago

Having the transaction count checked at the time the person enters their password would be a reasonably logical place to put it, and more logical than checking periodically (as proposed in https://github.com/MetaMask/metamask-extension/issues/2373). This would also fit well with a development workflow in which someone finishes a work session and closes things out, then on the next work session starts up Ganache/testrpc and reauthenticates with Metamask.

@bmmpxf's idea of "reset network" would be better; I'd put that in a right-click menu when right-clicking on a network in the drop-down list of networks. (I'm not sure how easy it is to actually build that). Also, I'd call it something other than "reset network," because it's just resyncing the locally cached data with the network, not modifying anything on the network.

Both would be even better.

danfinlay commented 6 years ago

Could kill two birds with one stone and implement a method for removing accounts. Then you simply remove and re-add.

This always seems simple, but the seed phrase makes it complicated, as discussed in some of the issues you linked: It always generates in the same order. "Removing an account" has always had a debated meaning in this context. In yours, it means wiping all of its related data. Sometimes people just want to hide it, with the option of revealing later. While that distinction is unclear, I don't agree this is a trivial "two birds" change to make.

Having the transaction count checked at the time the person enters their password would be a reasonably logical place to put it

We're talking about situations where the backend blockchain is reset while MetaMask is connected to it.

We currently check tx count when a user is trying to send a transaction, which is even more aggressive than you're suggesting. The reason this problem still exists is because we also account for locally pending or confirmed transactions, since the user may have sent transactions via a different provider, and the current one may not be aware of the latest state yet (a surprisingly common problem when serving against load-balanced providers).

mryellow commented 6 years ago

seed phrase makes it complicated

Yeah did notice that bit re-reading tickets. Perhaps can be a feature on "loose" accounts only.

danfinlay commented 6 years ago

Perhaps can be a feature on "loose" accounts only.

That's why #2638 exists.

wbt commented 6 years ago

We're talking about situations where the backend blockchain is reset while MetaMask is connected to it.

I'm encountering the situation where the backend blockchain is reset even while MetaMask is not connected to it: between the time I close Chrome at the end of one session and reopen it at the start of the next.

MetaMask is apparently not checking the sending account's tx count when I open Chrome and enter my MetaMask password, nor is it checking before attempting to send a transaction (or at least it doesn't seem to be using that information in deciding what nonce to send). Fixing either of those would seem like a step in the right direction.

I agree that checking only on login is probably not the best solution, and would probably get tired of having to cycle through (open MetaMask -> open unlabeled menu -> Log Out -> enter password) every time the backend blockchain was reset. I don't think that should be the only time MetaMask checks for and handles resets. However, it does seem like a logical place for one such check/handling, and a huge improvement over requiring uninstall/reinstall or running fake transactions as workarounds.

danfinlay commented 6 years ago

I'm encountering the situation where the backend blockchain is reset even while MetaMask is not connected to it: between the time I close Chrome at the end of one session and reopen it at the start of the next.

Yes, sorry, this is also accounted for. Because of the way we track pending/confirmed transactions to calculate the nonce, closing MetaMask and re-opening it has the same result.

To "reset tx count" on log-in would mean we deliberately "forget" pending and confirmed transactions, which could have dangerous consequences for normal users. Keep in mind, you're developers, and why you're the key to all of Ethereum's success, regular users are the key to its security, and we will not compromise on user safety for developer convenience.

Thread Summary