citp / BlockSci

A high-performance tool for blockchain science and exploration
https://citp.github.io/BlockSci/
GNU General Public License v3.0

Where is the pubkey field deduced from? #365

Closed crypto-perry closed 4 years ago

crypto-perry commented 4 years ago

I would like to understand if accessing the pubkey for an address which was not yet revealed but for which a pubkey exists from another address will return None or not. The only way I see this as being possible is to take every pubkey you see and generate all possible standard addresses (p2pk / p2pkh / p2sh / p2wpkh / p2wsh) and keep an index from them to the pk. Thus even UTXO addresses could be linked to a pk that you have seen before even though this specific address was not revealed before.

Example: If I use public key pkA to

  1. create an output of a certain type (p2pk/p2pkh/p2sh/p2wpkh/p2wsh) -> call this addrA
  2. create a second output with a different type -> call this addrB

I now spend from addrA and thus reveal pkA which means that addrA.pubkey will return pkA. What will addrB.pubkey return?

I guess my question boils down to: Do you keep an index from possible scripts generated from a public key to the public key?

Thank you very much! Amazing tool you are building here!

mplattner commented 4 years ago

Address data is internally de-duplicated and the pubkey is shared between compatible address types. Thus, in your example, addrB.pubkey will return the pubkey pkA, as it was revealed and stored when addrA was spent. (Assuming addrA is of type p2pk, p2pkh, p2wpkh or part of a multisig address, and not p2sh or p2wsh).
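
To illustrate the de-duplication described above, here is a minimal toy model. It is not BlockSci's actual storage layout: the `PubkeyStore` and `Address` classes and the `spend` method are hypothetical names, and the byte values are placeholders. The only idea carried over from the thread is that compatible address types share one record keyed by the pubkey hash, while a p2sh address is keyed by a script hash and therefore sits outside that record.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PubkeyStore:
    """Hypothetical shared storage: one entry per pubkey hash."""
    revealed: Dict[bytes, bytes] = field(default_factory=dict)


@dataclass
class Address:
    """Hypothetical address view; several address types can share one key."""
    kind: str          # e.g. "p2pkh", "p2wpkh", "p2sh"
    key_hash: bytes    # pubkey hash, or script hash for p2sh
    store: PubkeyStore

    @property
    def pubkey(self) -> Optional[bytes]:
        # None until some address sharing this key_hash has been spent.
        return self.store.revealed.get(self.key_hash)

    def spend(self, pubkey: bytes) -> None:
        # Spending reveals the pubkey and stores it once for all sharers.
        self.store.revealed[self.key_hash] = pubkey


store = PubkeyStore()
pk_a = b"\x02" + b"\x11" * 32            # placeholder 33-byte "pubkey"
addr_a = Address("p2pkh", b"\x01" * 20, store)
addr_b = Address("p2wpkh", b"\x01" * 20, store)   # same pubkey hash
addr_c = Address("p2sh", b"\x02" * 20, store)     # keyed by a script hash

addr_a.spend(pk_a)
shared = addr_b.pubkey       # pk_a, revealed through addr_a
unlinked = addr_c.pubkey     # still None in this model
```

In this model `addr_b.pubkey` becomes available as soon as `addr_a` is spent, which matches the behaviour described for p2pk, p2pkh and p2wpkh, whereas the p2sh entry stays unlinked.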

crypto-perry commented 4 years ago

Thanks for the answer! This is what I was assuming as well :) I believe it would also be possible to include p2sh and p2wsh in the equivalence class of a pk by generating all the possible standard p2sh and p2wsh scripts that can be built from that pk. Basically whenever the parser sees a new pk it can generate all the possible address types including p2(w)sh-p2wpkh/p2wsh and maybe more wrapped-address types and index all of these. From my understanding you are already doing this generation from p2pk to p2pkh in order to identify unspent p2pkh as equivalent. I realise this is only one generation instead of the 4 or more I am proposing, but is this the only reason this is not done? I am assuming that access to this index already uses a Bloom filter, so maybe the extra addresses would not add such a huge cost... It would just make the tool really complete if you fully merged the equivalence classes.

mplattner commented 4 years ago

If a P2SH address, e.g. sh_addr, wraps a P2PK(H) address, then sh_addr.wrapped_address.pubkey should work already.

The docs have more information about equivalence of addresses.

crypto-perry commented 4 years ago

That field is only populated once the p2sh is spent, from what I have tested. Actually, this is a good point as well: once the wrapped_address field is populated, why is the p2sh not equivalent to the wrapped_address inside of it? Shouldn't the only criterion for equivalence be the pk used to generate the address?

mplattner commented 4 years ago

To my understanding the required information to spend a P2SH output is the script that matches the hash that the P2SH output is locked to. Thus, a P2SH address does not need a public key to be spent.

I think the P2SH address is already treated as equivalent to the wrapped_address, according to the docs:

Script Equivalence - A Pay to script hash address and the address that is wrapped inside it can be considered equivalent addresses since they reflect the same piece of information.

I am not sure I fully understand your point, maybe @maltemoeser can help out at some point; but changing this behaviour (if at all) is not a priority at the moment.

crypto-perry commented 4 years ago

Yes I understand. I was not expecting this behaviour to change. I was just hoping to understand better the idea of equivalence as implemented by blocksci.

A P2SH output does need the pk in order to be spent, because otherwise the signature could not be checked. It is true that clients that have not upgraded (if any still exist) will be satisfied with just seeing the revealed script, but upgraded clients (basically all of them) also check the contents of the revealed script, hence a public key is needed.

From my testing a p2sh.revealed_tx is different than p2sh.wrapped_address.revealed_tx, so this means that p2sh and its wrapped address are not really equivalent... So if the quote you mention is indeed true how come the revealed_tx is different for p2sh and p2sh.wrapped_address?

Basically my point is that if you know a public key, you can also know all possible standard addresses generated from it such as: p2pkh, p2wpkh, p2sh-p2pkh, p2sh-p2wpkh. The first 2 you are already covering, I was wondering why not the last 2 as well...
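
The proposal can be sketched as follows, assuming only the standard script templates for these output types. This is an illustration, not something BlockSci does: `hash160`, `candidate_scripts` and `build_index` are hypothetical helper names, and the example key is a placeholder byte string rather than a real public key. Because some OpenSSL builds disable RIPEMD-160, the sketch falls back to a clearly labelled stand-in hash so that it still runs; in that case the resulting scripts are not real Bitcoin scripts.

```python
import hashlib
from typing import Dict


def hash160(data: bytes) -> bytes:
    """RIPEMD160(SHA256(data)), the hash used in P2PKH/P2WPKH/P2SH."""
    sha = hashlib.sha256(data).digest()
    try:
        return hashlib.new("ripemd160", sha).digest()
    except ValueError:
        # RIPEMD-160 unavailable in this OpenSSL build. Stand-in only:
        # NOT Bitcoin's hash, kept so the sketch remains runnable.
        return hashlib.sha256(b"ripemd160-stand-in" + sha).digest()[:20]


def p2sh_script(redeem_script: bytes) -> bytes:
    # OP_HASH160 <20-byte script hash> OP_EQUAL
    return b"\xa9\x14" + hash160(redeem_script) + b"\x87"


def candidate_scripts(pubkey: bytes) -> Dict[str, bytes]:
    """Output scripts derivable from a single public key."""
    h = hash160(pubkey)
    # OP_DUP OP_HASH160 <20-byte pubkey hash> OP_EQUALVERIFY OP_CHECKSIG
    p2pkh = b"\x76\xa9\x14" + h + b"\x88\xac"
    # Witness version 0 followed by a 20-byte program
    p2wpkh = b"\x00\x14" + h
    return {
        "p2pkh": p2pkh,
        "p2wpkh": p2wpkh,
        "p2sh-p2pkh": p2sh_script(p2pkh),
        "p2sh-p2wpkh": p2sh_script(p2wpkh),
    }


def build_index(pubkeys) -> Dict[bytes, bytes]:
    """Map every derivable output script back to its public key."""
    index: Dict[bytes, bytes] = {}
    for pk in pubkeys:
        for script in candidate_scripts(pk).values():
            index[script] = pk
    return index


pk_a = b"\x02" + b"\x11" * 32        # placeholder 33-byte "pubkey"
index = build_index([pk_a])
unspent_output_script = candidate_scripts(pk_a)["p2sh-p2wpkh"]
linked_pubkey = index.get(unspent_output_script)
```

Each known key adds four index entries here, so the size of such an index grows linearly with the number of keys seen.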

maltemoeser commented 4 years ago

> Basically whenever the parser sees a new pk it can generate all the possible address types including p2(w)sh-p2wpkh/p2wsh and maybe more wrapped-address types and index all of these. From my understanding you are already doing this generation from p2pk to p2pkh in order to identify unspent p2pkh as equivalent.

That would require quite a bit of excess storage and computation for the hundreds of millions of pubkeys that have never been used in a P2SH. It's certainly possible to do, but IMO better suited as a dedicated analysis rather than being baked into BlockSci.

I think the reason we did P2PK -> P2PKH was that there's no standard format for P2PK and most block explorers just showed the corresponding P2PKH address, which unfortunately has also led to some confusion in the past (https://github.com/citp/BlockSci/issues/253#issuecomment-499463178, https://github.com/citp/BlockSci/issues/322#issuecomment-530349742, https://github.com/citp/BlockSci/issues/187#issuecomment-433055124).


There are two types of equivalence; script equivalence is the one where the wrapped address is equivalent to the wrapping address.

> From my testing a p2sh.revealed_tx is different than p2sh.wrapped_address.revealed_tx, so this means that p2sh and its wrapped address are not really equivalent... So if the quote you mention is indeed true how come the revealed_tx is different for p2sh and p2sh.wrapped_address?

It can be different when the wrapped address had been used before, though I'd expect it to be the same generally. E.g.,

address = chain.address_from_string("3DoXJ8gutVmEC1UeiaSTASg6L2ZpU3aiS4")
address
> ScriptHashAddress(3DoXJ8gutVmEC1UeiaSTASg6L2ZpU3aiS4)
address.revealed_tx.index
> 253697029
address.wrapped_address
> MultisigAddress(2 of 3)
address.wrapped_address.revealed_tx.index
> 253697029
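
The same comparison can be wrapped in a small helper. This is a sketch under stated assumptions: `revealed_separately` is a hypothetical name, the only attributes it relies on are `wrapped_address` and `revealed_tx.index` as used in the session above, and it assumes `wrapped_address` is `None` while the P2SH output is unspent. The stub objects stand in for BlockSci address objects so that the snippet runs without a parsed chain; the differing index in the second stub is made up.

```python
from types import SimpleNamespace


def revealed_separately(address) -> bool:
    """True if the wrapped address was revealed in a different
    transaction than the wrapping P2SH address."""
    wrapped = address.wrapped_address
    if wrapped is None:
        # Assumption: nothing is known about the wrapped script yet.
        return False
    return address.revealed_tx.index != wrapped.revealed_tx.index


def stub(tx_index, wrapped=None):
    # Stand-in for an address object; not part of BlockSci.
    return SimpleNamespace(
        revealed_tx=SimpleNamespace(index=tx_index),
        wrapped_address=wrapped,
    )


# Mirrors the session above: both sides revealed in the same transaction.
same = stub(253697029, wrapped=stub(253697029))
# Hypothetical case: the wrapped address had been used earlier.
earlier = stub(253697029, wrapped=stub(1000))

same_result = revealed_separately(same)
earlier_result = revealed_separately(earlier)
```

With real data, the function would be called on the object returned by `chain.address_from_string(...)` instead of a stub.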

crypto-perry commented 4 years ago

I understand, thanks for all the clarifications! I guess the size would be a problem of course, but I was wondering if that is the only reason :)