ethereum / sourcify

Decentralized Solidity contract source code verification service
https://sourcify.dev
MIT License
781 stars 395 forks source link

Don't replace "@" when sanitizing paths #515

Closed FabijanC closed 1 year ago

FabijanC commented 3 years ago

View in Huly HI-438

kuzdogan commented 3 years ago

Was there a particular reason we were replacing '@openzeppelin' with '_openzeppelin' when saving in our repo, like the fs does not allow names with '@'?

@ligi @chriseth

Also discussed in https://github.com/sourcifyeth/org/issues/63

chriseth commented 3 years ago

No, I guess we just did not consider @ to be so popular.

tom2drum commented 1 year ago

Hello, guys!

So, if you do need this sanitation for any reason, can you also make a translation map file as it was proposed here?

kuzdogan commented 1 year ago

@tom2drum Hi, thanks for bringing this up.

Yes we will add a translation map if we do any changes to the metadata, as this would break the metadata hash.

We will handle this and the related issues on the next Milestone https://github.com/ethereum/sourcify/milestone/8

rimrakhimov commented 1 year ago

Hi, @kuzdogan! I would like to notice that special characters replacement may cause problems even if translation map is provided. You can see it in the following example: https://repo.sourcify.dev/contracts/full_match/5/0x027f1fe8BbC2a7E9fE97868E82c6Ec6939086c52/. Here, the two files have the same name (ExternalTestMultiple.sol) but are located in different directories (one of which is project: and another is project_). After sanitizing those two directories become the same, and as a result only one file is stored.

As an example where both contracts are available you can see into: https://repo.sourcify.dev/contracts/full_match/5/0xcA30DBB0Cb79cef0194f6e212B49fa1EB7246b61/

kuzdogan commented 1 year ago

Hi Rim.

Thanks for bringing this up. This also breaks the IPFS hash of the metadata, even though the contract is marked as a full_match. Here, normally the metadata must load as we pin it: https://playground.sourcify.dev/?address=0x027f1fe8BbC2a7E9fE97868E82c6Ec6939086c52&chainId=5

We were waiting for couple more things to do a full review of the repository but we should prioritize this @marcocastignoli

kuzdogan commented 1 year ago

I thought we were modifying the metadata.json file which would change the file's hash but apparently, we are not doing that but only changing the folder names in the filesystem. For some reason the examples provided can't be fetched from IPFS, for which I created another issue.

However, as others said, the sanitization breaks the mapping "path in metadata.json" --> "actual path" and without a translation file there's no way how the files are renamed/moved.

1) We need to minimize the set of characters we are "sanitizing" such as @ : etc. which are safe to use in file paths 2) Figure out how Solidity VFS outputs the file paths (considering older versions too). I suppose it does some sort of sanitization on paths which would make things easier for us. 3) Regardless if the compiler does sanitization or not, we still need to sanitize as otherwise it would allow arbitrary access to our filesystem when someone provides a malicious modified metadata.json to us. Ideally, there should be a library or utility to handle this problem in Node.js.

An alternative @marcocastignoli mentioned is to urlencode'ing the path but I believe that is a little too restrictive as the URLencoding does not allow many characters such as @ or _.

But let's not wait for fixing : and @. We already procrastinated on this for too long.

kuzdogan commented 1 year ago

@rimrakhimov Curious, was this an example you came up with or something you observed in practice?

rimrakhimov commented 1 year ago

That was an artificial example I come up with when researching the ways to handle such mapping ourselves via looking for the valid file path inside the metadata, as described in https://github.com/blockscout/blockscout/issues/7648

Just got an idea that it may not always be one-to-one correspondence and came up with the example

marcocastignoli commented 1 year ago

by url encoding the path I meant:

kuzdogan commented 1 year ago

URLEncoding is fine as an example but not a solution. We should not change the file names as much as possible.

Another implication of this, I realized, is we currently don't allow non-ASCII filenames, they are also "sanitized". For Unix it should be fine with any characters and there should be no need to "sanitize" special characters. Windows is another issue..

I just wonder if the compiler allows relative paths in the metadata, something like ../../../../sources/. If that is the case, we need to indeed do some path translation and save that information.

marcocastignoli commented 1 year ago

this is how solidity handles ..: https://docs.soliditylang.org/en/latest/path-resolution.html#id12

kuzdogan commented 1 year ago

Here are some weird "source unit name"s (as Solidity docs refers to) I found with a simple repo scan:

Spirale�~^�~O~Q/spirale.sol
myToken..sol
browser/^]BotContract.sol
PunchOut!.sol
/Users/liuxian/data/学�| /以太�~]~J/MintCoin/contracts/ERC20/ERC20FixedSupply.sol
MyTokenTotoro!.sol
Bonny Finance..sol
../App/contract/ERC20/Cham.sol
contracts/Walrus, Inc..sol
browser/Advance Bot..sol
BabyNami..sol
browser/token'test.sol
MyTokenTotoro!.sol
browser/multi_pool..sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/IOSBtoken/IOSBtoken.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/access/AccessControl.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/math/SafeMath.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20Burnable.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20Pausable.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/IERC20.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Address.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Context.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/EnumerableSet.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Pausable.sol
MetaGold..sol
browser/Advance Bot..sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.solSpirale�~^�~O~Q/spirale.sol
myToken..sol
browser/^]BotContract.sol
PunchOut!.sol
/Users/liuxian/data/学�| /以太�~]~J/MintCoin/contracts/ERC20/ERC20FixedSupply.sol
MyTokenTotoro!.sol
Bonny Finance..sol
../App/contract/ERC20/Cham.sol
contracts/Walrus, Inc..sol
browser/Advance Bot..sol
BabyNami..sol
browser/token'test.sol
MyTokenTotoro!.sol
browser/multi_pool..sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/IOSBtoken/IOSBtoken.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/access/AccessControl.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/math/SafeMath.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20Burnable.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/ERC20Pausable.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/token/ERC20/IERC20.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Address.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Context.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/EnumerableSet.sol
localhost/../Users/pc/workspace/remixsrc/openzeppelin-contracts/IOSB/utils/Pausable.sol
MetaGold..sol
browser/Advance Bot..sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
1cc0630317fe7f52d6d3a41dbdc596d9/contracts...vault.sol
contracts/Storage..sol
SpaceSoldiers..sol
contracts/Storage..sol
browser/hopdong\.sol
contracts/insurance_smart_contract..sol
contracts/suve/Newcontract2..sol
https://github.com/OpenZeppelin/openzeppelin-contracts/blob/master/contracts/token/ERC20/../../utils/Context.sol
browser/stakingTest \ .sol
inbox..sol
contracts/suve/Newcontract2..sol
browser/hopdong\.sol
contracts/Storage..sol
browser/temp..sol
inbox..sol
https://github.com/OpenZeppelin/openzeppelin-contracts/blob/master/contracts/token/ERC20/../../utils/Context.sol
browser/^]BotContract.sol
../../../Desktop/sushiswap/contracts/SushiToken.sol
contracts/insurance_smart_contract..sol
YEN..sol
YEN..sol
inbox..sol
contracts/NFT it land].sol
omg..sol
omg..sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Address.sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Context.sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Strings.sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Strings.sol
contracts/sample..sol
browser/GSNtest..sol
src\Contract.sol
auth..sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Address.sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Context.sol
https://raw.githubusercontent.com/OpenZeppelin/openzeppelin-contracts/master/contracts/token/ERC721/../../utils/Strings.sol
browser/^HGenToken.sol
omg..sol
newTestContract..sol
ebcf3cce4e6e78f112385f8f5fc2a7ca/contracts...Origin.sol
contracts/sample..sol
browser/GSNtest..sol
contracts/ParrtyBear Nft Collectibles..sol
contracts/old_monk..sol
../App/contract/ERC20/Cham.sol
contracts/sample..sol
localhost/../../../../Volumes/Untitled - Data/Users/james/dev/komodo-notarisation-solidity/contracts/VerusNotarizer/VerusNotarizer.sol
newTestContract..sol
omg..sol
browser/^HGenToken.sol
browser/^HGenToken.sol
omg..sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
https://github.com/cryptonaut420/crypto-corgis-breeder/blob/main/contracts/interfaces/FactoryERC1155.sol?1
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
contracts/TareaCarga..sol
A.TEST21/Avaterra=scam.sol
project:/contracts/a`~!@#$%^&*()-=_+[]{}|\;:'",<>?ÿø±ö«»¿ð�~K�~X��~_~X~@.sol
project:/contracts/Extern@!Te$tW@(ky.sol
YEN..sol
contracts/CrowdFunding..sol
test..sol
project:/contracts/a`~!@#$%^&*()-=_+[]{}|\;:'",<>?.sol
/D/workspace/�~P~H约/truffle_test/contracts/AFOX.sol
TITO'S.sol
/D/workspace/�~P~H约/truffle_test/contracts/AFOX.sol
contracts\RTS1155.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
!6.sol
!6.sol
/Users/fatel/Desktop/defi项�~[�/unilever-contracts/contracts/gov/GovernorAlpha.sol
$.{37}|2{40}|cantbematchedcharacters__

These are full matches so these are the exact outputs of the compiler.

Apparently, the compiler does resolve the paths when files are supplied over the CLI but the source unit names stay as they are with the standard JSON. So we need to assume it can be arbitrary strings.

This again shows another downside of us having the contract repo directly as a filesystem. If we had a DB we could easily give source unit names as identifiers to the files. But yeah, we need to look into how to provide files over IPFS with a DB.

I guess we can: 1) path.resolve or strip remove the ../ parts from the source unit name and create a translation file 2) Don't sanitize any other characters

The problem with the normalization is it does not ultimately remove the leading ../s e.g.

path.normalize("localhost/../../../../Volumes/Untitled - Data/Users/james/dev/komodo-notarisation-solidity/contracts/VerusNotarizer/VerusNotarizer.sol")

outputs

'../../../Volumes/Untitled - Data/Users/james/dev/komodo-notarisation-solidity/contracts/VerusNotarizer/VerusNotarizer.sol'

so I guess we need to do both

cameel commented 1 year ago

"source unit name"s (as Solidity docs refers to)

It's called like this because what user sees as a "path" is not really a path in a general case. It's a unique identifier assigned to a particular source unit by the compiler. It comes from SourceUnit, which is always the root node of the AST produced from a given source file.

the source unit names stay as they are with the standard JSON. So we need to assume it can be arbitrary strings.

Yes. Source unit names can be arbitrary byte sequences. Normally it's going to be something more or less resembling a path, but you can't rely on that. It can have special characters, Unicode characters, invalid encodings of Unicode characters, invisible control characters. It can even be an empty string. It can be a path that is fine on some system but not another. When Unicode is used, it can be in any encoding, not necessarily UTF-8. It can be an URL (if the tool supports it, like Remix) or have a protocol-like prefix (e.g. Truffle prefixes files stored under contracts/ with project:).

Here are some weird "source unit name"s (as Solidity docs refers to) I found with a simple repo scan:

If you use Standard JSON, you can directly specify source unit names. They can be anything you want and are not normalized or modified in any way. These names do not have to have any relation to the filesystem whatsoever. The compiler does not force you to make them look like paths, but if you do that, you can then rely on the compiler resolving relative imports containing ./ and ../ segments in a reasonable way (that's meant to work even with stuff like URLs or Truffle's prefixed paths).

The complications start when you need to feed the compiler contracts already stored in a filesystem, and that happens in several situations. One of them is giving paths on the CLI - then the compiler tries to assign source unit names for you based on actual paths, in a way that will give you the same source unit name for the same physical file on disk, and will stay the same across different platforms. That source unit name also has to match the way a user would refer to that file in an import. To do that the compiler does transform the paths given on the CLI. It tries to make them relative to your working dir (or base path) and not include anything system specific, like drive letters on Windows, names of UNC shares or the name of your home directory. This is annoyingly complicated and never foolproof - there are situations where you simply cannot do that - e.g. if you give it files stored on two different Windows drives.

It's also only done since 0.8.8, where we made that normalization more regular, actually compatible with --allow-paths and remappings, and fixed lots of corner cases that were dependent on the platform or version of the Boost library the compiler is built with. Before then it was a mess. On 0.8.8+ for contracts compiled on the CLI you should almost always get pretty well behaved paths: relative, without drive letters, without unnormalized ./ or ../ segments, without consecutive slashes and actually usable as paths. They may still contain unsafe filesystem chars, Unicode chars (in encoding dependent on your system) or anything else that may appear in an actual path though.

Now, contracts compiled by frameworks and tools are a different matter. They pretty much always store the contract on an actual filesystem (Truffle, Hardhat, Foundry) or in a database in a way that still uses paths to identify files (Remix). The problem is that they have their own rules of now to put that into Standard JSON. In that case translating paths to source unit names is up to the tool and rules may be completely different from compiler's. For example some frameworks may use absolute paths (which is discouraged), others relative paths, yet others may use a prefix like Truffle. Some may sanitize them, some may not. Some may resolve symlinks, some may not. Some may only allow things from the filesystem, others may allow importing from URLs, IPFS hashes, or even source files generated on the fly (e.g. interfaces automatically generated from ABI JSON). It does not create any issues as long as it's kept inside Standard JSON, but if you want to translate it back to how files where stored originally, that's tough, differs between tools and in some cases may be even impossible/ambiguous.

This again shows another downside of us having the contract repo directly as a filesystem. If we had a DB we could easily give source unit names as identifiers to the files. But yeah, we need to look into how to provide files over IPFS with a DB.

I think that if you want to handle this in full generality, without risking that you won't be able to compile something due to weird paths, the only sane solution is to have some abstraction layer between names used in Standard JSON and names under which you store things in the filesystem. URL-encoding would be one way to do this, the main downside being that in your URLs the source unit names will stop looking like paths. You might want to come up with some slight variation to get more sensible URLs in simple cases. E.g. it would be more readable if slashes were not encoded.

kuzdogan commented 1 year ago

This was a difficult tradeoff but the final PR https://github.com/ethereum/sourcify/pull/1132 does the following:

The tradeoff with not URI encoding is the files with source unit names containing non-URI compatible characters (whitespace, non-ascii etc.) or that have reserved URI characters (#) will not be accessible directly through static serving at the endpoint /repository/contracts/full_match/11155111/0x358e70D1d7edc4287B2d083a2ceB3EA95E363A58/{sourceUnitName}.

I've opted for this because:

  1. I can see from the usage that the preferred way to fetch contract files is not this endpoint but rather /files/{chain}/{address}.
  2. The range of non-URI characters is quite large. That'd mean a really large set of contracts' paths need to be translated. Anything with whitespace, -, _ etc. Having path translations too often does feel complicated.

The other small downside is that file paths are not Windows compatible.

Ultimately, we've learned that we can't assume source unit names to. be paths and a full solution would be having an abstraction layer. This also feels like we need to go in having a DB direction and maybe separately serve the files on IPFS.