ethereum / EIPs

The Ethereum Improvement Proposal repository
https://eips.ethereum.org/
Creative Commons Zero v1.0 Universal
12.89k stars 5.28k forks source link

Pattern matching through a precompiled contract. #125

Closed androlo closed 2 years ago

androlo commented 8 years ago

I would like to propose that pattern matching is added to Ethereum as a precompiled contract. The call would basically take a string and pattern as arguments (both in the form of strings), the pattern would be compiled (if needed), the matching done, and the result would be the return value(s). More specifics would have to be discussed, e.g. should there be additional arguments for max number of matches, index of each match, etc.

Motivation

Pattern matching are built in features in most programming languages. A precompiled contract would make it possible to do fast and safe input validation. This fits in well with the current focus on security and validation. I've actually been at this for a while, and even done some experimentation in Solidity with NFA etc. but I don't think it is plausible so I dropped it.

Pre-compiling expressions would of course not be possible, but this could still be a useful feature.

Gas

I don't know how metering would work, but perhaps it could be based on the length of the string, and maybe some sort of quick estimate of the complexity of the pattern. This would require some thought.

holiman commented 8 years ago

Regexps are huge attack surfaces. Lua-style regexps less so than Pcre, but still.. Also difficult to gas-meter on native implementation, since input size is almost not relevant as a measure of computational complexity.

Some info about Lua patterns

Unlike several other scripting languages, Lua does not use POSIX regular expressions (regexp) for pattern matching. The main reason for this is size: A typical implementation of POSIX regexp takes more than 4,000 lines of code. This is bigger than all Lua standard libraries together. In comparison, the implementation of pattern matching in Lua has less than 500 lines. Of course, the pattern matching in Lua cannot do all that a full POSIX implementation does. Nevertheless, pattern matching in Lua is a powerful tool and includes some features that are difficult to match with standard POSIX implementations.

I don't think PCRE regexp complexity can be calculated beforehand, at least not accurately enough to protect against malicious input explicitly crafted for denial-of-service.

androlo commented 8 years ago

Thanks.

I should be clear, i don't mean it has to be actual full-on regexp, but some form of pattern matching that can be used to process strings, to avoid having to use string parsing libraries written in Solidity. The cost of doing that is simply too high. If there is no regexp or regexp-like libraries where the time-complexity of compilation and matching can be calculated in a straight-forward way then it must of course use something simpler. I don't think that would be a problem at all.

I edited the description.

androlo commented 8 years ago

Will look into LUA patterns. That looks like maybe an alternative. Again, thanks!

wanderer commented 8 years ago

This should be implemented as a library in solidity not as a precompiled.

androlo commented 8 years ago

@wanderer Do you mean as actual Solidity code? I and others have tried, but even basic string ops like simple comparisons costs a lot of gas - several times the cost of the rest of the call in many cases. Anything that requires non trivial string parsing I think would be way over the top.

Thanks for pointing out. I will get some existing string processing library code and give some examples.

androlo commented 8 years ago

This is a super simple contract, derived from the stringutils library in dapp-bin. It does a simple equals check for each character in a 210 char string. The cost of execution is ~100k gas. For 32 chars it ends up being around 12k (both in latest browser solidity, can be checked). Cost can be lowered by using inline assembly, and in this simple equals check it can of course be optimized to check one word at a time, but if comparisons needs to be run for each char this is the simplest possible check. Also, assembly would of course make formal analysis very hard. With more advanced parsing it would be many times this cost, and even short strings would be very expensive.

contract StringUtils {

    /// @dev Compares two strings and returns true iff they are equal.
    function equal(bytes a, bytes b) constant internal returns (bool) {
        uint length = a.length;
        if (b.length != a.length) {
            return false;
        }
        for (uint i = 0; i < length; i ++) {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }

    function testMultiWordSuccess() constant returns (bool) {
        return equal(
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij",
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij"
        );
    }

}
gcolvin commented 8 years ago

Perhaps the real need is support for more efficient string processing in the EVM.

androlo commented 8 years ago

@gcolvin Perhaps.

This I wrote in the gitter chat, I feel like it's a good complement to the issue:

The crux is that string validation is incredibly expensive, which means people resort to their own non-standardized, hand made stuff that specialize on one particular thing, or simply avoid things that would involve pattern matching altogether.

A non trivial Solidity contract for pattern matching simply is not possible. It is essentially the same thing Vitalik is proposing for general elliptic curve arithmetic, and the argument is the same - technically you can do it in Solidity, but the gas costs are prohibitive, so a precompiled contract with some sort of useful system could be a solution (regardless of what system is actually used).

This is not really a theoretical issue, but a purely practical one.

gcolvin commented 8 years ago

Right @androlo Do you have an idea, practically, of what EVM bytecodes we would need to make things like this efficient? Vague memories of the 8086 string operators come to mind. And the difficulties of byte-processing on word-addressed architectures like the PDP-7 and Alpha. And more general problem of dealing efficiently with arrays of numbers that are narrower than 256 bits.

androlo commented 8 years ago

I don't know. I can't comment on the pros and cons of additional EVM components, which is why I put a precompiled as the alternative. I'm basically just a Solidity coder that identified a potential issue, hoping that people who knows better can help get it right.

holiman commented 8 years ago

I think a better approach might be to point to a few usecases where this would be needed. That way, we could focus on narrower issues:

a) Alternative workarounds (off-chain processing) b) Isolating minimum required string processing functionality

I think it's easier to discuss a concrete problem, and not just a feature "because reasons".

gcolvin commented 8 years ago

@androlo can you point me at Vitalik's elliptic-curve proposal? I don't see it in the EIP issues.

@hollman I think the simple string comparison example is an obvious use case. I'd love to see more.

holiman commented 8 years ago

Ok, but for string comparison, there's also the possibility to compare the SHA3-hashes. The cost of that also scales with input size, but if it's only comparison for it's own sake, it should fit the bill, no?

contract StringUtils {

    /// @dev Compares two strings and returns true iff they are equal.
    function equal(bytes a, bytes b) constant internal returns (bool) {
        uint length = a.length;
        if (b.length != a.length) {
            return false;
        }
        for (uint i = 0; i < length; i ++) {
            if (a[i] != b[i])
                return false;
        }
        return true;
    }

    function eq2(bytes a, bytes b) constant internal returns(bool){
        bytes32 sa = sha3(a);
        bytes32 sb = sha3(b);
        return sa == sb;

    }

    function testMultiWordSuccess() constant returns (bool) {
        return equal(
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij",
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij"
        );
    }
    function testMultiWordSuccessSha() constant returns (bool) {
        return eq2(
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij",
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij"
        );
    }
    function testMultiWordFailSha() constant returns (bool) {
        return eq2(
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghij",
            "abcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghijabcdefghix"
        );
    }

}

testMultiWordSuccess: execution cost 49703 gas testMultiWordSuccessSha: execution cost 1027 gas testMultiWordFailSha: execution cost 1049 gas

So I'm wondering about the underlying motives, since I'm guessing it's more than just wanting to check string equality behind this EIP?

androlo commented 8 years ago

Yes you can hash them, but yeah it's about more then comparison.

Let's say for example you have an oracle, expecting input data to be on a specific format. You have your off-chain processing, but that is of course only good if people actually use your dapp rather then transacting to the contract directly, in which case the data could be on any given format. The parser might be vulnurable to certain strings, and you want to check for certain patterns to avoid that. I think this isn't about special cases, just the usual issues. Wanting to check strings is not weird or controversial, is it?

I also know that strings is usually not the best way of doing things; in fact i never use them myself, since one or more bytes32 is usually better and a lot cheaper. I don't think people will necessarily work that way though, given how things has been up to now.

I'm also not dead set on this, there's probably more important, more immediate things that needs attention, but still.

Edit: TL;DR - I think people gonna do lots of string stuff regardless of what people think, and they will get rekt out there if there are no safeguards.

androlo commented 8 years ago

@gcolvin https://github.com/ethereum/EIPs/issues/102

gcolvin commented 8 years ago

Thanks Andreas.

androlo commented 8 years ago

Looks like there are pathological patterns in that LUA script as well, though I haven't found a good reference. Guessing maybe some simple matching with wildcards etc. would be enough in most cases. Something at the level of those search patterns that are used when querying webservers. Perhaps it can even be made to run gas-efficiently in Solidity.

Gonna leave the issue open for now.

subtly commented 8 years ago

Agree with @wanderer

Remember... all the code is on the chain, can be loaded, and then later be "promoted" within the solidity ecosystem. One of the best paths forward is to try and build & test these solutions with what we have now, and then figure out how to promote and optimise after the functionality is refined. Think of it like cryptography – ethereum code is only as strong as the weakest link and I think we can build code that is as strong as the cryptography that backs its state.

gnidan commented 7 years ago

Shameless plug: I've implemented a compiler for converting regular expressions into Solidity (https://github.com/gnidan/solregex)

Not a pre-compiled contract, but hopefully a useful solution for getting on-chain pattern matching.

github-actions[bot] commented 2 years ago

There has been no activity on this issue for two months. It will be closed in a week if no further activity occurs. If you would like to move this EIP forward, please respond to any outstanding feedback or add a comment indicating that you have addressed all required feedback and are ready for a review.

github-actions[bot] commented 2 years ago

This issue was closed due to inactivity. If you are still pursuing it, feel free to reopen it and respond to any feedback or request a review in a comment.