Handling multiple files and entry points in Python

Context

Here is an example of how a sign function is called in Python with the cryptography library (elliptic curve case):

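from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
private_key = ec.generate_private_key(ec.SECP384R1())
sig = private_key.sign(digest, ec.ECDSA(hashes.SHA256()))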

And here is how a verify function is called:

public_key = private_key.public_key()
public_key.verify(signature, data, ec.ECDSA(hashes.SHA256()))

We see that some crypto information is attached to the private key generation (here, the chosen curve), while other information, more closely related to the sign/verify algorithm, is attached to the sign/verify function call (here, the signature algorithm and its hash function).

For verify, we usually do not generate the private key ourselves; we receive the public key from someone else. In that case, we cannot use generate_private_key as the entry point of a rule identifying verify, and we may lack the curve information entirely.

General problem: Because crypto information is spread across several function calls, which may be located in different files, it is difficult to link them and aggregate all the information in one place. In some cases, it may even be impossible to obtain all the information.

Connecting multiple files

Currently, the SonarQube Python Plugin seems to create an AST for each Python file, without linking symbols that were imported from another file. For now, assume that generate_private_key is the only entry point of our sign detection rule, and that this rule contains a depending detection rule that detects the sign function.

Obtaining argument values

# file1.py
from file2 import custom_sign1
hash = hashes.SHA256()
sig = custom_sign1(digest, hash)

# file2.py
def custom_sign1(digest, hash):
    private_key = ec.generate_private_key(ec.SECP384R1())
    return private_key.sign(digest, ec.ECDSA(hash))

In this example, our rule correctly detects both generate_private_key and sign in file2.py. It can retrieve all crypto information easily, except for the hash variable, which is a parameter of the enclosing wrapper function custom_sign1. This case is already handled within a single file: we look for all the calls to custom_sign1 in the file, and when we find one, we resolve its hash argument.

Problem 1: Here, however, custom_sign1 is called from a different file, file1.py, which we do not see when we traverse the AST of file2.py to search for function calls. In this case, the value of hash is not resolved.
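Problem 1 can be reproduced with a small sketch. It uses Python's standard ast module as a stand-in for the plugin's per-file AST (an assumption, as the plugin has its own tree model), and the helpers call_sites and resolve_name are hypothetical names introduced only for this illustration. It requires Python 3.9 or later for ast.unparse.

```python
import ast

FILE1 = '''
from file2 import custom_sign1
hash = hashes.SHA256()
sig = custom_sign1(digest, hash)
'''

FILE2 = '''
def custom_sign1(digest, hash):
    private_key = ec.generate_private_key(ec.SECP384R1())
    return private_key.sign(digest, ec.ECDSA(hash))
'''


def call_sites(tree, func_name):
    """Return every call to the bare name `func_name` found in one AST."""
    return [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    ]


def resolve_name(tree, name):
    """Return the source text of the last module-level assignment to `name`."""
    value = None
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    value = ast.unparse(stmt.value)
    return value


# Hypothetical project view: one AST per file, as in the current scanning model.
trees = {"file1": ast.parse(FILE1), "file2": ast.parse(FILE2)}

# Per-file view: file2.py contains no call to custom_sign1, so nothing to resolve.
local_calls = call_sites(trees["file2"], "custom_sign1")

# Project-wide view: search every file for call sites and resolve the argument there.
resolved = []
for module, tree in trees.items():
    for call in call_sites(tree, "custom_sign1"):
        arg = call.args[1]  # second positional argument is `hash`
        if isinstance(arg, ast.Name):
            resolved.append((module, resolve_name(tree, arg.id)))
```

With only the AST of file2.py, local_calls is empty. Once the call sites of every file are searched, the hash argument is resolved from file1.py. The sketch only handles positional arguments bound to module-level names, which is a deliberate simplification.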

Depending rule in a subfunction

# file1.py
from file2 import custom_sign2
private_key = ec.generate_private_key(ec.SECP384R1())
sig = custom_sign2(private_key, digest)

# file2.py
def custom_sign2(private_key, digest):
    return private_key.sign(digest, ec.ECDSA(hashes.SHA256()))

In this example, our rule detects the entry point generate_private_key in file1.py. The depending detection rule then looks for a sign function call, but because the AST does not contain information about the content of file2.py, it cannot find that call or resolve its associated crypto values. Note that this problem would probably also occur if the calls to generate_private_key and sign were inside different functions in the same file.

Problem 2: The scope in which a depending detection rule looks for a match is too limited, as it does not look into the content of other called functions, whether these functions are imported from another file or not.

This example could be even more complicated: we could imagine a codebase where all crypto calls would be inside custom wrappers, so here both generate_private_key and sign would be in different functions in different files, but would still need to be linked somehow.
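As a rough sketch of what a wider search scope could look like, the example below follows calls into the bodies of known functions while looking for a sign call. It relies on Python's standard ast module rather than the plugin's own tree (an assumption), and walk_scope, index_functions and find_method_calls are hypothetical helpers. It requires Python 3.9 or later for ast.unparse.

```python
import ast

FILE1 = '''
from file2 import custom_sign2
private_key = ec.generate_private_key(ec.SECP384R1())
sig = custom_sign2(private_key, digest)
'''

FILE2 = '''
def custom_sign2(private_key, digest):
    return private_key.sign(digest, ec.ECDSA(hashes.SHA256()))
'''

SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda)


def walk_scope(root):
    """Yield the nodes under `root` without entering nested function bodies."""
    stack = list(ast.iter_child_nodes(root))
    while stack:
        node = stack.pop()
        yield node
        if not isinstance(node, SCOPES):
            stack.extend(ast.iter_child_nodes(node))


def index_functions(trees):
    """Map every top-level function name to its definition, across all files."""
    return {
        stmt.name: stmt
        for tree in trees
        for stmt in tree.body
        if isinstance(stmt, ast.FunctionDef)
    }


def find_method_calls(root, method, functions, visited=None):
    """Find `<obj>.method(...)` calls, following calls into known functions."""
    visited = set() if visited is None else visited
    found = []
    for node in walk_scope(root):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == method:
            found.append(node)
        elif (isinstance(func, ast.Name) and func.id in functions
              and func.id not in visited):
            visited.add(func.id)  # guards against recursive wrappers
            found.extend(
                find_method_calls(functions[func.id], method, functions, visited)
            )
    return found


tree1, tree2 = ast.parse(FILE1), ast.parse(FILE2)

# Current behaviour: no knowledge of other functions, so sign is not found.
local_only = find_method_calls(tree1, "sign", {})

# Wider scope: the search descends into custom_sign2, defined in file2.py.
linked = find_method_calls(tree1, "sign", index_functions([tree1, tree2]))
```

Here local_only is empty, while linked contains the sign call from file2.py. The sketch matches on the method name only and indexes functions by bare name, so it ignores receiver types, import aliases and name collisions between files; a real implementation would need proper symbol resolution.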

Importing an external value

# file1.py
from file2 import crypto_dict
private_key = ec.generate_private_key(crypto_dict['intermediate'])

# file2.py
crypto_dict = {'beginner': ec.SECP384R1(),
               'intermediate': ec.BrainpoolP256R1(),
               'advanced': ec.SECT233K1()}

In this example, we only look at the detection of generate_private_key. Our rule will identify the function call, and will then try to resolve the value of the argument to obtain the curve.

Problem 3: The AST of file1.py contains no information about the content of crypto_dict, so we cannot resolve the value of the curve. Currently, we only resolve the value of the string index, in this case 'intermediate'.
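The following sketch shows how the curve could be resolved if the ASTs of both files were available together. It uses Python's standard ast module (an assumption, not the plugin's tree model), handles only the simple case of a literal dict indexed by a literal key, and all helper names are hypothetical. It requires Python 3.9 or later.

```python
import ast

FILE1 = '''
from file2 import crypto_dict
private_key = ec.generate_private_key(crypto_dict['intermediate'])
'''

FILE2 = '''
crypto_dict = {'beginner': ec.SECP384R1(),
               'intermediate': ec.BrainpoolP256R1(),
               'advanced': ec.SECT233K1()}
'''

# Hypothetical project view: module name -> parsed AST.
modules = {"file1": ast.parse(FILE1), "file2": ast.parse(FILE2)}


def imported_from(tree, name):
    """Return (module, original_name) if `name` comes from a `from ... import`."""
    for stmt in tree.body:
        if isinstance(stmt, ast.ImportFrom):
            for alias in stmt.names:
                if (alias.asname or alias.name) == name:
                    return stmt.module, alias.name
    return None


def module_assignment(tree, name):
    """Return the value node of the first module-level assignment to `name`."""
    for stmt in tree.body:
        if isinstance(stmt, ast.Assign):
            if any(isinstance(t, ast.Name) and t.id == name for t in stmt.targets):
                return stmt.value
    return None


def resolve_subscript(node, current):
    """Resolve `name['key']` to the source text of the matching dict value."""
    if not (isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Name)
            and isinstance(node.slice, ast.Constant)):
        return None
    name, key = node.value.id, node.slice.value
    origin = imported_from(modules[current], name)
    module, original = origin if origin else (current, name)
    if module not in modules:
        return None  # definition lives outside the scanned project
    value = module_assignment(modules[module], original)
    if isinstance(value, ast.Dict):
        for k, v in zip(value.keys, value.values):
            if isinstance(k, ast.Constant) and k.value == key:
                return ast.unparse(v)
    return None


call = modules["file1"].body[1].value  # the ec.generate_private_key(...) call
curve = resolve_subscript(call.args[0], "file1")
```

In this sketch, curve ends up as the text ec.BrainpoolP256R1() instead of the bare index 'intermediate'. Dictionaries built dynamically, or keys computed at run time, would still be out of reach.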

When even connecting files is not sufficient

Let's suppose now that we have successfully "connected" all of our files, so that the three problems above are solved. Let's go back to our initial verify example:

public_key.verify(signature, data, ec.ECDSA(hashes.SHA256()))

If this public_key is generated by our code, like below, then we may manage to resolve everything.

private_key = ec.generate_private_key(ec.SECP384R1())
public_key = private_key.public_key()

However, as explained previously, this is probably rare: when verifying a signature or encrypting a message with a public key, we mostly expect the public key to come from another protocol participant. The generation of the private and public keys will therefore probably be unreachable by static analysis, since a communication phase separates the key generation (by Alice) from the signature verification or public key encryption (by Bob).

In this case, we can abandon the idea of obtaining information linked to the key generation (like the curve) when detecting the verify function. However, we would still want to obtain information linked to the verify function call (like the signature algorithm and hash function), which we currently do not get as we didn't detect the entry point generate_private_key.

Problem 4: Because verify is not an entry point (it is only a depending detection rule that is applied upon a detection of generate_private_key), we currently cannot resolve any crypto information from the verify function call. The problem is more complex than just making verify an entry point rule. Indeed, we do not know in advance whether we will detect a generate_private_key entry point or not, which leads to a problem in every configuration. The options below are described for sign; the same reasoning applies to verify:

Having sign as a depending detection rule of generate_private_key: case described above.

Having sign only as an entry point: we will correctly identify the crypto information related to the sign function call, but we will never get information related to the key generation (like the curve).

Having sign as an entry point and as a depending detection rule of generate_private_key: we will get all the crypto information, but when generate_private_key detects something, the sign rule will be applied twice and we will get duplicate results.

Having sign as an entry point with generate_private_key as a depending detection rule: this would solve our problem if these were the only rules, but we have several other cases where we need generate_private_key to be an entry point. Then, if sign detects something, generate_private_key will be applied twice (once as a depending detection rule of sign and once as the entry point for the other rules).
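To make the duplication in the third option concrete, here is a minimal sketch in which the same sign call site is reported twice, once with the key generation context and once without. The Location and Finding classes and the merge_findings helper are hypothetical and do not reflect the plugin's actual data model. Merging by location is only one possible direction, shown here as an assumption rather than a proposed design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Location:
    file: str
    line: int
    column: int


@dataclass
class Finding:
    location: Location
    rule: str
    details: dict = field(default_factory=dict)


def merge_findings(findings):
    """Keep one finding per (rule, location), merging the details of duplicates."""
    merged = {}
    for finding in findings:
        key = (finding.rule, finding.location)
        if key in merged:
            merged[key].details.update(finding.details)
        else:
            merged[key] = Finding(finding.location, finding.rule, dict(finding.details))
    return list(merged.values())


sign_site = Location("file1.py", 8, 6)  # made-up position of the sign call
findings = [
    # sign matched as a depending rule of generate_private_key: curve is known
    Finding(sign_site, "sign",
            {"curve": "SECP384R1", "algorithm": "ECDSA", "hash": "SHA256"}),
    # the same call matched again with sign acting as an entry point
    Finding(sign_site, "sign", {"algorithm": "ECDSA", "hash": "SHA256"}),
]

result = merge_findings(findings)
```

After merging, result holds a single finding that keeps the curve. When no generate_private_key is detected, only the entry point finding exists and it passes through unchanged, which is the behaviour wanted for the verify case.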

Draft ideas for improvement

- Make scanning connect multiple files instead of scanning each file independently (to create an AST bigger than just one file) [helps with problems 1, 2, 3]
  - Possibility 1: Work at the SonarQube Python Plugin level to change the scanning process (probably long and hard work)
  - Possibility 2: Find an "easy" fix that does not require changing the scanning process, such as a pre-processing step (merging several files into one?) or a post-processing step (aggregating and linking all ASTs?)
- Improve the way we look for depending detection rules so that it looks into the content of called functions and more [helps with problem 2]
- Revamp how depending detection rules and entry points work [solves problem 4]

IBM / sonar-cryptography