YinHuiLin opened this issue 3 years ago
Hi there,
Thanks for your comments.
I used a Python script to extract C functions from a C file. The script is available at: https://github.com/DanielLin1986/function_representation_learning/blob/master/Code/ExtractCFunctionByName_v2.py. Note that the script is buggy, and it requires you to know the name of the function to be extracted in advance. You can use a static analysis tool to obtain the names of all functions in a C file, then pass those names to the script to extract the function bodies. Because the script is buggy, the extracted function code may have issues (e.g., an extra "}" or a missing return type), so further post-processing is needed.
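If it helps, here is a minimal sketch of that first step (collecting the function names), assuming Universal (or Exuberant) Ctags is installed and on your PATH; the file name example.c is just a placeholder:

import subprocess

def list_c_function_names(c_file):
    """List the names of functions defined in a C file using ctags.

    '-x' prints a human-readable cross-reference (one line per tag);
    '--c-kinds=f' restricts the output to function definitions.
    """
    output = subprocess.check_output(
        ["ctags", "-x", "--c-kinds=f", c_file], text=True
    )
    # Each line looks like: "<name> function <line> <file> <source text>"
    return [line.split()[0] for line in output.splitlines() if line.strip()]

# Hypothetical usage: feed each name to ExtractCFunctionByName_v2.py
for name in list_c_function_names("example.c"):
    print(name)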
Obtaining representations from a hidden layer of a neural network can be achieved with the following code (tested on TensorFlow 1.14 and 1.15 with Keras 2.2.4):
import numpy as np
from keras.models import Model

def ObtainRepresentations_by_batch_size(input_sequences, layer_number, model, BATCH_SIZE):
    data_size = len(input_sequences)
    num_batches_per_epoch = int((data_size - 1) / BATCH_SIZE) + 1
    representations_total = []
    # Build the truncated model once, rather than re-creating it for every batch.
    layered_model = Model(inputs=model.input, outputs=model.layers[layer_number].output)
    for batch_num in range(num_batches_per_epoch):
        start_index = batch_num * BATCH_SIZE
        end_index = min((batch_num + 1) * BATCH_SIZE, data_size)
        print("Batch {}: rows {} to {}".format(batch_num, start_index, end_index))
        representations = layered_model.predict(input_sequences[start_index:end_index])
        representations_total = representations_total + representations.tolist()
    return np.asarray(representations_total)
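As a hypothetical usage example, assuming model is a trained Keras model and x_test is the padded input array (both names are placeholders), you could extract the output of layer 2 like this:

hidden = ObtainRepresentations_by_batch_size(x_test, 2, model, 64)
print(hidden.shape)  # (num_samples, hidden_dim)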
Thank you, that is very kind of you! I am now reproducing part of the work introduced in your paper, and I want to compare my results to yours. Could you explain in more detail how to calculate top-k precision and top-k recall, for a better comparison? I am new to this and do not know how to calculate these metrics. I hope to hear from you soon.
Hi there! You are welcome.
Recall that we use "1" to refer to vulnerable and "0" to non-vulnerable.
Based on the code, the results you get are a list of probabilities. If you place the label column alongside them, you will see something like the following example (k = 4):
ID       probs    labels
func_1   0.0011   0
func_2   0.8917   0
func_3   0.2513   0
CVE      0.9584   1
...      ...      ...
Then import the above data into Excel and sort the list by "probs" from largest to smallest. You will get:
ID       probs    labels
CVE      0.9584   1
func_2   0.8917   0
func_3   0.2513   0
func_1   0.0011   0
...      ...      ...
Next, take the top k entries (in this example, top-4, since there are 4 rows). From the labels we can see that the only actually vulnerable function is "CVE", so within the top 4 only "CVE" is correctly identified as vulnerable, while func_2 is wrongly flagged (its probability of 0.8917 is very close to 1). The top-4 precision is therefore 1/4: 4 samples are returned, and 1 of them is actually vulnerable.
Suppose there are 10 vulnerable functions in total; the top-4 recall is then 1/10, meaning that returning 4 functions identifies only 1 of the 10 actually vulnerable functions.
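If it is useful, here is a minimal sketch of that calculation in Python, using the toy numbers from the example above (note that with only these four samples the recall denominator is 1, not 10; in practice it is the total number of vulnerable functions in your test set):

import numpy as np

def top_k_precision_recall(probs, labels, k):
    """Compute top-k precision and recall.

    probs:  predicted probabilities of being vulnerable ("1").
    labels: ground-truth labels, 1 = vulnerable, 0 = non-vulnerable.
    k:      number of highest-ranked samples to inspect.
    """
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # Indices of the k samples with the highest predicted probability.
    top_k_idx = np.argsort(probs)[::-1][:k]
    true_positives = labels[top_k_idx].sum()
    precision = true_positives / k            # vulnerable among the k returned
    recall = true_positives / labels.sum()    # found out of all vulnerable
    return precision, recall

# Example from above (k = 4): only "CVE" is actually vulnerable.
probs = [0.0011, 0.8917, 0.2513, 0.9584]
labels = [0, 0, 0, 1]
print(top_k_precision_recall(probs, labels, k=4))  # (0.25, 1.0)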
Hi, I read your paper and think it would be very helpful for my work. I want to know how to process the SARD samples after AST serialization, because a C file may contain several functions. @DanielLin1986 I hope to hear from you soon, thank you!