DanielLin1986 / Function-level-Vulnerability-Detection

A deep learning-based vulnerability detection framework
73 stars 23 forks

Step 1 - Training Neural Network stopped at Elmo #2

Closed anhhaibkhn closed 3 years ago

anhhaibkhn commented 4 years ago

Dear Sir, great work! Thank you for sharing the project. I have been following your work for a while, and it would be great to learn how to extend this benchmark to other neural networks. However, I ran into some trouble when running Step 1; the log is posted below. After training the Word2Vec model, the process moves on to the ELMo model and somehow gets stuck there. Could you suggest what might be going wrong here?

----------------------------------------
Start training the Word2Vec model. Please wait..
Model training completed!
----------------------------------------
The trained word2vec model:
Word2Vec(vocab=1886, size=100, alpha=0.025)
-------------------------------------------------------
Loading trained Word2vec model.
The trained word2vec model:
<_io.TextIOWrapper name='embedding/w2v_model.txt' mode='r' encoding='cp65001'>
Found 1887 word vectors.
[INFO] Word2vec loaded!
[INFO] Pad the sequence to unified length...
[INFO] Patition the data ....
[INFO] Data processing completed!
[INFO] -------------------------------------------------------
[INFO] There are 397 total samples in the training set. 28 vulnerable samples.
[INFO] There are 100 total samples in the validation set. 8 vulnerable samples.
[INFO] -------------------------------------------------------
WARNING:tensorflow:From E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\helper.py:193: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-04-16 21:59:04.212259: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
[INFO] No GPU detected.
[INFO] Using CPU for training. It may take considerable time!
elmo
[INFO] Loading the elmo model.
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    helper.exec()
  File "E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\helper.py", line 242, in exec
    model_func = elmo_network.build_elmo_network(GPU_flag)
  File "E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\models\elmo_network.py", line 43, in build_elmo_network
    elmo_embedding = Lambda(self.make_elmo_embedding, output_shape=(None, 1024))(elmo_input)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\backend\tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\base_layer.py", line 489, in __call__
    output = self.call(inputs, **kwargs)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\layers\core.py", line 716, in call
    return self.function(inputs, **arguments)
TypeError: make_elmo_embedding() takes 1 positional argument but 2 were given
anhhaibkhn commented 4 years ago

I managed to get past the above error by adding "self" to the method signature, i.e. def make_elmo_embedding(self, x), in the elmo_model class (a short sketch of this change follows the traceback below). After that, still inside that class, I tried to return the model so that the Summary() function in "helper" could run, but it now shows a different error. May I ask how to configure the training process to not use "Elmo" and to try other models first? Also, would it be possible to upgrade the scripts to TensorFlow 2.x? Thank you for reading my message. Model: "Elmo_network"

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, None)              0
_________________________________________________________________
lambda_1 (Lambda)            (None, None, 1024)        0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 256)         1180672
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 256)         394240
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65
=================================================================
Total params: 1,591,425
Trainable params: 1,591,425
Non-trainable params: 0
_________________________________________________________________

C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:92: UserWarning: The TensorBoard callback `batch_size` argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
  warnings.warn('The TensorBoard callback `batch_size` argument '
C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:97: UserWarning: The TensorBoard callback does not support gradients display when using TensorFlow 2.0. The `write_grads` argument is ignored.
  warnings.warn('The TensorBoard callback does not support '
Train on 397 samples, validate on 100 samples
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    helper.exec()
  File "E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\helper.py", line 275, in exec
    class_weight = class_weights)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training_arrays.py", line 119, in fit_loop
    callbacks.set_model(callback_model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\callbacks.py", line 68, in set_model
    callback.set_model(model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py", line 116, in set_model
    super(TensorBoard, self).set_model(model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1532, in set_model
    self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
AttributeError: 'Model' object has no attribute '_get_distribution_strategy'
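
For reference, the one-line change mentioned at the top of this comment is sketched here (an illustrative class, not the project's exact code; only the signature change matters):

    class ElmoModelSketch:
        # Before: def make_elmo_embedding(x): ...
        # Keras' Lambda layer calls the wrapped function with the input tensor; as a
        # bound method it also receives `self`, so the signature must accept both
        # arguments, which resolves the "takes 1 positional argument but 2 were given" error.
        def make_elmo_embedding(self, x):
            # ... compute and return the ELMo embedding for the input tensor x ...
            return x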
anhhaibkhn commented 4 years ago

Sorry, I hope this reply finds you well. Since I suspected the TensorFlow version might be the problem, I downgraded it to version 1.14 as you advised in the other thread. However, I still get the following error, which again looks like a TensorFlow issue.
May I ask whether it would be possible to skip the callback process or to try other models such as LSTM?

File "main.py", line 34, in <module>
    helper.exec()
  File "D:\Workspace\Research\Function-level-Vulnerability-Detection-master\src\helper.py", line 275, in exec
    class_weight = class_weights)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1705, in fit
    validation_steps=validation_steps)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1236, in _fit_loop
    outs = f(ins_batch)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\backend\tensorflow_backend.py", line 2482, in _call_
    **self.session_kwargs)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.
DanielLin1986 commented 4 years ago

Sorry, I hope this reply finds you well. Since I suspected the TensorFlow version might be the problem, I downgraded it to version 1.14 as you advised in the other thread. However, I still get the following error, which again looks like a TensorFlow issue. May I ask whether it would be possible to skip the callback process or to try other models such as LSTM?

File "main.py", line 34, in <module>
    helper.exec()
  File "D:\Workspace\Research\Function-level-Vulnerability-Detection-master\src\helper.py", line 275, in exec
    class_weight = class_weights)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1705, in fit
    validation_steps=validation_steps)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1236, in _fit_loop
    outs = f(ins_batch)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\backend\tensorflow_backend.py", line 2482, in _call_
    **self.session_kwargs)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.

Hi Anhhaibkhn, I am sorry for the late reply. It seems that the "class_weight = class_weights" argument caused the error. Please remove "class_weight = class_weights" and give it a try (see the sketch below).

To be compatible with TensorFlow 2.X and Keras 2.3.1, please also remove the code related to TensorBoard. I am sorry that the ELMo embedding is currently still not well supported. I will try to fix this ASAP.
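
A minimal sketch of the adjusted training call (the names model, train_x, train_y, val_x, val_y and callbacks_list are placeholders, not the project's actual identifiers):

    from keras.callbacks import TensorBoard

    # Drop the TensorBoard callback, which is what breaks with this TF 2.X / Keras 2.3.1 mix
    callbacks_list = [cb for cb in callbacks_list if not isinstance(cb, TensorBoard)]

    # Call fit() without the class_weight argument
    model.fit(train_x, train_y,
              validation_data=(val_x, val_y),
              epochs=100,
              batch_size=16,
              callbacks=callbacks_list)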

DanielLin1986 commented 4 years ago

I managed to get past the above error by adding "self" to the method signature, i.e. def make_elmo_embedding(self, x), in the elmo_model class. After that, still inside that class, I tried to return the model so that the Summary() function in "helper" could run, but it now shows a different error. May I ask how to configure the training process to not use "Elmo" and to try other models first? Also, would it be possible to upgrade the scripts to TensorFlow 2.x? Thank you for reading my message. Model: "Elmo_network"

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, None)              0
_________________________________________________________________
lambda_1 (Lambda)            (None, None, 1024)        0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 256)         1180672
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 256)         394240
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65
=================================================================
Total params: 1,591,425
Trainable params: 1,591,425
Non-trainable params: 0
_________________________________________________________________

C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:92: UserWarning: The TensorBoard callback `batch_size` argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
  warnings.warn('The TensorBoard callback `batch_size` argument '
C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:97: UserWarning: The TensorBoard callback does not support gradients display when using TensorFlow 2.0. The `write_grads` argument is ignored.
  warnings.warn('The TensorBoard callback does not support '
Train on 397 samples, validate on 100 samples
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    helper.exec()
  File "E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\helper.py", line 275, in exec
    class_weight = class_weights)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training_arrays.py", line 119, in fit_loop
    callbacks.set_model(callback_model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\callbacks.py", line 68, in set_model
    callback.set_model(model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py", line 116, in set_model
    super(TensorBoard, self).set_model(model)
  File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1532, in set_model
    self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
AttributeError: 'Model' object has no attribute '_get_distribution_strategy'

Hi Anhhaibkhn, if you would like to use another embedding method, for example Word2vec, just specify:

python main.py --config config/config.yaml --embedding word2vec

To use other types of networks, please change the 'model' tag to 'bilstm' in the configuration file named 'config.yaml'.

anhhaibkhn commented 4 years ago

Dear Sir, thank you very much for your reply. I have read your paper and have been following your work for a while.

  1. About the embedding methods, could you tell me which version of Glove you use? word2vec and fasttext seem to run fine, but "glove" does not.
  2. May I ask how to construct more datasets from the SARD Juliet suite? As I read in your paper, the data was randomly extracted from the Juliet test suite files, but could you explain the extraction process in a bit more detail?

Thanks in advance.

DanielLin1986 commented 4 years ago

Dear Sir, thank you very much for your reply. I have read your paper and have been following your work for a while.

  1. About the embedding methods, could you tell me which version of Glove you use? word2vec and fasttext seem to run fine, but "glove" does not.
  2. May I ask how to construct more datasets from the SARD Juliet suite? As I read in your paper, the data was randomly extracted from the Juliet test suite files, but could you explain the extraction process in a bit more detail?

Thanks in advance.

No worries. Thank you for being interested in our project.

Hi Hai, the version of the glove Python implementation that I am using is 1.0.1. Yes, you can write a crawler to download the C test samples from the SARD Juliet suite. If you would like some of the C functions already extracted from the SARD, please provide your email address and I can share the data with you.

anhhaibkhn commented 4 years ago

Dear Sir, thank you very much for your prompt reply. It would be a great help to try out the C test samples from SARD. My email is: nguyenngochaibkhn@gmail.com. There is a part of the code which I find a bit confusing, and I hope you can help me understand it. When the Glove model is called during the training process, it is supposed to return an embedding matrix:

            elif embedding_method == 'glove':
                from src.embedding import Glove as Embedding_Model
                embedding_model = Embedding_Model(self.config)
                total_sequences, word_index = embedding_model.LoadTokenizer(total_list)
                embedding_model.TrainGlove(total_list)
                embedding_matrix, embedding_dim = embedding_model.ApplyGlove()

However, the body of the function does not return a matrix, only a dictionary:

   def ApplyGlove(self):
        with open(self.config.tokenizer_saved_path + os.sep + 'glove.model') as f:
            glove_model = pickle.load(f)

        key_list = list(glove_model['dictionary'].keys())
        word_vector_list = glove_model['word_vectors'].tolist()

        embeddings_index = {}
        for index, item in enumerate(key_list):
            word = key_list[index]
            coefs = np.asarray(word_vector_list[index], dtype='float32')
            embeddings_index[word] = coefs
        print('Loaded %s word vectors.' % len(embeddings_index))

        return embeddings_index, self.components

At the moment, I am trying to implement other embedding methods, such as using pre-trained Glove and BERT models. However, the vector embeddings produced by these pre-trained models have not been very useful so far, since they were trained on natural-language text. I would be really grateful if you could suggest a direction for extending this study further.

Thank you very much for your time.

DanielLin1986 commented 4 years ago

Dear Sir, thank you very much for your prompt reply. It would be a great help to try out the C test samples from SARD. My email is: nguyenngochaibkhn@gmail.com. There is a part of the code which I find a bit confusing, and I hope you can help me understand it. When the Glove model is called during the training process, it is supposed to return an embedding matrix:

            elif embedding_method == 'glove':
                from src.embedding import Glove as Embedding_Model
                embedding_model = Embedding_Model(self.config)
                total_sequences, word_index = embedding_model.LoadTokenizer(total_list)
                embedding_model.TrainGlove(total_list)
                embedding_matrix, embedding_dim = embedding_model.ApplyGlove()

However, the body of the function does not return a matrix, only a dictionary:

  def ApplyGlove(self):
       with open(self.config.tokenizer_saved_path + os.sep + 'glove.model') as f:
           glove_model = pickle.load(f)

       key_list = list(glove_model['dictionary'].keys())
       word_vector_list = glove_model['word_vectors'].tolist()

       embeddings_index = {}
       for index, item in enumerate(key_list):
           word = key_list[index]
           coefs = np.asarray(word_vector_list[index], dtype='float32')
           embeddings_index[word] = coefs
       print('Loaded %s word vectors.' % len(embeddings_index))

       return embeddings_index, self.components

At the moment, I am trying to implement other embedding methods, such as using pre-trained Glove and BERT models. However, the vector embeddings produced by these pre-trained models have not been very useful so far, since they were trained on natural-language text. I would be really grateful if you could suggest a direction for extending this study further.

Thank you very much for your time.

Hi, I have sent the data to hai@cysec.cs.ritsumei.ac.jp before, but it seemed that you did not receive it. I have already sent the data to nguyenngochaibkhn@gmail.com. Please check.

Yes, you are right: the Glove model returns a dictionary. The dictionary contains key-value pairs, where the keys are the code tokens and the values are the generated embeddings, whose dimensionality is embedding_dim.

Using Glove and BERT for code analysis is a great idea! I think the difficult part is bridging the gap between natural language and software code. Looking forward to hearing about your progress.

anhhaibkhn commented 4 years ago

Dear Sir, thank you very much for your reply. Sorry for the inconvenience, but neither of my email accounts has received the C test data from SARD yet. Could you please check again? Regarding my comment above, my point is that after training the Glove model, in order to proceed to the next part (training the neural network), shouldn't it return an embedding matrix (a 2D array of vectors), as is done for Word2vec and FastText, instead of returning a dictionary? The neural network cannot proceed to the training stage because one of its arguments appears to be incorrect. Thank you for your time. Hai Nguyen.

DanielLin1986 commented 4 years ago

Dear Sir, thank you very much for your reply. Sorry for the inconvenience, but neither of my email accounts has received the C test data from SARD yet. Could you please check again? Regarding my comment above, my point is that after training the Glove model, in order to proceed to the next part (training the neural network), shouldn't it return an embedding matrix (a 2D array of vectors), as is done for Word2vec and FastText, instead of returning a dictionary? The neural network cannot proceed to the training stage because one of its arguments appears to be incorrect. Thank you for your time. Hai Nguyen.

Hi Hai, I think it was because of the size of the attachment: it exceeded your email server's limit, so it was not delivered. Please use the following GitHub link to download the data: https://github.com/cybercodeintelligence/CyberCI or https://cybercodeintelligence.github.io/CyberCI/

I will have a look at the Glove model and will get back to you. ^_^

anhhaibkhn commented 4 years ago

Thank you very much for your reply. Great resources; I will dive right into them. These techniques are extremely useful for a newbie in this field like me.

I look forward to hearing from you.

DanielLin1986 commented 4 years ago

Thank you very much for your reply. Great resources; I will dive right into them. These techniques are extremely useful for a newbie in this field like me.

I look forward to hearing from you.

No worries. I truly hope the resources will be helpful.

With regard to the Glove model, I think the embeddings_index and the components returned by the function ApplyGlove() are the embedding matrix and the embedding size needed for the subsequent processing. The embedding matrix is constructed using the dictionary. Please correct me if I am wrong since I wrote the code a long time ago.

Your comments and suggestions are welcomed.

anhhaibkhn commented 4 years ago

Thank you for your reply,

Yes; however, the embedding matrix is not actually constructed in this case, since the ApplyGlove function only returns embeddings_index, which is a dictionary. So the function may need to be modified slightly to return the embedding matrix.

Also, I can run the code with the other embedding methods, but right now I cannot use the GPU for training. I use CUDA 10.1 and Ubuntu 18.04, and it seems an incompatibility between drivers is the cause. May I ask which CUDA and tensorflow-gpu versions you use? Sorry that this post is becoming longer than expected. Thank you for your time.

DanielLin1986 commented 4 years ago

Thank you for your reply,

Yes; however, the embedding matrix is not actually constructed in this case, since the ApplyGlove function only returns embeddings_index, which is a dictionary. So the function may need to be modified slightly to return the embedding matrix.

Also, I can run the code with the other embedding methods, but right now I cannot use the GPU for training. I use CUDA 10.1 and Ubuntu 18.04, and it seems an incompatibility between drivers is the cause. May I ask which CUDA and tensorflow-gpu versions you use? Sorry that this post is becoming longer than expected. Thank you for your time.

Thanks for pointing out the bug. I will have a look.

I have tested the code on Windows 10 with CUDA 10.0; the driver version is 445.87 and the tensorflow-gpu version is 1.14. I also tested the code on Windows 10 with tensorflow-cpu version 2.0.

Cuda 10.1 should be alright. Have you checked the GPU driver? What is the size of the GPU memory?

DanielLin1986 commented 4 years ago

Thank you for your reply,

Yes; however, the embedding matrix is not actually constructed in this case, since the ApplyGlove function only returns embeddings_index, which is a dictionary. So the function may need to be modified slightly to return the embedding matrix.

Also, I can run the code with the other embedding methods, but right now I cannot use the GPU for training. I use CUDA 10.1 and Ubuntu 18.04, and it seems an incompatibility between drivers is the cause. May I ask which CUDA and tensorflow-gpu versions you use? Sorry that this post is becoming longer than expected. Thank you for your time.

While looking into the Glove issue, I found that there is something wrong with my glove_python installation, but I could not reinstall it. I have not found a solution yet.

anhhaibkhn commented 4 years ago

I was also unable to install glove_python in my Windows 10 virtual environment (Python 3.6.9). After switching to Ubuntu with Python 3.6.9, I was able to install glove_python. Here is my driver setup:

$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.243
nvidia-smi
Tue Jun  9 20:11:37 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8    16W / 250W |    396MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                158MiB |
|    0   N/A  N/A      1448      G   /usr/bin/gnome-shell              131MiB |
|    0   N/A  N/A      1901      G   ...mviewer/tv_bin/TeamViewer        2MiB |
|    0   N/A  N/A      2163      G   ...AAAAAAAAA= --shared-files       59MiB |
|    0   N/A  N/A      2729      G   ...oken=11205890301177880090       39MiB |
+-----------------------------------------------------------------------------+

Please take a look. For now I can run temporarily on the CPU, since the GPU_flag cannot be set. Thank you for your time.

anhhaibkhn commented 4 years ago

tf.test.is_gpu_available()
2020-06-09 20:49:04.371790: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-09 20:49:04.403490: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-06-09 20:49:04.403715: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2674db0 executing computations on platform Host. Devices:
2020-06-09 20:49:04.403730: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
False

I am trying to downgrade CUDA 10.1 to 10.0, following this link.

DanielLin1986 commented 4 years ago

tf.test.is_gpu_available()
2020-06-09 20:49:04.371790: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-09 20:49:04.403490: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-06-09 20:49:04.403715: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2674db0 executing computations on platform Host. Devices:
2020-06-09 20:49:04.403730: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
False

I am trying to downgrade CUDA 10.1 to 10.0, following this link.

Cuda 10.1 is fine. There is no issue with your GPU setting. At least I cannot tell... But the output of the tf.test.is_gpu_available() does not show the GPU device. The following is my output:

>>> tf.test.is_gpu_available()
2020-06-10 11:16:45.576210: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-06-10 11:16:45.602357: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2020-06-10 11:16:45.936833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0
2020-06-10 11:16:45.941222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845 pciBusID: 0000:0a:00.0
2020-06-10 11:16:45.941391: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-06-10 11:16:45.947961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2020-06-10 11:16:53.403187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-10 11:16:53.403271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2020-06-10 11:16:53.403531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N
2020-06-10 11:16:53.403562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N
2020-06-10 11:16:53.449045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 10603 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-10 11:16:53.470769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 6808 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:0a:00.0, compute capability: 6.1)
True

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , thank you very much for your reply. Sorry for the delay; I have been trying to get tensorflow-gpu to run. Finally, by using an NVIDIA CUDA Docker image and installing tensorflow-gpu 1.14.0 on top of it, I managed to run the code on the GPU. However, the training process now shows a cuDNN error, as below:

[INFO] Model structure loaded.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         22900     
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 1000, 128)         117760    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000, 128)         0         
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)     (None, 1000, 128)         132096    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 283,125
Trainable params: 260,225
Non-trainable params: 22,900
_________________________________________________________________
Train on 4 samples, validate on 2 samples
W0617 08:48:36.500040 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

W0617 08:48:36.500195 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/100
2020-06-17 08:48:36.744389: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-06-17 08:48:37.125719: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
  File "main.py", line 34, in <module>
    helper.exec()
  File "/home/Share/FunctionLevelVulnerabilityDetectionUpgrading/src/helper.py", line 340, in exec
    class_weight = class_weights)
  File "/usr/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
     [[{{node cu_dnnlstm_1/CudnnRNN}}]]
  (1) Unknown: Fail to find the dnn implementation.
     [[{{node cu_dnnlstm_1/CudnnRNN}}]]
     [[loss/mul/_99]]
0 successful operations.
0 derived errors ignored.

Can you take a look at this and let me know what went wrong here? Best regards, Hai Nguyen.

cybercodeintelligence commented 4 years ago

Dear @DanielLin1986 , thank you very much for your reply. Sorry for the delay; I have been trying to get tensorflow-gpu to run. Finally, by using an NVIDIA CUDA Docker image and installing tensorflow-gpu 1.14.0 on top of it, I managed to run the code on the GPU. However, the training process now shows a cuDNN error, as below:

[INFO] Model structure loaded.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         22900     
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 1000, 128)         117760    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000, 128)         0         
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)     (None, 1000, 128)         132096    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 283,125
Trainable params: 260,225
Non-trainable params: 22,900
_________________________________________________________________
Train on 4 samples, validate on 2 samples
W0617 08:48:36.500040 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

W0617 08:48:36.500195 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/100
2020-06-17 08:48:36.744389: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-06-17 08:48:37.125719: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
  File "main.py", line 34, in <module>
    helper.exec()
  File "/home/Share/FunctionLevelVulnerabilityDetectionUpgrading/src/helper.py", line 340, in exec
    class_weight = class_weights)
  File "/usr/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
   [[{{node cu_dnnlstm_1/CudnnRNN}}]]
  (1) Unknown: Fail to find the dnn implementation.
   [[{{node cu_dnnlstm_1/CudnnRNN}}]]
   [[loss/mul/_99]]
0 successful operations.
0 derived errors ignored.

Can you take a look at this and let me know what went wrong here? Best regards, Hai Nguyen.

Hi Hai,

This is because the GPU could not load the cuDNN library; the Docker container needs the NVIDIA cuDNN library. Please try using LSTM instead of CuDNNLSTM if you have difficulty installing the cuDNN library (a sketch of the swap is shown below).
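
A minimal sketch of the swap (the layer size is illustrative, not the project's actual configuration):

    from keras.layers import LSTM, Bidirectional
    # from keras.layers import CuDNNLSTM   # cuDNN-backed layer that triggered the error above

    # Instead of: Bidirectional(CuDNNLSTM(128, return_sequences=True))
    rnn = Bidirectional(LSTM(128, return_sequences=True))  # runs without cuDNN, on CPU or GPU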

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , Thank you for your reply.

2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

It successfully loaded libcudnn here. Do you think the problem may be the GPU setting config.gpu_options.allow_growth = True, as in this link?

If I use LSTM instead of CuDNNLSTM, I am afraid the whole point of using the GPU to speed up the training process will be lost. Best, Hai Nguyen.

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , thank you very much for your support. After adding $ export TF_FORCE_GPU_ALLOW_GROWTH=true, I am able to run with CuDNNLSTM now. To be honest, I have not fully understood this setting yet. I will let you know more after trying out the other models and the validation phase. Best regards, Hai Nguyen.

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , @cybercodeintelligence, thank you for your support as always. After fixing the cuDNN error, I went back to the Glove embedding problem and fixed it as follows, in embedding.py:

    def TrainGlove(self, data_list):

        from glove import Corpus, Glove
        # creating a corpus object
        print ("----------------------------------------")
        print ("Start training the GLoVe model. Please wait.. ")
        corpus = Corpus()
        corpus.fit(data_list, window=self.glove_window)
        glove = Glove(no_components=self.components, learning_rate=self.glove_learning_rate)

        glove.fit(corpus.matrix, epochs=self.glove_epoch, no_threads=self.n_workers, verbose=True)
        glove.add_dictionary(corpus.dictionary)
        glove.save(self.tokenizer_saved_path + 'glove.model') # This is to save the model as a pkl file.

        # Save Glove model as .txt format for checking content
        vector_size = self.components
        with open(self.tokenizer_saved_path + 'results_glove.txt', "w") as f:
            for word in glove.dictionary:
                f.write(word)
                f.write(" ")
                for i in range(0, vector_size):
                    f.write(str(glove.word_vectors[glove.dictionary[word]][i]))
                    f.write(" ")
                f.write("\n")

        print("GLOVE SAVE HERE",self.tokenizer_saved_path + 'glove.model')
        print ("Model training completed!")
        print ("----------------------------------------")

    def ApplyGlove(self, word_index):
        print("GLOVE OPEN HERE",self.tokenizer_saved_path  + 'glove.model')
        from glove import Corpus, Glove
        glove_model = Glove.load(self.tokenizer_saved_path + 'glove.model')
        print('glove model',glove_model.dictionary)
        print('glove model',glove_model.word_vectors)

        key_list = list(glove_model.dictionary.keys())
        word_vector_list = glove_model.word_vectors.tolist()

        # with open(self.tokenizer_saved_path + 'glove.model', 'rb') as f:
        #     glove_model = pickle.load(f)         
        # key_list = list(glove_model['dictionary'].keys())
        # word_vector_list = glove_model['word_vectors'].tolist()

        embeddings_index = {}
        for index, item in enumerate(key_list):
            word = key_list[index]
            coefs = np.asarray(word_vector_list[index], dtype='float32')
            embeddings_index[word] = coefs
        print('Loaded %s word vectors.' % len(embeddings_index))

        embedding_matrix = np.zeros((len(word_index) + 1, self.components))
        for word, i in word_index.items():
           embedding_vector = embeddings_index.get(word)
           if embedding_vector is not None:
               # words not found in embedding index will be all-zeros.
               embedding_matrix[i] = embedding_vector

        return embedding_matrix, self.components

Then, in helper.py: embedding_matrix, embedding_dim = embedding_model.ApplyGlove(word_index). With the above modifications, I can train the data with Glove. Would you mind taking a look at this fix and checking whether it is legitimate?

Also, when running the function-level experiments, is it okay to arrange all the function files from the 9 projects into one data folder and then run the Python scripts pointing to that data folder?

Thank you for your time. Best regards, Hai Nguyen.

DanielLin1986 commented 4 years ago

Dear @DanielLin1986 , thank you very much for your support. After adding $ export TF_FORCE_GPU_ALLOW_GROWTH=true, I am able to run with CuDNNLSTM now. To be honest, I have not fully understood this setting yet. I will let you know more after trying out the other models and the validation phase. Best regards, Hai Nguyen.

Hi Hai,

This is a great finding!

The "TF_FORCE_GPU_ALLOWGROWTH=true" prevents the GPU to allocate all its memory to the process. But, the errors "Fail to find the dnn implementation." seemed to be related with the DNN implementation. I did not expect that adding this line would help, which is great anyway! ^^ To be frank, I also do not fully know how this works... Please update me any new findings.

Yes, you're right. If you can use the CuDNN LSTM, the training process will be significantly faster.

Best regards,

Daniel Lin

DanielLin1986 commented 4 years ago

Dear @DanielLin1986 , @cybercodeintelligence, thank you for your support as always. After fixing the cuDNN error, I went back to the Glove embedding problem and fixed it as follows, in embedding.py:

    def TrainGlove(self, data_list):

        from glove import Corpus, Glove
        # creating a corpus object
        print ("----------------------------------------")
        print ("Start training the GLoVe model. Please wait.. ")
        corpus = Corpus()
        corpus.fit(data_list, window=self.glove_window)
        glove = Glove(no_components=self.components, learning_rate=self.glove_learning_rate)

        glove.fit(corpus.matrix, epochs=self.glove_epoch, no_threads=self.n_workers, verbose=True)
        glove.add_dictionary(corpus.dictionary)
        glove.save(self.tokenizer_saved_path + 'glove.model') # This is to save the model as a pkl file.

        # Save Glove model as .txt format for checking content
        vector_size = self.components
        with open(self.tokenizer_saved_path + 'results_glove.txt', "w") as f:
            for word in glove.dictionary:
                f.write(word)
                f.write(" ")
                for i in range(0, vector_size):
                    f.write(str(glove.word_vectors[glove.dictionary[word]][i]))
                    f.write(" ")
                f.write("\n")

        print("GLOVE SAVE HERE",self.tokenizer_saved_path + 'glove.model')
        print ("Model training completed!")
        print ("----------------------------------------")

    def ApplyGlove(self, word_index):
        print("GLOVE OPEN HERE",self.tokenizer_saved_path  + 'glove.model')
        from glove import Corpus, Glove
        glove_model = Glove.load(self.tokenizer_saved_path + 'glove.model')
        print('glove model',glove_model.dictionary)
        print('glove model',glove_model.word_vectors)

        key_list = list(glove_model.dictionary.keys())
        word_vector_list = glove_model.word_vectors.tolist()

        # with open(self.tokenizer_saved_path + 'glove.model', 'rb') as f:
        #     glove_model = pickle.load(f)         
        # key_list = list(glove_model['dictionary'].keys())
        # word_vector_list = glove_model['word_vectors'].tolist()

        embeddings_index = {}
        for index, item in enumerate(key_list):
            word = key_list[index]
            coefs = np.asarray(word_vector_list[index], dtype='float32')
            embeddings_index[word] = coefs
        print('Loaded %s word vectors.' % len(embeddings_index))

        embedding_matrix = np.zeros((len(word_index) + 1, self.components))
        for word, i in word_index.items():
           embedding_vector = embeddings_index.get(word)
           if embedding_vector is not None:
               # words not found in embedding index will be all-zeros.
               embedding_matrix[i] = embedding_vector

        return embedding_matrix, self.components

Then, in helper.py: embedding_matrix, embedding_dim = embedding_model.ApplyGlove(word_index). With the above modifications, I can train the data with Glove. Would you mind taking a look at this fix and checking whether it is legitimate?

Also, when running the function-level experiments, is it okay to arrange all the function files from the 9 projects into one data folder and then run the Python scripts pointing to that data folder?

Thank you for your time. Best regards, Hai Nguyen.

Hi Hai,

Your help and contribution are much appreciated! I have read the code you wrote and did not see any issues. Currently, I cannot run the code since my machine has problems installing glove. If you can successfully run the code, that means everything is fine!

It is okay to arrange all the function files from 9 projects into 1 data folder. Please change the path of the codebase to the data folder and it should be fine.

You are free to modify the code according to your needs. ^_^

Best regards,

Daniel Lin

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 @cybercodeintelligence , thank you very much for your reply. I have learned a lot from your paper and your project so far; they are really helpful for a newbie in this research field like me. The code runs fine on tensorflow-gpu 1.14, so I tried it on TensorFlow 2.x, but it showed an error which you may already be aware of:

src/DataLoader.py", line 119, in SavedPickle
    pickle.dump(file_to_save, handle)
TypeError: can't pickle _thread.RLock objects

May I ask what the purpose of saving the pickle file during training is, since we already save the model as an .h5 file? Will removing the TensorBoard part of the callback list affect the evaluation phase?

Also, since the paper describes the top-k percentage metric as the list of retrieved functions accounting for k% of the total functions in the test set, how can I customize the top-k percentage (10%, 20%, 50%) and check these metrics after the training process?

Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen

DanielLin1986 commented 4 years ago

Dear @DanielLin1986 @cybercodeintelligence , thank you very much for your reply. I have learned a lot from your paper and your project so far; they are really helpful for a newbie in this research field like me. The code runs fine on tensorflow-gpu 1.14, so I tried it on TensorFlow 2.x, but it showed an error which you may already be aware of:

src/DataLoader.py", line 119, in SavedPickle
   pickle.dump(file_to_save, handle)
TypeError: can't pickle _thread.RLock objects

May I ask what the purpose of saving the pickle file during training is, since we already save the model as an .h5 file? Will removing the TensorBoard part of the callback list affect the evaluation phase?

Also, since the paper describes the top-k percentage metric as the list of retrieved functions accounting for k% of the total functions in the test set, how can I customize the top-k percentage (10%, 20%, 50%) and check these metrics after the training process?

Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen

Hi Hai,

The pickle.dump() call in DataLoader.py is intended to save the processed code sequences. The function is not actually used in DataLoader.py, so if it causes the error, you can comment it out. Pickle is a standard Python module; if it works on TensorFlow 1.x, it should also work on TensorFlow 2.x. Please have a look.

Yes, you can customize the top-k percentage (10%, 20%, 50%). The result is a list of retrieved functions with their probabilities of being vulnerable, and you can rank the functions by these probabilities. If your test set contains 100 samples, the top 10% is the first 10 samples ranked by their probability of being vulnerable (the 10 most probable ones).
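
A minimal sketch of that ranking step (the CSV column names 'probability' and 'label' are assumptions, not necessarily the actual output format):

    import pandas as pd

    def top_k_percent(csv_path, k=0.10):
        # Return the k% of test functions with the highest predicted probability.
        df = pd.read_csv(csv_path)
        df = df.sort_values('probability', ascending=False)  # most probable first
        n = max(1, int(len(df) * k))
        return df.head(n)

    # retrieved = top_k_percent('test_results.csv', k=0.10)
    # tp = int((retrieved['label'] == 1).sum())  # truly vulnerable functions among the retrieved ones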

Best regards,

Daniel

anhhaibkhn commented 4 years ago

Dear @DanielLin1986, Thank you very much for your support. Sorry that it took me a while to reproduce the experiments.

Eventually, on TensorFlow 1.14.1 and using the provided dataset of 9 projects, I managed to train the Bi-LSTM model with the three embedding methods Word2vec, Glove, and FastText. I am now at Step 2: running tests for the trained models. I arranged all the vulnerable and non-vulnerable functions into one folder, then ran the test for the trained models with test_set_path pointing to that folder (the same folder used for the training process). For example, I used the proposed dataset to train the Bi-LSTM model with the Word2vec embedding method and ran the test on it; I obtained the CSV result, and here is its log:

61168/61168 [==============================] - 101s 2ms/step
[INFO] bilstm classification result: 
[INFO] Total accuracy: 0.9758565369106641
[INFO] ----------------------------------------------------
[INFO] The confusion matrix: 

[[59691     4]
 [ 1473     0]]

                precision    recall  f1-score   support

Non-vulnerable       0.98      1.00      0.99     59695
    Vulnerable       0.00      0.00      0.00      1473

      accuracy                           0.98     61168
     macro avg       0.49      0.50      0.49     61168
  weighted avg       0.95      0.98      0.96     61168

Here, may I confirm whether I understand correctly how to obtain the top-k percentage metric? For example, the top 10%: I ran the test on 61168 files, so the top 10% would be around 6116 files with the highest probabilities of being vulnerable, based on the collected CSV file. Then, based on their labels, TP@k% would be the truly vulnerable functions and FP@k% the falsely flagged ones among those 6116 files. If FN@k% refers to the truly vulnerable samples missed by the model when returning k% of the functions, would FN@k% be equal to FP@k%? I apologize that this comment is a bit long; I just want to confirm my understanding of your paper.

Also, regarding the SARD dataset: since the vulnerable function file names do not contain the keyword 'cve' or 'CVE' used for generating labels, do I need to modify the GenerateLabels function to catch different keywords, such as 'bad' for vulnerable functions and 'good' for non-vulnerable ones?

Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , @cybercodeintelligence Thank you very much for your support as always.

Formulas in the paper:

P@k% = TP@k% / (TP@k% + FP@k%)
R@k% = TP@k% / (TP@k% + FN@k%)

For example, the total number of test functions was 61168, of which 1473 are vulnerable. The top 10% retrieves around 6116 files with the highest vulnerability probabilities, but only 1470 of the retrieved files were truly vulnerable, 3 vulnerable files were missed, and 4646 non-vulnerable files were retrieved. So please correct me if I am wrong:

TP@k% = 1470
FP@k% = 4646
FN@k% = 3

P@k% and R@k% are then calculated with the formulas above.
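
Plugging those numbers into the formulas (plain arithmetic, not project code):

    tp, fp, fn = 1470, 4646, 3
    precision_at_10 = tp / (tp + fp)   # 1470 / 6116 ≈ 0.240
    recall_at_10    = tp / (tp + fn)   # 1470 / 1473 ≈ 0.998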

Thank you very much for your time Looking forward to hearing from you. Hai Nguyen.

cybercodeintelligence commented 4 years ago

Dear @DanielLin1986, Thank you very much for your support. Sorry that it took me a while to reproduce the experiments.

Eventually, on TensorFlow 1.14.1 and using the provided dataset of 9 projects, I managed to train the Bi-LSTM model with the three embedding methods Word2vec, Glove, and FastText. I am now at Step 2: running tests for the trained models. I arranged all the vulnerable and non-vulnerable functions into one folder, then ran the test for the trained models with test_set_path pointing to that folder (the same folder used for the training process). For example, I used the proposed dataset to train the Bi-LSTM model with the Word2vec embedding method and ran the test on it; I obtained the CSV result, and here is its log:

61168/61168 [==============================] - 101s 2ms/step
[INFO] bilstm classification result: 
[INFO] Total accuracy: 0.9758565369106641
[INFO] ----------------------------------------------------
[INFO] The confusion matrix: 

[[59691     4]
 [ 1473     0]]

                precision    recall  f1-score   support

Non-vulnerable       0.98      1.00      0.99     59695
    Vulnerable       0.00      0.00      0.00      1473

      accuracy                           0.98     61168
     macro avg       0.49      0.50      0.49     61168
  weighted avg       0.95      0.98      0.96     61168

Here, may I confirm whether I understand correctly how to obtain top-k percentage metric or not? For example : Top 10% Here I ran test on 61168 files, top 10%would be around 6116 files with the highest Probs. of being vulnerable based on the collected csv file, then based on its label, TP@k% are those actual vulnerable functions and FP@k% are those false vulnerable functions among 6116 files. Then if FN@k% refers to the true vulnerable samples missed by the model when returning k% functions, would FN@k% be equal to FP@k% ? I apologize that this comment was abit long since I just want to confirm my understanding of your paper.

Also, regarding to the SARD dataset, since the vulnerable functions file names do not have the key word 'cve' or 'CVE' for generating label, do I need to modify the GenerateLabels function to catch different key words such as 'bad' for vulnerable, and 'good' for non-vulnerable ones ?

Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen

Hi Hai,

Sorry for the late reply. Yes, regarding the Top-k part, I think you have correctly understood it.

The vulnerable functions of the SARD data are all in one folder and the non-vulnerable ones are in another folder, which is different from the real-world samples from 9 open-source projects. The GenerateLabels function is for generating labels for the data from 9 open-source projects.

You will have to tune the networks, since the result you provided shows that the precision and recall of the vulnerable class are both ZERO.

Best regards,

Daniel

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , Thank you for your reply. It seems I misunderstood how to use the benchmark properly. If the GenerateLabels function is only for the 9-project dataset, then how do we get the labels after loading the SARD dataset?

I am not sure how to do proper tuning for the model. The test results, which I collected after training the Bi-LSTM with word2vec embedding on the 9 projects, showed very low precision for the top-k metrics.

Thank you for your time. Looking forward to hearing from you. Best regards Hai

cybercodeintelligence commented 4 years ago

Hi Hai,

Apologies for the late reply.

Obtaining the labels for the SARD dataset may require you to write some code. For example, you can label the SARD data from the vulnerable-function folder as "1" and the data from the non-vulnerable-function folder as "0".
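A minimal sketch of that idea (the folder names vul/ and non_vul/ are only placeholders for however you organised the data):

import os

def label_sard_by_folder(root_dir, vul_dir="vul", non_vul_dir="non_vul"):
    # Sketch: label files from the vulnerable folder as 1 and files from the
    # non-vulnerable folder as 0. Returns (file_path, label) pairs.
    labeled = []
    for label, sub_dir in ((1, vul_dir), (0, non_vul_dir)):
        folder = os.path.join(root_dir, sub_dir)
        for name in sorted(os.listdir(folder)):
            labeled.append((os.path.join(folder, name), label))
    return labeled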

If you use the SARD dataset or the 9-project dataset, please set the using_separate_test_set in the configuration file to False. Then, the code will automatically partition the dataset into training, validation, and test sets. Then, just wait to see the results. When you set the using_separate_test_set to False, you just need to specify where your SARD dataset or the 9-project dataset is. ^_^

If you set using_separate_test_set to True, the code will only partition the dataset into the training set and the validation set, which means you have to specify a path (test_set_path) where the test set is stored, so the trained model will be tested on your specified test set.

Best regards,

Daniel

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , Thank you very much for your reply,

If you use the SARD dataset or the 9-project dataset, please set the using_separate_test_set in the configuration file to False. Then, the code will automatically partition the dataset into training, validation, and test sets. Then, just wait to see the results. When you set the using_separate_test_set to False, you just need to specify where your SARD dataset or the 9-project dataset is.

This means that when I train with the SARD dataset or the 9-project dataset, I should put it in the default data folder data/. Am I right? As far as I can tell, when testing the trained model, the code no longer accepts a data-folder argument, so I either set using_separate_test_set to True and state the location of the test set, or set it to False and let it test on the data in the default folder. However, I think the current code does not yet support the SARD dataset, and I still need to modify the label function for this case, whether I set using_separate_test_set to True or False.

If you set using_separate_test_set to True, the code will only partition the dataset into the training set and the validation set, which means you have to specify a path (test_set_path) where the test set is stored, so the trained model will be tested on your specified test set.

If using_separate_test_set is set to True, the code will partition the dataset into training and validation sets, but I think that happens only in step 1 (training). In step 2, should the loaded model go straight to testing on the separate test set?

I also sent you an email last week; could you confirm whether you were able to receive it? Thank you for your time. Looking forward to hearing from you. Best regards, Hai Nguyen

anhhaibkhn commented 4 years ago

Dear @DanielLin1986 , thank you for your last reply,

I set using_separate_test_set to False and put the whole 9-project dataset into the default data folder data/. I was able to get the following results with the Bi-LSTM model and the word2vec embedding method:

                precision    recall  f1-score   support

Non-vulnerable       0.98      1.00      0.99     11916
    Vulnerable       0.77      0.39      0.52       318

      accuracy                           0.98     12234
     macro avg       0.87      0.69      0.75     12234
  weighted avg       0.98      0.98      0.98     12234

It looks better than the previously posted result, but I have not changed any hyperparameters here, so I am not sure whether tuning can help improve the above results.

I also want to try the SARD dataset to reproduce your work; so far I understand the labeling idea from your last reply. I just want to ask whether there is anything I need to pay attention to before adding the SARD label-generating function, such as when labeling data before training and testing.

Thank you for your time. Best regards, Hai Nguyen

cybercodeintelligence commented 3 years ago

Hi Hai,

Apologies again for the late reply.

Well done! You have a result. But I am afraid that tuning and optimizing a neural network is a challenging task; I am also working on it. I think tuning and optimizing the hyperparameters can be experience-driven and heuristic (I may be wrong). You can start by modifying the optimizer, the batch_size, and the number of epochs.
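For illustration only, a compact sketch of the kind of knobs to vary (the tiny model and the random stand-in data are placeholders, not the framework's actual code):

import numpy as np
from tensorflow.keras import layers, models, optimizers

# Toy stand-in data so the snippet runs on its own; replace it with the
# framework's padded token sequences and labels.
x_train = np.random.randint(0, 1000, (512, 100))
y_train = np.random.randint(0, 2, 512)
x_val = np.random.randint(0, 1000, (128, 100))
y_val = np.random.randint(0, 2, 128)

model = models.Sequential([
    layers.Embedding(1000, 64, input_length=100),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),
])

# The usual starting knobs: optimizer (and its learning rate), batch size, epochs.
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),  # or RMSprop/SGD, other rates
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=32,   # try 16 / 64 / 128
          epochs=5)        # try more epochs, ideally with early stopping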

At the current stage, the data, particularly the vulnerable data, is insufficient for training a neural network. Therefore, obtaining more labeled data for training can definitely contribute to a better detection result.

With regard to the SARD data, I think you have everything ready as long as you can obtain the SARD labels. One thing you may need to consider is the "good" and "bad" phrases that appear in the function bodies: these phrases may bias the model. You can give the SARD data a try first anyway.

Best regards,

Daniel Lin

anhhaibkhn commented 3 years ago

Dear @DanielLin1986 , @cybercodeintelligence, Thank you very much for your reply. I will run more experiments to tune the neural network for the 9 projects dataset as you advised.

Regarding the SARD data, thanks to the organized folders (non-vul and vul) you provided, writing Python scripts for labeling them can be done without problems. However, if you had not mentioned the words 'bad' and 'good' in the function bodies, I would not have seen the model bias there, so thank you! The NLP embedding may rely on those words, rather than the function content, to predict the vulnerability. May I ask whether I should create a script to replace those words (good, bad) with dummy names before training the classifier? Also, if possible, could you advise me on how to get the test results for the SARD dataset as in your paper?

Thank you for your time. With best regards Hai Nguyen

cybercodeintelligence commented 3 years ago

Hi Hai,

No worries.

Yes. Using dummy names to replace the "good" "bad" words was exactly what I did.
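Something along these lines (a rough sketch, not my exact script; the particular dummy tokens are just an illustration):

import re

# Map the tell-tale SARD naming fragments to neutral dummy tokens.
REPLACEMENTS = {"good": "func_a", "GOOD": "FUNC_A", "bad": "func_b", "BAD": "FUNC_B"}

def neutralise_sard_tokens(source_code):
    # Replace 'good'/'bad' fragments inside identifiers with dummy names so the
    # model cannot use them as a shortcut for the label.
    pattern = re.compile("|".join(sorted(REPLACEMENTS, key=len, reverse=True)))
    return pattern.sub(lambda m: REPLACEMENTS[m.group(0)], source_code)

# e.g. neutralise_sard_tokens("void CWE121_bad() { goodG2B(); }")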

Automatically obtaining the test results for the SARD dataset may not be easy. What I did was the following (a rough sketch comes after the list):

  1. Build an index table that stores function name and label pairs, e.g., the first column contains the function name and the second column contains its corresponding label. By the function names, you can look up their labels.
  2. Mix the vulnerable and non-vulnerable functions together, e.g., in one folder.
  3. Use the same partition method as you do for the 9 open-source projects to split the data into three parts.
  4. For each sample in each part, use the index table to get the label by its function name.
  5. Then you can process the SARD data exactly as you do the data from the 9 open-source projects.
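A rough sketch of these steps (the folder layout, helper names, and ratios are only assumptions):

import os
import random

def build_label_index(vul_dir, non_vul_dir):
    # Step 1: map each function file name to its label.
    index = {name: 1 for name in os.listdir(vul_dir)}
    index.update({name: 0 for name in os.listdir(non_vul_dir)})
    return index

def partition(file_names, train_ratio=0.6, val_ratio=0.2, seed=42):
    # Steps 2-3: shuffle the mixed file list and split it into three parts.
    random.Random(seed).shuffle(file_names)
    n_train = int(len(file_names) * train_ratio)
    n_val = int(len(file_names) * val_ratio)
    return (file_names[:n_train],
            file_names[n_train:n_train + n_val],
            file_names[n_train + n_val:])

# Step 4: recover the labels for any part through the index table, e.g.
# labels = [label_index[name] for name in train_files]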

Hopefully, the above description answers your question.

Best regards,

Daniel Lin.

anhhaibkhn commented 3 years ago

Dear @DanielLin1986 , cc @cybercodeintelligence , That is a great idea. Thank you very much for your help.

I will proceed to process the SARD dataset as you advised. I hope I can reproduce the results for the SARD dataset this week and get back to you.

Thank you for your time. With best regards Hai Nguyen

DanielLin1986 commented 3 years ago

Hi Hai,

Good luck. Hope that everything goes well.

Best regards,

Daniel Lin

anhhaibkhn commented 3 years ago

Dear @DanielLin1986 , @cybercodeintelligence , Thank you very much for your last reply,

I have added a script that processes the SARD dataset to remove these keywords: 'bad', 'BAD', 'GOOD', 'good', and then labels each function according to its file name. Here are the logs I collected when running the test for the trained model using word2vec:

Total number of vulnerable functions is: 3318/15000

TOP  1.00%   150 files
value 1 was found 150 times
The True positive samples TP@k% is  150
value 0 was found 0 times
The False positive samples FP@k% is  0
The top-k percentage P@K% is  100.00%
The top-k percentage R@K% is  4.52%

 TOP  10.00%   1500 files
value 1 was found 1500 times
The True positive samples TP@k% is  1500
value 0 was found 0 times
The False positive samples FP@k% is  0
The top-k percentage P@K% is  100.00%
The top-k percentage R@K% is  45.21%

 TOP  20.00%   3000 files
value 1 was found 2673 times
The True positive samples TP@k% is  2673
value 0 was found 327 times
The False positive samples FP@k% is  327
The top-k percentage P@K% is  89.10%
The top-k percentage R@K% is  80.56%

 TOP  50.00%   7500 files
value 1 was found 3317 times
The True positive samples TP@k% is  3317
value 0 was found 4183 times
The False positive samples FP@k% is  4183
The top-k percentage P@K% is  44.23%
The top-k percentage R@K% is  99.97%

This is very close to the results you introduced in the paper; I think the difference was the number of vulnerable files. Please correct me if I am wrong, but I think the 'train_test_split' function automatically shuffles and splits the data according to the Test_set_ratio. So, do we still need to mix the data before using it for training?

It has been a productive time following your work. I noticed that the paper did not mention the 'attention' method or the 'ELMo' model implementation; may I ask how you implement the 'attention' method for the deep learning model, as well as about the possibilities of extending your benchmark with other language models like BERT or ELMo?

Thank you very much for your time. With best regards Nguyen Hai.

cybercodeintelligence commented 3 years ago

Hi Hai,

You are welcome. Thanks for being interested in our work, again.

Your results look good. According to them, examining 50% of the functions with the framework could identify 99.97% of the vulnerable functions (the top-50% recall was 99.97%). This is very close to our results.

Yes. I also think that the 'train_test_split' function performs the shuffling and partitioning of the data sets. When using the six-project data sets, I mixed the data to make sure that data from every project exists in the training, validation, and test sets. Therefore, I think you can mix the data based on your experimental settings.
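For reference, a minimal sketch of a two-stage split with scikit-learn's train_test_split (the ratios and the stratify option are illustrative choices, not necessarily what the framework does):

from sklearn.model_selection import train_test_split

def three_way_split(samples, labels, test_ratio=0.2, val_ratio=0.2, seed=42):
    # Shuffle and split into train/validation/test sets; stratify keeps the
    # proportion of the rare vulnerable class similar in every split.
    x_rest, x_test, y_rest, y_test = train_test_split(
        samples, labels, test_size=test_ratio, random_state=seed,
        shuffle=True, stratify=labels)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=val_ratio / (1.0 - test_ratio),
        random_state=seed, shuffle=True, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)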

Yes. You are right. The 'attention' and 'ELMo' were added after the completion of the paper. The attention (HAN) mechanism that I am using is implemented by Luiz Felix (Github link: https://github.com/lzfelix/keras_attention). Please feel free to use the code. You can also use other types of attention mechanisms. For example, this paper: 'Software Defect Prediction via Attention-Based Recurrent Neural Network' uses the self-attention mechanism (Correct me if I am wrong).

BERT and ELMo are good ideas! I am also working on them. Personally, I think using these language models as code embedding solutions is feasible. However, the ELMo embedding results are not significantly better than those of the current deep learning-based methods. BERT is challenging because you need very powerful GPUs. Anyway, how to effectively explore the potential of these language models for code embeddings is still an open question. Maybe we need to customize the language models and/or the neural network to fit code. Discussion is welcome.

Best regards,

Daniel Lin

anhhaibkhn commented 3 years ago

Dear Mr. Daniel Lin, cc: @DanielLin1986 , @cybercodeintelligence Thank you very much for your detailed explanation!

While surveying deep learning methods for detecting code vulnerabilities, I have found your papers, together with your projects, greatly helpful for my graduate study. I am currently pursuing my MS degree at Ritsumeikan University in Japan, and I would like to know whether it is possible to use your project as the foundation for my own research. I would love to have further discussions with you when you have free time.

Thank you very much for your time. With best regards Hai Nguyen

DanielLin1986 commented 3 years ago

Hi Hai,

No worries. It is our pleasure.

Yes, of course. You are welcome to use our projects, including our code and data, to start your research, as long as you kindly cite our papers. You can use, modify, and improve the code. However, please note that if you would like to share our data with your classmates or fellow researchers, please let us know.

Yes. Discussions and comments are welcome. Thank you for being interested in our projects.

All the best!

Regards,

Daniel Lin

anhhaibkhn commented 3 years ago

Dear @DanielLin1986 , @cybercodeintelligence , Thank you very much for your answer,

This is great news for me. Yes, of course, I will cite your papers properly and inform you if anyone wants to access your data. Thank you again for your kind help and generosity. I am the only one in my lab doing this research and, to be honest, I have learned the most from your papers and experiments. Recently, I added pre-trained GloVe and pre-trained BERT to the embedding module. However, I noticed that changing the embedding method did not really improve the precision or recall for the nine-project dataset, and the same happened for the SARD dataset. So, I am thinking of trying different deep learning models like GRU and Bi-GRU instead of LSTM and Bi-LSTM. May I ask whether you can think of any other deep models, or changes to the current model structures, that could potentially improve the current results? It would be very useful to hear your advice.

I have also read your recent paper, DeepBalance: Deep-Learning and Fuzzy Oversampling for Vulnerability Detection; rebalancing the dataset with the fuzzy oversampling method was a great idea. It would be great if you could let me know whether there is a conference or online workshop that you are going to attend or present at. I would really appreciate the chance to hear you present in person.

Thank you for your time. With best regards Hai Nguyen

anhhaibkhn commented 3 years ago

Dear @DanielLin1986 , @cybercodeintelligence Thank you for your last reply, Hope you are doing well.

Currently, I am working on constructing different models like the one you suggested in 'Software Defect Prediction via Attention-Based Recurrent Neural Network', which uses the self-attention mechanism. The paper's model is built with a Bi-LSTM layer followed by an attention layer. I cannot help but notice that the benchmark currently employs the LSTM_with_HAN model, which only has an LSTM layer but not a bidirectional layer as in the mentioned paper.
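For reference, this is roughly the shape of the model I am experimenting with: a tf.keras sketch of a Bi-LSTM followed by a simple additive attention pooling layer (the layer sizes and the attention layer itself are my own simplification, not the benchmark's or the paper's exact implementation):

import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionPooling(layers.Layer):
    # Simple additive attention that pools the Bi-LSTM outputs over time steps.
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="att_w", shape=(dim, 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, inputs):
        # Per-time-step attention weights, normalised over the sequence length.
        scores = tf.nn.softmax(tf.tensordot(tf.tanh(inputs), self.w, axes=1), axis=1)
        # Weighted sum over time -> one vector per function.
        return tf.reduce_sum(scores * inputs, axis=1)

def build_bilstm_attention(vocab_size=10000, embed_dim=100, max_len=1000):
    inp = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = AttentionPooling()(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model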

Also, I got results after trying out the other RNNs (e.g., BiGRU and GRU). The results are pretty similar to Bi-LSTM and LSTM on the SARD dataset, where all the top-1% and top-10% precision rates reached 100%, and the top-50% recall rate reached 100%. Do you think that is because the dataset was artificially synthesized, so almost all models can reach satisfying results? If so, at this point, the only useful comparisons would be between detectors trained on the nine-project dataset.

Thank you for your time. With best regards Hai Nguyen.

DanielLin1986 commented 3 years ago

Hi Hai,

Glad to hear from you again.

Yes. Our experiments also revealed that the performance of LSTM/GRU and their bidirectional forms on the SARD dataset is very similar (no statistically significant difference).

We also believe that, for the neural models, the patterns of the artificially synthesized samples are much easier to capture than those of the real-world samples.

The hard part is accurately detecting vulnerabilities in real-world projects. Binary-level detection is also challenging. Currently, I am facing difficulties in obtaining more labeled real-world data, and I am also seeking a neural network structure that can achieve much better detection results, which requires time and effort to explore.

Best regards,

Daniel

anhhaibkhn commented 3 years ago

Dear Mr. Daniel, Thank you for your reply.

Building up labeled real-world data is a tough task, since the supervised models require a larger dataset. I will try out different model structures on the nine-project dataset and let you know when the results come out.

Please let me know if there is anything I can contribute to extend the benchmark, for example, collecting more data or implementing new model structures. I look forward to reading your upcoming work.

Thank you for your time. With best regards, Hai Nguyen.