Closed anhhaibkhn closed 3 years ago
I managed to get past the above error by adding "self" to this line
def make_elmo_embedding(self, x)
in the class elmo_model.
After that, also inside that class, I tried to return the model
so that the function Summary() can be run
in "helper", but it shows a different error now. May I ask how to set the training process to not use ELMo, so that I can try other models first?
Also, would it be possible to upgrade the scripts for TensorFlow 2.x?
Thank you for reading my message.
Model: "Elmo_network"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None) 0
_________________________________________________________________
lambda_1 (Lambda) (None, None, 1024) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 256) 1180672
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 256) 394240
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 16448
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 1,591,425
Trainable params: 1,591,425
Non-trainable params: 0
_________________________________________________________________
C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:92: UserWarning: The TensorBoard callback `batch_size` argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback `batch_size` argument '
C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py:97: UserWarning: The TensorBoard callback does not support gradients display when using TensorFlow 2.0. The `write_grads` argument is ignored.
warnings.warn('The TensorBoard callback does not support '
Train on 397 samples, validate on 100 samples
Traceback (most recent call last):
File "main.py", line 35, in <module>
helper.exec()
File "E:\SecureCodeWithNLP\Projects\Function-level-Vulnerability-Detection\src\helper.py", line 275, in exec
class_weight = class_weights)
File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training.py", line 1239, in fit
validation_freq=validation_freq)
File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\engine\training_arrays.py", line 119, in fit_loop
callbacks.set_model(callback_model)
File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\callbacks.py", line 68, in set_model
callback.set_model(model)
File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\keras\callbacks\tensorboard_v2.py", line 116, in set_model
super(TensorBoard, self).set_model(model)
File "C:\Users\nguye\Anaconda3\envs\project_env\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1532, in set_model
self.log_dir, self.model._get_distribution_strategy()) # pylint: disable=protected-access
AttributeError: 'Model' object has no attribute '_get_distribution_strategy'
Sorry, I hope this reply finds you well.
As I suspected that the TensorFlow version might be the problem, I downgraded it to version 1.14, as you advised in the other topic. However, I still get the following error, which again looks like it comes from TensorFlow.
May I ask whether it would be possible to skip the callback process, or to try other models such as LSTM?
File "main.py", line 34, in <module>
helper.exec()
File "D:\Workspace\Research\Function-level-Vulnerability-Detection-master\src\helper.py", line 275, in exec
class_weight = class_weights)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1705, in fit
validation_steps=validation_steps)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\engine\training.py", line 1236, in _fit_loop
outs = f(ins_batch)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\keras\backend\tensorflow_backend.py", line 2482, in _call_
**self.session_kwargs)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "D:\Workspace\Research\Function-level-Vulnerability-Dataset\env\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Unable to get element as bytes.
Hi Anhhaibkhn, I am sorry for the late reply. It seems that the "class_weight = class_weights" caused the error. Please remove "class_weight = class_weights" and have a try.
To be compatible with Tensorflow 2.X and Keras 2.3.1, please remove the code that is related to TensorBoard. I am sorry that currently, the ELMo embedding is still not well supported. I will try to fix this ASAP.
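The two changes suggested above (dropping class_weight and leaving out the TensorBoard callback) can be sketched as a small helper that prepares the keyword arguments for model.fit. This is only an illustration under assumptions: `build_fit_kwargs` and its parameters are hypothetical names, not functions from the repository, and callbacks are matched by class name so that the sketch does not depend on a particular Keras version.

```python
def build_fit_kwargs(callbacks, class_weights=None, use_class_weight=False,
                     skip_callbacks=("TensorBoard",)):
    """Prepare keyword arguments for model.fit (hypothetical helper).

    Callbacks whose class name appears in skip_callbacks are left out, and
    class_weight is only passed on when explicitly requested.
    """
    kept = [cb for cb in callbacks if type(cb).__name__ not in skip_callbacks]
    fit_kwargs = {"callbacks": kept}
    if use_class_weight and class_weights is not None:
        fit_kwargs["class_weight"] = class_weights
    return fit_kwargs
```

With a compiled Keras model this could be used as `model.fit(x_train, y_train, **build_fit_kwargs(callbacks_list))`, where `callbacks_list` stands for whatever list the training script already builds.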
Hi Anhhaibkhn, if you would like to use other embedding methods, for example, to use Word2vec, just specify:
python main.py --config config/config.yaml --embedding word2vec
To use other types of networks, please change the 'model' tag to 'bilstm' in the configuration file named 'config.yaml'.
Dear Sir, Thank you very much for your reply. I have read your paper and have followed your work for a while.
- About the embedding method, can you tell me which version of GloVe you use? word2vec and fasttext seem to run fine, but "glove" does not.
- May I ask how to construct more datasets from the SARD Juliet test suite? As I read in your paper, the samples were randomly extracted from the Juliet test suite files, but could you explain the extraction process a bit more?
Thanks in advance.
No worries. Thank you for being interested in our project.
Hi Hai, the version of the glove Python implementation that I am using is 1.0.1. Yes. Actually, you can write a crawler to download the C test samples from the SARD Juliet suite. If you would like to have some of the extracted C functions from the SARD, please provide your email address and I can share the data with you.
Dear Sir, Thank you very much for your instant reply. It would be a great help to try out the C test samples from SARD. My email is: nguyenngochaibkhn@gmail.com There is a part of the code that confuses me a bit, and I hope you can help me understand it. When the GloVe model is called in the training process, it is supposed to return an embedding matrix:
elif embedding_method == 'glove':
    from src.embedding import Glove as Embedding_Model
    embedding_model = Embedding_Model(self.config)
    total_sequences, word_index = embedding_model.LoadTokenizer(total_list)
    embedding_model.TrainGlove(total_list)
    embedding_matrix, embedding_dim = embedding_model.ApplyGlove()
However, the body of the function does not return a matrix, only a dictionary:
def ApplyGlove(self):
    with open(self.config.tokenizer_saved_path + os.sep + 'glove.model') as f:
        glove_model = pickle.load(f)
    key_list = list(glove_model['dictionary'].keys())
    word_vector_list = glove_model['word_vectors'].tolist()
    embeddings_index = {}
    for index, item in enumerate(key_list):
        word = key_list[index]
        coefs = np.asarray(word_vector_list[index], dtype='float32')
        embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    return embeddings_index, self.components
At the moment, I am trying to implement other embedding methods, such as using pre-trained GloVe and BERT models. However, the embeddings from these pre-trained models have not been very useful so far, since the models were trained on natural-language text. I would be really grateful if you could suggest a direction for extending this study further.
Thank you very much for your time.
Hi, I have sent the data to hai@cysec.cs.ritsumei.ac.jp before, but it seemed that you did not receive it. I have already sent the data to nguyenngochaibkhn@gmail.com. Please check.
Yes, you are right. The GloVe model returns a dictionary of key-value pairs. The keys are the code tokens and the values are the generated embeddings, whose dimensionality is embedding_dim.
Great idea for using Glove and BERT for code analysis! I think the difficult part is to bridge the difference between natural languages and the software code. Looking forward to hearing the progress from you.
Dear Sir, Thank you very much for your reply. Sorry for the inconvenience, but neither of my email addresses has received the SARD C test data yet. Could you please check again? Regarding my comment above, my point is that after training the GloVe model, in order to proceed to the next part (training the neural network), shouldn't the function return the embedding matrix (a 2D list of vectors), as is done for Word2vec and fastText, instead of returning the dictionary? The neural network cannot proceed to the training stage because one of its arguments seems to be incorrect. Thank you for your time. Hai Nguyen.
Hi Hai, I think it was because of the size of the attachment. It exceeded the limit of your email server so it was not delivered. Please find the following Github link for data downloading: https://github.com/cybercodeintelligence/CyberCI or https://cybercodeintelligence.github.io/CyberCI/
I will have a look at the Glove model and will get back to you. ^_^
Thank you very much for your reply, Great resources. I will dive right into it. These techniques are extremely useful for a newbie in this field like me.
I look forward to hearing from you.
No worries. I am truly hoping that the resources would be helpful and useful.
With regard to the Glove model, I think the embeddings_index and the components returned by the function ApplyGlove() are the embedding matrix and the embedding size needed for the subsequent processing. The embedding matrix is constructed using the dictionary. Please correct me if I am wrong since I wrote the code a long time ago.
Your comments and suggestions are welcomed.
Thank you for your reply,
Yes; however, the embedding matrix is not yet constructed in this case, since the function ApplyGlove only returns embeddings_index, which is a dictionary. So the function may need to be modified slightly to return the embedding matrix.
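The kind of modification discussed here could be sketched as follows. This is not the repository's implementation: `build_embedding_matrix` is a hypothetical helper, and it assumes that word_index maps each token to an integer id starting at 1 (as the Keras Tokenizer does), with row 0 reserved for padding. It uses plain Python lists so that it runs without extra dependencies.

```python
def build_embedding_matrix(word_index, embeddings_index, embedding_dim):
    """Turn a token -> vector dictionary into a row-per-index matrix.

    word_index maps each token to its integer id (assumed to start at 1);
    row 0 is kept as zeros for padding. Tokens without a trained vector
    also stay as zero rows.
    """
    matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = [float(v) for v in vector]
    return matrix
```

Calling `np.asarray(matrix, dtype='float32')` on the result would give a 2D array of shape (len(word_index) + 1, embedding_dim), which is the form usually passed as the initial weights of a Keras Embedding layer.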
Also, I can run the code with the other embedding methods, but right now I cannot use the GPU for training. I use CUDA 10.1 on Ubuntu 18.04. It seems that an incompatibility between drivers causes this. May I ask which CUDA and tensorflow-gpu versions you use? Sorry that this post is becoming longer than expected. Thank you for your time.
Thanks for pointing out the bug. I will have a look.
I have tested the code on Windows 10 with Cuda 10.0. The driver version is 445.87. Tensorflow-gpu version is 1.14. I also tested the code on Windows 10 with Tensorflow-cpu whose version is 2.0.
Cuda 10.1 should be alright. Have you checked the GPU driver? What is the size of the GPU memory?
While looking into the GloVe issue, I found that there was something wrong with my glove_python installation, and I could not reinstall it. I have not found a solution yet.
I was also unable to install glove_python in my Windows 10 virtual env (Python 3.6.9). I switched to Ubuntu with Python 3.6.9, where I was able to install glove_python. Here is my driver installation:
$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.243
nvidia-smi
Tue Jun 9 20:11:37 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:01:00.0 Off | N/A |
| 0% 32C P8 16W / 250W | 396MiB / 7981MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1186 G /usr/lib/xorg/Xorg 158MiB |
| 0 N/A N/A 1448 G /usr/bin/gnome-shell 131MiB |
| 0 N/A N/A 1901 G ...mviewer/tv_bin/TeamViewer 2MiB |
| 0 N/A N/A 2163 G ...AAAAAAAAA= --shared-files 59MiB |
| 0 N/A N/A 2729 G ...oken=11205890301177880090 39MiB |
+-----------------------------------------------------------------------------+
Please take a look. Right now I can only run on the CPU as a temporary workaround, since the GPU_Flag cannot be set. Thank you for your time.
tf.test.is_gpu_available()
2020-06-09 20:49:04.371790: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-09 20:49:04.403490: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-06-09 20:49:04.403715: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x2674db0 executing computations on platform Host. Devices:
2020-06-09 20:49:04.403730: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0):
False
I am trying to downgrade CUDA 10.1 to 10.0, following this link
CUDA 10.1 is fine. As far as I can tell, there is no issue with your GPU settings. However, your output of tf.test.is_gpu_available() does not show any GPU device. The following is my output:
>>> tf.test.is_gpu_available()
2020-06-10 11:16:45.576210: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-06-10 11:16:45.602357: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2020-06-10 11:16:45.936833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-10 11:16:45.941222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:0a:00.0
2020-06-10 11:16:45.941391: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-06-10 11:16:45.947961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2020-06-10 11:16:53.403187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-10 11:16:53.403271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1
2020-06-10 11:16:53.403531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N
2020-06-10 11:16:53.403562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N
2020-06-10 11:16:53.449045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 10603 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-10 11:16:53.470769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 6808 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:0a:00.0, compute capability: 6.1)
True
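Besides reading the log lines, the devices that TensorFlow actually sees can be listed directly. The following is a minimal sketch: `list_gpu_devices` is a hypothetical helper name, and it relies on `device_lib.list_local_devices()` from `tensorflow.python.client`, which is assumed to be importable in the installed TensorFlow version. It returns None instead of failing when TensorFlow is not installed.

```python
def list_gpu_devices():
    """Return the names of the GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry describes one local device; keep only the GPU ones.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


gpus = list_gpu_devices()
if gpus is None:
    print("TensorFlow is not installed in this environment.")
elif gpus:
    print("Visible GPUs:", gpus)
else:
    print("TensorFlow sees no GPU; training will fall back to the CPU.")
```

An empty list here corresponds to the case above where tf.test.is_gpu_available() returns False.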
Dear @DanielLin1986, Thank you very much for your reply. Sorry, I have been trying to get tensorflow-gpu to run. Finally, by using the NVIDIA CUDA Docker image with tensorflow-gpu 1.14.0 installed on top of it, I managed to run the code on the GPU. However, the training process now shows a cuDNN error, as below:
[INFO] Model structure loaded.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 1000) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 1000, 100) 22900
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM) (None, 1000, 128) 117760
_________________________________________________________________
dropout_1 (Dropout) (None, 1000, 128) 0
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM) (None, 1000, 128) 132096
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 8256
_________________________________________________________________
dense_2 (Dense) (None, 32) 2080
_________________________________________________________________
dense_3 (Dense) (None, 1) 33
=================================================================
Total params: 283,125
Trainable params: 260,225
Non-trainable params: 22,900
_________________________________________________________________
Train on 4 samples, validate on 2 samples
W0617 08:48:36.500040 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
W0617 08:48:36.500195 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Epoch 1/100
2020-06-17 08:48:36.744389: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-06-17 08:48:37.125719: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
File "main.py", line 34, in <module>
helper.exec()
File "/home/Share/FunctionLevelVulnerabilityDetectionUpgrading/src/helper.py", line 340, in exec
class_weight = class_weights)
File "/usr/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Fail to find the dnn implementation.
[[{{node cu_dnnlstm_1/CudnnRNN}}]]
(1) Unknown: Fail to find the dnn implementation.
[[{{node cu_dnnlstm_1/CudnnRNN}}]]
[[loss/mul/_99]]
0 successful operations.
0 derived errors ignored.
Can you take a look at this and let me know what went wrong here? Best regards, Hai Nguyen.
Dear @DanielLin1986 , Thank you very much for your reply, Sorry, I have been trying to make tensorflow-gpu to run, finally by using cuda-nvidia docker, and put tensorflow-gpu 1.14.0 on top of it, I managed to run the code with GPU. However, at the training process now it showed CUDNN error as below:
[INFO] Model structure loaded.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 1000) 0
_________________________________________________________________
embedding_1 (Embedding) (None, 1000, 100) 22900
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM) (None, 1000, 128) 117760
_________________________________________________________________
dropout_1 (Dropout) (None, 1000, 128) 0
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM) (None, 1000, 128) 132096
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 8256
_________________________________________________________________
dense_2 (Dense) (None, 32) 2080
_________________________________________________________________
dense_3 (Dense) (None, 1) 33
=================================================================
Total params: 283,125
Trainable params: 260,225
Non-trainable params: 22,900
_________________________________________________________________
Train on 4 samples, validate on 2 samples
W0617 08:48:36.500040 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
W0617 08:48:36.500195 140595917236032 deprecation_wrapper.py:119] From /usr/lib/python3.6/site-packages/keras/callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Epoch 1/100
2020-06-17 08:48:36.744389: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-06-17 08:48:37.125719: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
Traceback (most recent call last):
  File "main.py", line 34, in <module>
    helper.exec()
  File "/home/Share/FunctionLevelVulnerabilityDetectionUpgrading/src/helper.py", line 340, in exec
    class_weight = class_weights)
  File "/usr/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
    [[{{node cu_dnnlstm_1/CudnnRNN}}]]
  (1) Unknown: Fail to find the dnn implementation.
    [[{{node cu_dnnlstm_1/CudnnRNN}}]]
    [[loss/mul/_99]]
0 successful operations.
0 derived errors ignored.
Could you take a look at this and let me know what went wrong here? Best regards, Hai Nguyen.
Hi Hai,
This is because the GPU could not load the cuDNN library. The Docker image needs the NVIDIA cuDNN library installed. If you have difficulty installing cuDNN, please try using LSTM instead of CuDNNLSTM.
Dear @DanielLin1986 , Thank you for your reply.
2020-06-17 08:48:36.849971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-06-17 08:48:37.125541: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
The log shows that libcudnn was loaded successfully. Do you think the error may be related to the GPU setting config.gpu_options.allow_growth = True, as in this link?
If I use LSTM instead of CuDNNLSTM, I am afraid the whole point of using the GPU to speed up the training process would be lost. Best, Hai Nguyen.
Dear @DanielLin1986 ,
Thank you very much for your support.
After adding
$ export TF_FORCE_GPU_ALLOW_GROWTH=true
I am now able to run with CuDNNLSTM. To be honest, I do not fully understand this setting yet.
I will let you know more after trying the other models and the validation phase.
Best regards,
Hai Nguyen.
Dear @DanielLin1986 , @cybercodeintelligence Thank you for your support as always. After fixing the cuDNN error, I looked into the problem with the Glove embedding and fixed it as follows. In embedding.py:
def TrainGlove(self, data_list):
    from glove import Corpus, Glove
    # creating a corpus object
    print("----------------------------------------")
    print("Start training the GLoVe model. Please wait.. ")
    corpus = Corpus()
    corpus.fit(data_list, window=self.glove_window)
    glove = Glove(no_components=self.components, learning_rate=self.glove_learning_rate)
    glove.fit(corpus.matrix, epochs=self.glove_epoch, no_threads=self.n_workers, verbose=True)
    glove.add_dictionary(corpus.dictionary)
    glove.save(self.tokenizer_saved_path + 'glove.model')  # This is to save the model as a pkl file.
    # Save Glove model as .txt format for checking content
    vector_size = self.components
    with open(self.tokenizer_saved_path + 'results_glove.txt', "w") as f:
        for word in glove.dictionary:
            f.write(word)
            f.write(" ")
            for i in range(0, vector_size):
                f.write(str(glove.word_vectors[glove.dictionary[word]][i]))
                f.write(" ")
            f.write("\n")
    print("GLOVE SAVE HERE", self.tokenizer_saved_path + 'glove.model')
    print("Model training completed!")
    print("----------------------------------------")

def ApplyGlove(self, word_index):
    print("GLOVE OPEN HERE", self.tokenizer_saved_path + 'glove.model')
    from glove import Corpus, Glove
    glove_model = Glove.load(self.tokenizer_saved_path + 'glove.model')
    print('glove model', glove_model.dictionary)
    print('glove model', glove_model.word_vectors)
    key_list = list(glove_model.dictionary.keys())
    word_vector_list = glove_model.word_vectors.tolist()
    # with open(self.tokenizer_saved_path + 'glove.model', 'rb') as f:
    #     glove_model = pickle.load(f)
    # key_list = list(glove_model['dictionary'].keys())
    # word_vector_list = glove_model['word_vectors'].tolist()
    embeddings_index = {}
    for index, item in enumerate(key_list):
        word = key_list[index]
        coefs = np.asarray(word_vector_list[index], dtype='float32')
        embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    embedding_matrix = np.zeros((len(word_index) + 1, self.components))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return embedding_matrix, self.components
then in helper.py
embedding_matrix, embedding_dim = embedding_model.ApplyGlove(word_index)
With the above modifications, I could train on the data with Glove. Would you mind taking a look at this fix to see whether it is legitimate?
Also, when running experiments at the function level, is it okay to put all the function files from the 9 projects into one data folder and then run the Python scripts pointing to that folder?
Thank you for your time. Best regards, Hai Nguyen.
Hi Hai,
This is a great finding!
Setting "TF_FORCE_GPU_ALLOW_GROWTH=true" prevents TensorFlow from allocating all of the GPU memory to the process up front. However, the error "Fail to find the dnn implementation." seemed to be related to the DNN implementation, so I did not expect that adding this line would help, which is great anyway! ^^ To be frank, I also do not fully know how this works... Please keep me updated on any new findings.
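For context, this environment variable is the counterpart of the config.gpu_options.allow_growth session option mentioned earlier in the thread: GPU memory is claimed on demand rather than all at once. One commonly reported explanation, not confirmed in this thread, is that reserving all memory up front leaves none for cuDNN to create its handle. A minimal sketch of setting the variable from Python instead of the shell, assuming it runs before TensorFlow is imported (the helper name enable_gpu_memory_growth is hypothetical):

```python
import os


def enable_gpu_memory_growth(env=os.environ):
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all."""
    # Assumption: this runs before TensorFlow initializes the GPU, i.e. ahead
    # of `import tensorflow` / `import keras` in the entry script.
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


enable_gpu_memory_growth()
```

Setting it in code keeps the behavior tied to the script rather than to whichever shell happens to launch it.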
Yes, you're right. If you can use the CuDNN LSTM, the training process will be significantly faster.
Best regards,
Daniel Lin
Hi Hai,
Thank you very much for your help and contribution! I have read the code you wrote and did not see any issues. Currently, I cannot run the code myself since my machine has problems installing glove. If you can run the code successfully, that means everything is fine!
It is okay to arrange all the function files from 9 projects into 1 data folder. Please change the path of the codebase to the data folder and it should be fine.
You are free to modify the code according to your needs. ^_^
Best regards,
Daniel Lin
Dear @DanielLin1986 @cybercodeintelligence , Thank you very much for your reply. So far, I have learned a lot from your paper and your project. It is really helpful for a newcomer to this research field like me. The code runs fine on tensorflow 1.14-gpu, so I tried it on tensorflow 2.x, but it showed an error, which you may already be aware of:
src/DataLoader.py", line 119, in SavedPickle
pickle.dump(file_to_save, handle)
TypeError: can't pickle _thread.RLock objects
May I ask what the purpose of saving the pickle file during training is, since we already save the model as an .h5 file? Will removing the Tensorboard part from the callback list affect the evaluation phase?
Also, the paper describes the top-k percentage metric as a list of retrieved functions accounting for k% of the total functions in the test set. How can I customize the top-k percentage (10%, 20%, 50%) and check it after the training process?
Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen
Hi Hai,
The pickle.dump() call in DataLoader.py is meant to save the processed code sequences. The function is not actually used in DataLoader.py, so if it causes the error you can comment it out. Pickle is a standard Python module; if it works with TensorFlow 1.x, it should also work with TensorFlow 2.x. Please have a look.
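As background on the error message itself: "can't pickle _thread.RLock objects" means that the object handed to pickle.dump contains a thread lock somewhere inside it, which pickle cannot serialize. Which object in the project holds such a lock under TensorFlow 2.x is not stated in this thread. A minimal sketch for narrowing it down, assuming the data to be saved can be inspected as a dict (find_unpicklable and the payload are hypothetical):

```python
import pickle
import threading


def find_unpicklable(items):
    """Return the keys of a dict whose values cannot be pickled."""
    bad_keys = []
    for key, value in items.items():
        try:
            pickle.dumps(value)
        except (TypeError, AttributeError, pickle.PicklingError):
            bad_keys.append(key)
    return bad_keys


# Hypothetical payload: plain token sequences pickle fine, a lock does not.
payload = {"sequences": [[1, 2, 3], [4, 5]], "lock": threading.RLock()}
print(find_unpicklable(payload))  # prints ['lock']
```

Saving only plain data (lists, dicts, NumPy arrays) and leaving framework objects out of the pickle avoids this class of error.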
Yes, you can customize the top-k percentage (10%,20%,50%). The result is a list of retrieved functions with probabilities of being vulnerable. You can rank the functions based on their probabilities. If your test set contains 100 samples, to obtain the top-10% is to get the first 10 samples ranked by their probabilities of being vulnerable (10 most probable ones).
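The ranking procedure described above can be sketched as follows. This is an illustrative implementation, not the project's own code: the function name top_k_metrics and its arguments are assumptions, with probabilities standing for the predicted probability of being vulnerable and labels for the ground truth (1 = vulnerable, 0 = non-vulnerable).

```python
def top_k_metrics(probabilities, labels, k_percent):
    """Return TP, FP, FN, precision and recall for the top k% ranked samples."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    # Rank sample indices from most to least likely vulnerable.
    ranked = sorted(range(len(labels)), key=lambda i: probabilities[i], reverse=True)
    cutoff = int(len(ranked) * k_percent / 100)  # e.g. 10% of 100 samples -> 10
    retrieved = ranked[:cutoff]
    tp = sum(1 for i in retrieved if labels[i] == 1)
    fp = len(retrieved) - tp
    fn = sum(labels) - tp  # vulnerable samples that were not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy data: 10 samples, 2 of them vulnerable; inspect the top 30%.
probs = [0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6, 0.4, 0.05, 0.5]
truth = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
example = top_k_metrics(probs, truth, 30)  # tp=2, fp=1, fn=0
```

In the toy data FN@k% is 0 while FP@k% is 1: FN counts vulnerable samples left outside the top k%, FP counts non-vulnerable samples inside it, so the two are in general different.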
Best regards,
Daniel
Dear @DanielLin1986, Thank you very much for your support. Sorry that it took me a while to reproduce the experiments.
Eventually, on tensorflow 1.14.1 and using the provided dataset of 9 projects, I managed to train the Bi-LSTM model with 3 embedding methods: Word2vec, Glove and Fasttext.
I am now at step 2: testing the trained models. I put all the vulnerable and non-vulnerable functions into one folder, then ran the test for the trained models with test_set_path pointing to that folder (the same folder used for training). For example, I trained the Bi-LSTM model with the Word2vec embedding method on the proposed dataset, ran the test on it, and received the csv result. Here is its log:
61168/61168 [==============================] - 101s 2ms/step
[INFO] bilstm classification result:
[INFO] Total accuracy: 0.9758565369106641
[INFO] ----------------------------------------------------
[INFO] The confusion matrix:
[[59691 4]
[ 1473 0]]
precision recall f1-score support
Non-vulnerable 0.98 1.00 0.99 59695
Vulnerable 0.00 0.00 0.00 1473
accuracy 0.98 61168
macro avg 0.49 0.50 0.49 61168
weighted avg 0.95 0.98 0.96 61168
Here, may I confirm whether I understand correctly how to obtain the top-k percentage metric? For example, for the top 10%: I ran the test on 61168 files, so the top 10% would be around 6116 files with the highest "Probs. of being vulnerable" in the collected csv file. Then, based on the labels, TP@k% are the actually vulnerable functions and FP@k% are the falsely flagged (non-vulnerable) functions among those 6116 files. If FN@k% refers to the truly vulnerable samples missed by the model when returning k% of the functions, would FN@k% be equal to FP@k%? I apologize that this comment is a bit long; I just want to confirm my understanding of your paper.
Also, regarding the SARD dataset: since the file names of the vulnerable functions do not contain the keyword 'cve' or 'CVE' used for generating labels, do I need to modify the GenerateLabels function to catch different keywords, such as 'bad' for vulnerable and 'good' for non-vulnerable ones?
Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen
Dear @DanielLin1986 , @cybercodeintelligence Thank you very much for your support as always.
Formulas in the paper:
P@K% = TP@k% / (TP@k% + FP@k%)
R@K% = TP@k% / (TP@k% + FN@k%)
For example: the total number of test functions was 61168, of which 1473 were vulnerable. The top 10% contains around 6116 files with the highest vulnerable probabilities, but only 1470 of them were truly vulnerable: 3 vulnerable files were missed, and 4646 non-vulnerable files were retrieved. So please correct me if I am wrong:
TP@k% = 1470
FP@k% = 4646
FN@k% = 3
P@K and R@K will then be calculated with the above formulas.
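Plugging the example counts from this comment into the two formulas gives the following worked calculation. The counts are the illustrative figures quoted above, not measured results from the thread.

```latex
\begin{align*}
P@10\% &= \frac{TP@10\%}{TP@10\% + FP@10\%}
        = \frac{1470}{1470 + 4646}
        = \frac{1470}{6116} \approx 24.04\% \\
R@10\% &= \frac{TP@10\%}{TP@10\% + FN@10\%}
        = \frac{1470}{1470 + 3}
        = \frac{1470}{1473} \approx 99.80\%
\end{align*}
```

The denominators double as a consistency check: TP + FP equals the number of retrieved files (6116), and TP + FN equals the number of vulnerable files in the test set (1473).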
Thank you very much for your time. Looking forward to hearing from you. Hai Nguyen.
Hi Hai,
Sorry for the late reply. Yes, regarding the Top-k part, I think you have correctly understood it.
The vulnerable functions of the SARD data are all in one folder and the non-vulnerable ones are in another folder, which is different from the real-world samples from 9 open-source projects. The GenerateLabels function is for generating labels for the data from 9 open-source projects.
You will have to tune the networks, since the result you provided shows that the precision and recall of the vulnerable class are both zero.
Best regards,
Daniel
Dear @DanielLin1986 ,
Thank you for your reply,
It seems I misunderstood how to use the benchmark properly. If the GenerateLabels function is only for the 9-project dataset, then how do we get the labels after loading the SARD dataset?
I am not sure how to tune the model properly. The test results I collected after training the Bi-LSTM with the word2vec embedding on the 9 projects showed very low precision for the top-k metrics.
Thank you for your time. Looking forward to hearing from you. Best regards Hai
Hi Hai,
Apologies for the late reply.
Obtaining the labels from the SARD dataset may require you to write the code. For example, you can specify the SARD data from the vulnerable function folder as "1", and the data from the non-vulnerable function folder as "0".
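The folder-based labeling suggested above could look like the sketch below. It is only an illustration: the function name and the two folder arguments are assumptions, and the result would still need to be wired into the project's data loader.

```python
from pathlib import Path


def label_sard_functions(vulnerable_dir, non_vulnerable_dir):
    """Return (file_path, label) pairs: 1 for vulnerable, 0 for non-vulnerable."""
    labelled = []
    for folder, label in ((vulnerable_dir, 1), (non_vulnerable_dir, 0)):
        # sorted() keeps the order reproducible across runs and platforms.
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                labelled.append((str(path), label))
    return labelled
```

Because the label comes from the folder rather than from the file name, no 'cve' keyword matching is involved.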
If you use the SARD dataset or the 9-project dataset, please set using_separate_test_set in the configuration file to False. The code will then automatically partition the dataset into training, validation, and test sets, and you just need to wait for the results. With using_separate_test_set set to False, you only need to specify where your SARD dataset or 9-project dataset is. ^_^
If you set using_separate_test_set to True, the code will only partition the dataset into the training set and the validation set, which means you have to specify a path (test_set_path) where the test set is stored, so that the trained model is tested on your specified test set.
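The two modes can be summarized in a few lines of code. This is a sketch of the behavior as described in this reply, not the project's implementation; it reads the boolean switch as using_separate_test_set and the path as test_set_path, and plan_data_splits and data_path are hypothetical names.

```python
def plan_data_splits(using_separate_test_set, data_path, test_set_path=None):
    """Describe which folder feeds which split, following the reply above."""
    if using_separate_test_set:
        if not test_set_path:
            raise ValueError(
                "test_set_path is required when using_separate_test_set is True"
            )
        # The data folder is only split into training and validation sets.
        return {"train": data_path, "validation": data_path, "test": test_set_path}
    # One folder is partitioned into all three sets.
    return {"train": data_path, "validation": data_path, "test": data_path}
```

For example, plan_data_splits(False, "data/") maps all three splits to data/, whereas passing True without a test_set_path raises an error.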
Best regards,
Daniel
Dear @DanielLin1986 , Thank you very much for your reply,
"If you use the SARD dataset or the 9-project dataset, please set the using_separate_test_set in the configuration file to False. Then, the code will automatically partition the dataset into training, validation, and test sets. Then, just wait to see the results. When you set the using_separate_test_set to False, you just need to specify where your SARD dataset or the 9-project dataset is."
Does this mean that when I train with the SARD dataset or the 9-project dataset, I should put the data in the default data folder data/? I ask because, as far as I can see, the code no longer accepts a data folder argument when testing the trained model. So I either set using_separate_test_set to True and state the location of the test set, or set it to False, and the code tests on the data in the default folder. However, I think the current code does not support the SARD dataset, and I still need to modify the label function for this case regardless of whether using_separate_test_set is True or False.
"If you set using_separate_test_set to True, the code will only partition the dataset into the training set and the validation set, which means you have to specify a path where the test set is stored, so the trained model will test on your specified test set."
If using_separate_test_set is set to True, the code partitions the dataset into training and validation sets, but I think this happens only in step 1 (training). In step 2, should the loaded model go straight to testing on the separate test set?
I also sent you an email last week; could you confirm whether you received it? Thank you for your time. Looking forward to hearing from you. Best regards, Hai Nguyen
Dear @DanielLin1986 , thank you for your last reply.
I set using_separate_test_set to False and put the whole 9-project dataset into the default data folder data/. I was able to get the following results with the Bi-LSTM and the word2vec embedding method:
precision recall f1-score support
Non-vulnerable 0.98 1.00 0.99 11916
Vulnerable 0.77 0.39 0.52 318
accuracy 0.98 12234
macro avg 0.87 0.69 0.75 12234
weighted avg 0.98 0.98 0.98 12234
This looks better than the previously posted result, but I have not changed any hyperparameters, so I am not sure whether tuning can improve the above results.
I also want to try the SARD dataset to reproduce your work. I understand the labeling idea from your last reply; I just want to ask whether there is anything I should pay attention to before adding the SARD label-generating function, for example when labeling the data before training and testing.
Thank you for your time. Best regards, Hai Nguyen
Hi Hai,
Apologies again for the late reply.
Well done! You have a result. However, I am afraid that tuning and optimizing a neural network is a challenging task; I am also working on it. I think tuning and optimizing the hyperparameters can be experience-driven and heuristic (I may be wrong). You can start by modifying the optimizer, the batch_size, and the number of epochs.
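One simple way to organize such experiments is to enumerate the combinations up front and train once per combination. The sketch below only generates the configurations; the candidate values are arbitrary placeholders, not recommendations from the thread, and iter_configs is a hypothetical helper.

```python
from itertools import product

# Placeholder candidates for the three settings named above.
search_space = {
    "optimizer": ["adam", "sgd"],
    "batch_size": [16, 32, 64],
    "epochs": [50, 100],
}


def iter_configs(space):
    """Yield one dict per combination of the candidate values."""
    keys = sorted(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))  # 2 * 3 * 2 = 12 combinations
```

Each resulting dict can then be written into the configuration file (or passed to the training call) for one run, with the validation metrics recorded alongside it.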
At the current stage, the data, particularly the vulnerable data, is insufficient for training a neural network. Therefore, obtaining more labeled data for training can definitely contribute to a better detection result.
With regard to the SARD data, I think you have everything ready as long as you can obtain the SARD label. One thing you may need to consider is the "good" and "bad" phrases that appear in the function body. These phrases may bias the model. You can have a try first on the SARD data anyway.
Best regards,
Daniel Lin
Dear @DanielLin1986 , @cybercodeintelligence, Thank you very much for your reply. I will run more experiments to tune the neural network for the 9-project dataset as you advised.
Regarding the SARD data, thanks to the organized folders (non-vul and vul) you provided, writing Python scripts to label them can be done without problems. However, had you not mentioned the words 'bad' and 'good' in the function body, I would not have noticed this source of model bias. Thank you so much! The NLP embedding may rely on those words, rather than on the function content, to predict the vulnerability. May I ask whether I should create a script that replaces those words (good, bad) with dummy names before training the classifier? Also, if possible, could you advise me on how to get the test results for the SARD dataset as in your paper?
Thank you for your time. With best regards Hai Nguyen
Hi Hai,
No worries.
Yes. Using dummy names to replace the "good" "bad" words was exactly what I did.
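A minimal sketch of such a replacement step is shown below. The thread does not give the exact script, so the regular expression, the function name and the placeholder "dummy" are all assumptions.

```python
import re

# Matches "good"/"bad" in any capitalisation, including inside identifiers
# such as goodG2B or CWE121_bad.
_HINT_WORDS = re.compile(r"good|bad", re.IGNORECASE)


def neutralize_hint_words(source_code, placeholder="dummy"):
    """Replace label-revealing words with one neutral placeholder."""
    return _HINT_WORDS.sub(placeholder, source_code)
```

Using the same placeholder for both words keeps the two classes indistinguishable by name. The pattern is deliberately crude: it would also rewrite unrelated identifiers that merely contain "bad" or "good" (for example "badge"), so a stricter pattern may be preferable for real data.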
To automatically obtain the test results for the SARD dataset may not be easy. What I did was:
Hopefully, the above description answers your question.
Best regards,
Daniel Lin.
Dear @DanielLin1986 , cc @cybercodeintelligence , That was a great idea. Thank you very much for your help.
I will process the SARD dataset as you advised. I hope to reproduce the result for the SARD dataset this week and will get back to you as soon as I can.
Thank you for your time. With best regards Hai Nguyen
Hi Hai,
Good luck. Hope that everything goes well.
Best regards,
Daniel Lin
Dear @DanielLin1986 , @cybercodeintelligence , Thank you very much for your last reply.
I have added a script that processes the SARD dataset to remove these keywords: 'bad', 'BAD', 'GOOD', 'good', and then labels each file according to its file name. Here are the logs I collected when running the test for the model trained with word2vec:
Total number of vulnerable functions is: 3318/15000
TOP 1.00% 150 files
value 1 was found 150 times
The True positive samples TP@k% is 150
value 0 was found 0 times
The False positive samples FP@k% is 0
The top-k percentage P@K% is 100.00%
The top-k percentage R@K% is 4.52%
TOP 10.00% 1500 files
value 1 was found 1500 times
The True positive samples TP@k% is 1500
value 0 was found 0 times
The False positive samples FP@k% is 0
The top-k percentage P@K% is 100.00%
The top-k percentage R@K% is 45.21%
TOP 20.00% 3000 files
value 1 was found 2673 times
The True positive samples TP@k% is 2673
value 0 was found 327 times
The False positive samples FP@k% is 327
The top-k percentage P@K% is 89.10%
The top-k percentage R@K% is 80.56%
TOP 50.00% 7500 files
value 1 was found 3317 times
The True positive samples TP@k% is 3317
value 0 was found 4183 times
The False positive samples FP@k% is 4183
The top-k percentage P@K% is 44.23%
The top-k percentage R@K% is 99.97%
This is very close to the results you reported in the paper; I think the difference comes from the number of vulnerable files. Please correct me if I am wrong: I think the 'train_test_split' function automatically shuffles and splits the data according to Test_set_ratio. So, do we still need to mix the data before using it for training?
It has been a productive time following your work. I noticed that the paper did not mention the 'attention' method or the 'Elmo' model implementation. May I ask for ideas on how to implement the 'attention' method for the deep learning model, as well as the possibilities for extending your benchmark with other language models like BERT or Elmo?
Thank you very much for your time. With best regards Nguyen Hai.
Hi Hai,
You are welcome. Thanks for being interested in our work, again.
Your results look good. According to your results, using the framework, examining 50% of the functions could identify 99.97% of the vulnerable functions (Top 50% recall was 99.97%). This is very close to our results.
Yes. I also think that the 'train_test_split' function shuffles and partitions the data. When using the Six-project data sets, I mixed the data to make sure that data from every project exists in the training, validation, and test sets. Therefore, I think you can mix the data according to your experimental settings.
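As an illustration of that mixing step, here is a rough sketch that splits each project separately, so that every project contributes to all three sets. If the 'train_test_split' in question is scikit-learn's, it shuffles by default (`shuffle=True`), but a plain shuffle alone does not guarantee that a small project lands in every split. The function `mix_and_split`, the ratios, and the project names are hypothetical and are not taken from the repository.

```python
import random


def mix_and_split(samples_by_project, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Split every project on its own, then merge and shuffle each split.

    Assumes each project has enough samples for all three splits.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for project, samples in samples_by_project.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_ratio))
        n_val = max(1, round(len(shuffled) * val_ratio))
        if n_test + n_val >= len(shuffled):
            raise ValueError(f"{project} has too few samples for these ratios")
        # Tag each sample with its project so the origin stays visible.
        tagged = [(project, sample) for sample in shuffled]
        test.extend(tagged[:n_test])
        val.extend(tagged[n_test:n_test + n_val])
        train.extend(tagged[n_test + n_val:])
    # Shuffle again so that projects are interleaved inside each split.
    for split in (train, val, test):
        rng.shuffle(split)
    return train, val, test


# Hypothetical data: integer IDs stand in for function samples.
data = {"project_a": list(range(10)), "project_b": list(range(100, 120))}
train, val, test = mix_and_split(data)
print(len(train), len(val), len(test))
```

Whether to split per project like this, or to pool everything and rely on a single shuffled split, depends on the experimental setting.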
Yes. You are right. The 'attention' and 'ELMo' parts were added after the paper was completed. The attention (HAN) mechanism that I am using was implemented by Luiz Felix (GitHub link: https://github.com/lzfelix/keras_attention). Please feel free to use the code. You can also use other types of attention mechanisms. For example, the paper 'Software Defect Prediction via Attention-Based Recurrent Neural Network' uses the self-attention mechanism (correct me if I am wrong).
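To give a rough idea of what such a layer computes, here is a simplified, dependency-free sketch of attention pooling over the per-token outputs of an RNN. It is not the keras_attention implementation: it scores each hidden state by a dot product with a context vector and leaves out the learned tanh projection that HAN-style attention applies first. The names `attention_pool`, `hidden_states`, and `context` are chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Collapse a sequence of hidden vectors into one weighted-average vector.

    hidden_states: list of T vectors, one per time step (e.g. LSTM outputs)
    context: vector of the same size; a trainable parameter in a real model
    """
    # One relevance score per time step.
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(context)
    # Weighted sum of the hidden vectors, dimension by dimension.
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(hidden_states, context=[10.0, 0.0])
print(weights)
print(pooled)
```

In a trained model the context vector (and any projection in front of it) is learned together with the rest of the network, so the weights indicate which tokens the classifier relies on.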
BERT and ELMo are good ideas! I am also working on them. Personally, I think using these language models as code embedding solutions is feasible. However, the ELMo embedding results are not significantly better than those of the current deep learning-based methods. BERT is challenging because it needs very powerful GPUs. In any case, how to effectively exploit the potential of these language models for code embeddings is still an open question. Maybe we need to customize the language models and/or the neural network to fit the code. You are welcome to discuss this further.
Best regards,
Daniel Lin
Dear Mr. Daniel Lin, cc: @DanielLin1986 , @cybercodeintelligence Thank you very much for your detailed explanation!
While surveying deep learning methods for detecting code vulnerabilities, I have found your papers, together with your projects, to be greatly helpful for my graduate study. I am currently pursuing my MS degree at Ritsumeikan University in Japan, and I would like to know whether it is possible to use your project as the foundation for my own research. I would love to have further discussion with you when you have free time.
Thank you very much for your time. With best regards Hai Nguyen
Hi Hai,
No worries. It is our pleasure.
Yes. Of course. You are welcome to use our projects, including our code and data, to start your research, as long as you kindly cite our papers. You can use, modify, and improve the code. However, please note that if you would like to share our data with your classmates or fellow researchers, please let us know.
Yes. Discussions and comments are welcome. Thank you for being interested in our projects.
All the best!
Regards,
Daniel Lin
Dear @DanielLin1986 , @cybercodeintelligence , Thank you very much for your answer,
This is great news for me. Yes, of course, I will cite your papers properly and let you know if anyone wants access to your data. Thank you again for your kind help and generosity. I am the only one in my lab doing this research and, to be honest, I have learned the most from your papers and experiments.
Recently, I added pre-trained GloVe and pre-trained BERT to the embedding module. However, I noticed that changing the embedding method did not really improve precision or recall on the Nine-projects dataset, and the same is true for the SARD dataset. So, I am thinking of trying different deep learning models, such as GRU and Bi-GRU, instead of LSTM and Bi-LSTM. Can you think of any other deep models, or changes to the current model structures, that could potentially improve the current results? Your advice would be very useful to me.
I have also read your recent paper, 'DeepBalance: Deep-Learning and Fuzzy Oversampling for Vulnerability Detection'. Rebalancing the dataset with the fuzzy oversampling method was a great idea. It would be great if you could let me know of any conference or online workshop that you are going to attend or present at. I would really appreciate the chance to hear you present in person.
Thank you for your time. With best regards Hai Nguyen
Dear @DanielLin1986 , @cybercodeintelligence Thank you for your last reply. I hope you are doing well.
Currently, I am working on constructing different models, like the one in the paper you suggested, 'Software Defect Prediction via Attention-Based Recurrent Neural Network', which uses the self-attention mechanism.
The paper's model is built from a Bi-LSTM layer followed by an attention layer. I can't help but notice that the benchmark currently employs the LSTM_with_HAN model, which has only an LSTM layer and not a bidirectional layer as in the mentioned paper.
Also, I got the results after trying out the other RNNs (e.g. BiGRU and GRU). They are very similar to those of Bi-LSTM and LSTM on the SARD dataset, where all top 1% and top 10% precision rates reached 100%, and the top 50% recall rate reached 100%. Do you think that is because the dataset was artificially synthesized, so that almost all models can reach satisfying results? If so, at this point the only useful comparisons would be between the detectors trained on the Nine-projects dataset.
Thank you for your time. With best regards Hai Nguyen.
Hi Hai,
Glad to hear from you again.
Yes. Our experiments also revealed that the performance of LSTM/GRU and their bidirectional forms on the SARD dataset is very similar (no statistically significant difference).
We also believe that the patterns of the artificially synthesized samples are much easier for the neural models to capture than those of the real-world samples.
The hard part is to accurately detect the vulnerabilities on the real-world projects. Also, the binary-level detection is challenging. Currently, I am facing difficulties in obtaining more labeled real-world data. I am also seeking a neural network structure that can achieve much better detection results, which requires time and effort to explore.
Best regards,
Daniel
Dear Mr. Daniel, Thank you for your reply.
Building up labeled real-world data is a tough task, since the supervised models require a larger dataset. I will try out different model structures on the Nine-projects dataset and let you know when the results come out.
Please let me know if there is anything I can contribute to extending the benchmark, for example, collecting more data or implementing new model structures. I look forward to reading your upcoming work.
Thank you for your time. With best regards, Hai Nguyen.
Dear Sir, Great work! Thank you for sharing the project. I have been following your work for a while, and it would be great to learn how to extend this benchmark to other neural networks. Also, I ran into a bit of trouble when running step 1; the log is posted below. After training the Word2Vec model, it moves on to the ELMo model and somehow gets stuck there. Could you suggest what might be wrong here?