arthurherbout / crypto_code_detection

Automatic Detection of Custom Cryptographic C Code

Github Code Search Challenge / Transfer Learning #8

Closed arnaudstiegler closed 4 years ago

arnaudstiegler commented 4 years ago

GitHub recently launched (i.e. yesterday!) a Code Search Challenge. It tackles semantic code search, which is slightly different from what we do: the aim is to retrieve functions that correspond to a natural-language query given by the user. Here is a good explanation of semantic code search.

Of course, this is off-topic for us, but there are some interesting features that we could potentially leverage for our own project:

I think the last point is the most interesting and could be very big for us: thanks to those provided baseline models, we could do some transfer learning, and this would allow us to use deep learning even though we only have a small amount of data. On top of that, we could be using the latest architecture for text representation (BERT with self-attention) which is exciting as well!
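To make the transfer-learning idea a bit more concrete, here is a minimal sketch of the "frozen encoder plus small trained head" pattern. Everything in it is an assumption for illustration: `frozen_encoder` is a hashed bag-of-tokens stand-in, not one of the challenge's baseline models, and the four C snippets and their labels are made up. In a real setup the encoder would be the pretrained model and only the head would be fit on our small labeled set.

```python
import math
import re
import zlib

DIM = 256  # size of the stand-in embedding (arbitrary choice)


def frozen_encoder(code):
    """Hypothetical stand-in for a pretrained code encoder.

    It is never updated during training, which is the point of the pattern.
    A real setup would call a pretrained model here instead.
    """
    vec = [0.0] * DIM
    # Split into identifiers and single non-space characters.
    for tok in re.findall(r"[A-Za-z_]\w*|\S", code):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(snippets, labels, epochs=200, lr=0.5):
    """Fit only a logistic-regression head on top of the frozen features."""
    feats = [frozen_encoder(s) for s in snippets]  # computed once, never changed
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(head, code):
    """Return the estimated probability that a snippet is crypto code."""
    w, b = head
    x = frozen_encoder(code)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Made-up toy data: 1 = crypto-like C code, 0 = ordinary C code.
SNIPPETS = [
    "x ^= key[i]; x = (x << 3) | (x >> 29);",
    'printf("hello\\n"); return 0;',
    "state[i] = sbox[state[i]] ^ round_key[i];",
    "for (i = 0; i < n; i++) sum += a[i];",
]
LABELS = [1, 0, 1, 0]

head = train_head(SNIPPETS, LABELS)
for snippet, label in zip(SNIPPETS, LABELS):
    print(label, round(predict(head, snippet), 3), snippet)
```

The design choice here is that only `w` and `b` are learned, so the number of trainable parameters stays small relative to our dataset. Whether freezing the encoder or fine-tuning it works better for our data is something we would have to test.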

Drop a comment to tell me what you think about this!

arthurherbout commented 4 years ago

Looks very interesting!

Also, getting a dataset of GitHub repos for free is quite nice. I will try to get the dataset tonight (I have to fight with a Docker installation for the first time first :)). I will keep you posted!

arthurherbout commented 4 years ago

Well, I didn't succeed in installing Docker because of CUDA 10.1. Something is strange about my config files (I use Ubuntu 18.04). The Docker part itself is easy; you guys could get into it pretty easily. It was pretty scary: I thought I might get a black screen after rebooting, since you are supposed to manage everything yourself in Ubuntu... I do not feel very good about trying it again today... If we do not get it before the end of the week, I will try again.

arnaudstiegler commented 4 years ago

@arthurherbout

The Docker part is really just for running the existing pre-trained models on their own task (which is not ours) and submitting to the competition. If the goal is just to get the dataset, I think you only have to run the download_dataset.py file (because that's ultimately what the Docker setup does, and it's basic Python).

And even for re-using their model, I think it might not be worth spending time on Docker (unless we have a Docker expert in the group!!!). They use Docker only because they host a competition.