Hustcw / CLAP

CLAP (Contrastive Language-Assembly Pre-training) learns transferable binary code representations with natural language supervision.

question about dataset #3

Open superway117 opened 3 months ago

superway117 commented 3 months ago

About this point: "Utilizing a dataset engine capable of automatically generating 195 million pairs of code snippets and their descriptions"

  1. Where can I find this dataset?
  2. What is the dataset engine? Thanks.

Hustcw commented 3 months ago

Sorry, we have only released the pre-trained model so far. You can find a description of the dataset engine in Section 3.1 of our paper.

superway117 commented 3 months ago

I want to reproduce your work on the dataset. I would appreciate it if you could share some demo data from the dataset or provide the script used to build it.

Hustcw commented 3 months ago

The compilation pipeline is complicated and not ready to be open-sourced. I could provide some demo data and the scripts that query an LLM for explanations when I get some free time :) Sorry about that.