binaryai / CodeCMR

dataset misses source code #3

Closed island255 closed 3 years ago

island255 commented 3 years ago

The dataset description states:

Each dataset has 33 columns, the first column is the source code, the other columns are the corresponding binary code on 32 combinations of different compilers (gcc/clang), different platforms (x86/x64/arm/arm64) and different optimizations (O0/O1/O2/O3).

However, the released dataset, which can be downloaded from Google Cloud, seems to contain only the 32 binaries and no source code.

Without the source code, follow-up work is hard to carry out and compare.

Please upload the source code to make the dataset complete, if possible.

zepingyu0512 commented 3 years ago

Hello, the corresponding source code is in the .pkl files. To open them:

    import pandas as pd
    df = pd.read_pickle('test.pkl')
    sample = df.iloc[0]
    src = sample['c_label']
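A slightly fuller sketch of the same idea, assuming each .pkl file is a pickled pandas DataFrame with a `c_label` column as in the snippet above. The helper name `load_sources` is hypothetical and not part of the repository:

```python
import pandas as pd


def load_sources(pkl_path, source_column="c_label"):
    """Return the source-code column of a pickled DataFrame as a list.

    `source_column` defaults to 'c_label', the column name given in this
    thread; the helper itself is an illustrative assumption.
    """
    df = pd.read_pickle(pkl_path)
    if source_column not in df.columns:
        # Surface the real column names so the right one is easy to spot.
        raise KeyError(
            f"{source_column!r} not found; available columns: {list(df.columns)}"
        )
    return df[source_column].tolist()
```

For example, `load_sources('test.pkl')[0]` should give the same source string as `src` above. Note that `pd.read_pickle` can execute arbitrary code while unpickling, so only load files from a source you trust.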

island255 commented 3 years ago

Got it, thanks.

island255 commented 3 years ago

I have looked through the source code in the .pkl file. Every entry seems to be a single function without its surrounding context, so I wonder how these can be compiled into binaries. I would also like to know how you collected them and where they come from.

If you could open-source the complete version of the code for these functions so that they can be compiled, I would be very grateful.

nforest commented 3 years ago

Actually, we compiled lots of open-source projects into binaries and split the source/binary files into the corresponding function-level pairs.

We don't have a timeline to open-source this part of our work. However, there are no magic tricks inside, just normal source compilation and preprocessing.
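For reference, the 32 binary columns mentioned in the dataset description are the cross product of compiler, platform, and optimization level. A minimal sketch of enumerating those build configurations follows; the variable names and the label format are illustrative assumptions, not the actual pipeline or the dataset's real column names:

```python
from itertools import product

# Values taken from the dataset description quoted earlier in this thread.
COMPILERS = ["gcc", "clang"]
PLATFORMS = ["x86", "x64", "arm", "arm64"]
OPT_LEVELS = ["O0", "O1", "O2", "O3"]


def build_configs():
    """Return every (compiler, platform, optimization) combination."""
    return list(product(COMPILERS, PLATFORMS, OPT_LEVELS))


def config_label(config):
    """Join a configuration into a label such as 'gcc-x86-O0' (hypothetical format)."""
    return "-".join(config)
```

`len(build_configs())` is 2 × 4 × 4 = 32, matching the number of binary columns per source function.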

island255 commented 3 years ago

Actually, we are working on a similar source-to-binary code matching task. However, our method needs the full source files, because the source of a single function cannot be analyzed in isolation. Could you open-source the full source files or send them to me? That would help us carry out a comparison experiment.

nforest commented 3 years ago

In our experiment, all the source code comes from packages in well-known Linux distributions, such as http://mirrors.ustc.edu.cn/debian/ .

island255 commented 3 years ago

Thanks.