binaryai / sdk

Get results of binaryai.cn using our SDK
https://www.binaryai.cn/doc/
GNU General Public License v3.0

Trying to upload a source function with "upload_function" #116

Closed (island255 closed this issue 3 years ago)

island255 commented 3 years ago

Hi, I'm using the command-line tool of v0.2.8 and trying to create a function set. I want to upload source files and source functions, but it seems that I can only upload binary functions. When I look into function.py, the function "upload_function" has some arguments related to source code. How can I use them in version 0.2.8? Or will they be usable in future versions?

davendu commented 3 years ago

Hi. I think by mentioning upload_function you mean this: https://github.com/binaryai/sdk/blob/v0.2.8/binaryai/function.py#L7

def upload_function(
        client,
        name,
        feature,
        source_code=None,
        source_file=None,
        source_line=None,
        language=None,
        funcset_id=None
):
    ...  # body omitted here; see the linked function.py

In this version, all the optional arguments only provide additional information for when you re-query a function you have uploaded; they work like metadata. What is actually used for vectorizing and indexing is feature.

As for generating the feature, see binaryai/ida.py#L175: it is generated from the decompiled function's ctree.

We cannot promise that the current version's feature generation will remain usable in future versions, but if we introduce another version of the feature, generating it should be no harder than it is now.

island255 commented 3 years ago

Thanks for your help. As I understand it, the only useful part is the feature, which is extracted from the pseudocode generated by decompiling the binary and acts as the source code.

But this seems to differ from the process used for the experiments in the CodeCMR paper, where, as I saw, the matched functions are source code from real projects. So I wonder whether, in the future, we will be able to construct a source dataset directly from the real source code of projects.

island255 commented 3 years ago

Also, I found that some queries return two identical source functions with different function ids. It seems the query searches many function sets that contain the same functions. Is there a unique and complete funcset that I can use? As far as I know, no official funcset is indicated when querying the public dataset.

nforest commented 3 years ago

> Thanks for your help. As I understand it, the only useful part is the feature, which is extracted from the pseudocode generated by decompiling the binary and acts as the source code.
>
> But this seems to differ from the process used for the experiments in the CodeCMR paper, where, as I saw, the matched functions are source code from real projects. So I wonder whether, in the future, we will be able to construct a source dataset directly from the real source code of projects.

Yes, the feature used in v3.0 is pseudocode, because this project has continued to evolve since we published the CodeCMR paper. I'm not sure what you mean by "a source dataset". If you mean queries built from real source code, the answer is yes: you can compare source functions.

island255 commented 3 years ago

> Yes, the feature used in v3.0 is pseudocode, because this project has continued to evolve since we published the CodeCMR paper. I'm not sure what you mean by "a source dataset". If you mean queries built from real source code, the answer is yes: you can compare source functions.

Actually, what I mean is that at present I can only create a dataset by decompiling binaries, but I want to create a dataset from a source project. The task is still binary2source matching: the source project dataset is what a binary query searches against. The current approach feels more like a binary2binary matching task.

nforest commented 3 years ago

> Actually, what I mean is that at present I can only create a dataset by decompiling binaries, but I want to create a dataset from a source project. The task is still binary2source matching: the source project dataset is what a binary query searches against. The current approach feels more like a binary2binary matching task.

It is a binary2source matching task. Please try parsing the source code into functions with libclang/tree-sitter and comparing the pseudocode with the source code.
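
As a rough illustration of the source-side step suggested above, the sketch below splits a C file into top-level functions using only the Python standard library. It is a naive regex plus brace-counting stand-in, not libclang or tree-sitter: it assumes the return type and the function name share a line, and braces inside string literals or comments will confuse it. The names `FUNC_HEAD`, `extract_functions`, and `SAMPLE` are made up for this example; a real parser is the better choice for real projects.

```python
import re

# Naive pattern for the head of a top-level C function definition:
# a return type, the function name, a parameter list, and an opening brace.
FUNC_HEAD = re.compile(
    r"^[A-Za-z_][\w \t\*]*?\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*\{",
    re.MULTILINE,
)
CONTROL_KEYWORDS = {"if", "for", "while", "switch"}


def extract_functions(source):
    """Return a list of (name, start_line, text), one per function found."""
    functions = []
    pos = 0
    while True:
        match = FUNC_HEAD.search(source, pos)
        if match is None:
            break
        # Walk forward from the opening brace until the braces balance.
        depth = 0
        end = None
        for i in range(match.end() - 1, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces: stop rather than guess
        name = match.group(1)
        if name not in CONTROL_KEYWORDS:
            start_line = source.count("\n", 0, match.start()) + 1
            functions.append((name, start_line, source[match.start():end]))
        pos = end  # continue searching after this function body
    return functions


# Small made-up input to exercise the extractor.
SAMPLE = r"""#include <stdio.h>

static int add(int a, int b)
{
    return a + b;
}

int main(void) {
    if (add(1, 2) == 3) {
        printf("ok\n");
    }
    return 0;
}
"""

for name, line, text in extract_functions(SAMPLE):
    print(name, line, len(text))
```

Each extracted function text would then be the unit to compare against pseudocode. How that comparison is submitted depends on the SDK version and is not shown here.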

island255 commented 3 years ago

> It is a binary2source matching task. Please try parsing the source code into functions with libclang/tree-sitter and comparing the pseudocode with the source code.

I still have one more question, which I mentioned above.

I found that some queries return two identical source functions with different function ids. It seems the query searches many function sets that contain the same functions. Is there a unique and complete funcset that I can use? As far as I know, no official funcset is indicated when querying the public dataset.

nforest commented 3 years ago

> Also, I found that some queries return two identical source functions with different function ids. It seems the query searches many function sets that contain the same functions. Is there a unique and complete funcset that I can use? As far as I know, no official funcset is indicated when querying the public dataset.

Identical function text with different ids is known behavior, and we have no plans to improve it in the near term due to heavy workload. Would de-duplicating the results on the client side work as a workaround?

island255 commented 3 years ago

> Identical function text with different ids is known behavior, and we have no plans to improve it in the near term due to heavy workload. Would de-duplicating the results on the client side work as a workaround?

Yes, it can be done, though getting a unique Top-K result costs a little more.
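
For anyone reading later, a minimal sketch of that client-side de-duplication might look like the following. It assumes the query results arrive as a list of dicts already sorted best-first; the keys `id`, `score`, and `source_code` and the helper name `dedupe_top_k` are hypothetical and not taken from the SDK, so they would need adapting to whatever the query actually returns.

```python
def dedupe_top_k(results, k, text_key="source_code"):
    """Keep the first (best-ranked) hit for each distinct function text."""
    seen = set()
    unique = []
    for hit in results:
        if len(unique) >= k:
            break
        # Collapse whitespace so reformatted copies of the same text match.
        fingerprint = " ".join(hit[text_key].split())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(hit)
    return unique


# Made-up results: the first two hits share the same text under different ids.
hits = [
    {"id": "a1", "score": 0.97, "source_code": "int add(int a, int b) { return a + b; }"},
    {"id": "b7", "score": 0.97, "source_code": "int add(int a, int b) {\n    return a + b;\n}"},
    {"id": "c3", "score": 0.91, "source_code": "int sub(int a, int b) { return a - b; }"},
    {"id": "d9", "score": 0.85, "source_code": "int mul(int a, int b) { return a * b; }"},
]

print([hit["id"] for hit in dedupe_top_k(hits, 2)])
```

The extra cost is that more than K raw results have to be requested to be sure K unique ones remain after filtering.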

Finally, this is outstanding work on binary2source matching! Thanks again for your help!