CGCL-codes / Attack_PTMC

The dataset, source code and the results of our ESEC/FSE 2023 paper "An Extensive Study on Adversarial Attack against Pre-trained Models of Code".
MIT License

BigCloneBench dataset issue #5

Closed: chunkify closed this issue 4 months ago

chunkify commented 5 months ago

Hi there,

I would like to know how you obtained the identifier names for the different programming statement categories mentioned in your paper. Did you run a script on the BigCloneBench dataset to extract them? For example, when I run get_substitutes.py, I can see that the identifiers stored in the "data.csv" file are used to create the substitutions for the different programming statement categories. Please have a look at the following code snippet.

            identifiers = ""
            try:
                iden, code_tokens = get_identifiers(remove_comments_and_docstrings(item["func"], "java"),
                                                           "java")
            except:
                iden, code_tokens = get_identifiers(item["func"], "java")
            processed_code = " ".join(code_tokens)

            words, sub_words, keys = _tokenize(processed_code, tokenizer_mlm)

            for index in range(len(idx_list)):
                if int(item["idx"]) == idx_list[index]:
                    identifiers = All_list[index].replace(" ", "").strip('[').strip(']').split(',')
                    identifiers = [] if identifiers == [''] else identifiers
            variable_names = identifiers
            item["identifiers"] = variable_names

Here, the identifiers returned by get_identifiers (iden) are never stored in variable_names; variable_names is filled from All_list instead.
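To make the observation concrete, the lookup can be isolated as below. This is only a minimal sketch: it assumes each entry of All_list is a string of the form "[a, b, c]" (inferred from the strip/split calls, not confirmed), parse_identifier_list is a hypothetical helper name, and the sample idx_list and All_list values are made up for illustration.

```python
def parse_identifier_list(raw):
    """Mirror the parsing expression used in the snippet.

    Assumes `raw` looks like "[foo, bar]"; this format is inferred,
    not confirmed by the repository.
    """
    names = raw.replace(" ", "").strip("[").strip("]").split(",")
    return [] if names == [""] else names


# Hypothetical stand-ins for the idx_list and All_list built from data.csv.
idx_list = [7, 42]
All_list = ["[conn, stmt]", "[]"]

item = {"idx": "7", "func": "..."}

identifiers = ""
for index in range(len(idx_list)):
    if int(item["idx"]) == idx_list[index]:
        identifiers = parse_identifier_list(All_list[index])

# Only the precomputed list reaches the item; the result of
# get_identifiers plays no part in this assignment.
item["identifiers"] = identifiers
print(item["identifiers"])
```

Under these assumptions, the attack targets come entirely from the precomputed file, which is why the source of that file matters for any new dataset.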

Could you please share your thoughts on this? Also, if I want to test the beam attack approach on another clone detection dataset (e.g., SemanticCloneBench), how can I preprocess it into the same format as your BigCloneBench dataset?

xhdu commented 4 months ago

Thank you for your interest. Because it is not easy to extract identifiers from different statement types directly in Python, we used Spoon, a Java source code analysis tool, in our implementation. Since Spoon is implemented in Java, we first extracted the identifiers for each statement category into files and then carried out the attack. Unfortunately, too much time has passed and I can no longer find the extraction code. You can refer to the official Spoon website for implementation details: https://spoon.gforge.inria.fr.
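The two-step workflow described above (extract identifiers to a file with an external tool, then load them before the attack) could be reproduced for another dataset roughly as follows. This is a hedged sketch, not the original code: the export format (a CSV with "idx" and "identifiers" columns, names separated by spaces) and the helpers load_identifiers and attach_identifiers are assumptions made for illustration.

```python
import csv
import io

# Hypothetical export written by an external extractor such as a Spoon-based
# Java program: one row per function, identifier names separated by spaces.
EXPORT = """idx,identifiers
7,conn stmt
42,
"""


def load_identifiers(handle):
    """Map each function idx to the identifier names exported for it."""
    table = {}
    for row in csv.DictReader(handle):
        table[int(row["idx"])] = row["identifiers"].split()
    return table


def attach_identifiers(items, table):
    """Set item["identifiers"] from the table; unknown idx values get []."""
    for item in items:
        item["identifiers"] = table.get(int(item["idx"]), [])
    return items


table = load_identifiers(io.StringIO(EXPORT))
items = attach_identifiers(
    [{"idx": "7"}, {"idx": "42"}, {"idx": "99"}],
    table,
)
for item in items:
    print(item["idx"], item["identifiers"])
```

Keying the table by idx replaces the linear scan over idx_list with a dictionary lookup. In practice, io.StringIO(EXPORT) would be replaced by an open file handle on the real export.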