BigCloneBench dataset issue

Hi there,

I would like to know how you obtained the identifier names for the different programming statement categories mentioned in your paper. Did you run any scripts on the BigCloneBench dataset to extract them? For example, when I run get_substitutes.py, I can see that the identifiers stored in the "data.csv" file are used to create the substitutions for the different programming statement categories. Please have a look at the following code snippet.

            identifiers = ""
            try:
                iden, code_tokens = get_identifiers(remove_comments_and_docstrings(item["func"], "java"),
                                                           "java")
            except:
                iden, code_tokens = get_identifiers(item["func"], "java")
            processed_code = " ".join(code_tokens)

            words, sub_words, keys = _tokenize(processed_code, tokenizer_mlm)

            for index in range(len(idx_list)):
                if int(item["idx"]) == idx_list[index]:
                    identifiers = All_list[index].replace(" ", "").strip('[').strip(']').split(',')
                    identifiers = [] if identifiers == [''] else identifiers
            variable_names = identifiers
            item["identifiers"] = variable_names

Here, the identifiers returned by get_identifiers (iden) are never stored in variable_names. Instead, variable_names is filled from All_list, which appears to hold the precomputed identifiers from "data.csv".
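
To make the question concrete, here is a minimal sketch of what I understand the lookup loop to do. The contents of idx_list and All_list below are made up by me (I am assuming they are loaded from "data.csv"), and lookup_identifiers is just a hypothetical wrapper around the loop, not a function from the repository.

```python
# Hypothetical stand-ins for the values I assume are loaded from data.csv.
idx_list = [101, 102, 103]
All_list = ["[conn, url, reader]", "[]", "[buf, i]"]


def lookup_identifiers(item_idx, idx_list, All_list):
    """Mirror the loop from the snippet: parse the stored string for item_idx."""
    identifiers = ""
    for index in range(len(idx_list)):
        if int(item_idx) == idx_list[index]:
            # The stored value is a stringified list such as "[a, b, c]".
            identifiers = All_list[index].replace(" ", "").strip('[').strip(']').split(',')
            # An empty stored list "[]" parses to [''], so normalise it to [].
            identifiers = [] if identifiers == [''] else identifiers
    return identifiers


print(repr(lookup_identifiers("101", idx_list, All_list)))  # matching idx, non-empty list
print(repr(lookup_identifiers("102", idx_list, All_list)))  # matching idx, empty list
print(repr(lookup_identifiers("999", idx_list, All_list)))  # no matching idx
```

In this sketch the result depends only on the precomputed strings, so iden from get_identifiers is never used. That is why I am asking how those strings were produced in the first place.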

Could you please share your thoughts on this issue? Also, if I want to test the beam attack approach on another clone detection dataset (e.g., SemanticCloneBench), how should I preprocess it to get the same format as your BigCloneBench dataset?
