bayesgroup / code_transformers

Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

bug in generating data #4

Closed jxzhn closed 2 years ago

jxzhn commented 2 years ago

I think line 112 (in the function separate_dps) of cc/main/src/utils/utils.py should be

        aug_asts.append([ast[i : i + max_len], i + half_len])

instead of

        aug_asts.append([ast[i : i + max_len], half_len])
serjtroshin commented 2 years ago

Hi! half_len here denotes that the first half of the snippet will be processed as memory, while the last part is used for loss evaluation and prediction, so it seems there is no bug.
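
For context, here is a minimal sketch of separate_dps assembled from the lines quoted in this thread; the initialization and loop setup are inferred from the upstream code-prediction-transformer implementation (linked below), so treat it as an approximation rather than a verbatim copy of the repo's code:

    def separate_dps(ast, max_len):
        # Split a long AST (list of nodes in pre-order) into overlapping
        # chunks of length max_len, sliding by max_len / 2 each step.
        half_len = int(max_len / 2)
        if len(ast) <= max_len:
            return [[ast, 0]]

        aug_asts = [[ast[:max_len], 0]]  # first chunk: no memory, all predicted
        i = half_len
        while i < len(ast) - max_len:
            # line 112: intermediate chunks reuse their first half as memory
            aug_asts.append([ast[i : i + max_len], half_len])
            i += half_len
        # lines 114-115: last chunk; idx marks where the unseen tail begins
        idx = max_len - (len(ast) - (i + half_len))
        aug_asts.append([ast[-max_len:], idx])
        return aug_asts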

jxzhn commented 2 years ago

But this is inconsistent with the comment on lines 87 to 104:

    """
    Handles training / evaluation on long ASTs by splitting
    them into smaller ASTs of length max_len, with a sliding
    window of max_len / 2.

    Example: for an AST ast with length 1700, and max_len = 1000,
    the output will be:
    [[ast[0:1000], 0], [ast[500:1500], 1000], [ast[700:1700], 1500]]

    Input:
        ast : List[Dictionary]
            List of nodes in pre-order traversal.
        max_len : int

    Output:
        aug_asts : List[List[List, int]]
            List of (ast, beginning idx of unseen nodes)
    """

It is also inconsistent with lines 114 to 115:

    idx = max_len - (len(ast) - (i + half_len))
    aug_asts.append([ast[-max_len:], idx])
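
As a concrete check (my own arithmetic, assuming the loop structure sketched above): for the docstring's example with len(ast) = 1700 and max_len = 1000, the loop exits with i = 1000, so the last chunk gets idx = 1000 - (1700 - (1000 + 500)) = 800 rather than the 1500 listed in the docstring, and the intermediate chunk ast[500:1500] gets half_len = 500 rather than 1000.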
serjtroshin commented 2 years ago

Oh, I see, there seems to be an issue with the comment. The function's intent is described here: https://github.com/facebookresearch/code-prediction-transformer#splitting-large-trees

The function seems to be correct, but it does not follow the comments on lines 87 to 104: separate_dps(range(2022), 500) produces [[range(0, 500), 0], [range(250, 750), 250], [range(500, 1000), 250], [range(750, 1250), 250], [range(1000, 1500), 250], [range(1250, 1750), 250], [range(1500, 2000), 250], [range(1522, 2022), 478]]

The values 0, 250, and 478 denote the index within each chunk where the tokens to be predicted start: 250 means we use the first 250 tokens of the chunk as memory. The first block, range(0, 500), is used entirely for prediction (no memory); for the intermediate blocks the first half is used as memory; and for the last block, range(1522, 2022), the model uses the elements range(1522, 2000) as memory and makes predictions for the remaining elements range(2000, 2022). I hope this makes it clearer, thank you @jxzhn!
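
To make that concrete, here is a small illustration (my own sketch, not code from the repo) of how each [chunk, idx] pair would be consumed downstream, using the output listed above:

    # Each pair is [chunk, idx]: chunk[:idx] is memory, chunk[idx:] is predicted.
    pairs = [
        [range(0, 500), 0],
        [range(250, 750), 250],
        [range(500, 1000), 250],
        [range(750, 1250), 250],
        [range(1000, 1500), 250],
        [range(1250, 1750), 250],
        [range(1500, 2000), 250],
        [range(1522, 2022), 478],
    ]
    for chunk, idx in pairs:
        chunk = list(chunk)
        memory, predicted = chunk[:idx], chunk[idx:]
        print(f"memory: {len(memory):3d} tokens | "
              f"predicted: {predicted[0]}..{predicted[-1]} ({len(predicted)} tokens)")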

jxzhn commented 2 years ago

Got it, thanks!