Open hehehwang opened 2 years ago
It's interesting, but I'm not sure that this is the same return. The code was tokenized by a parser, so it should handle different indentations. My guess is that there are different sorts of string literals with return\n inside.
I don't understand what "different sorts of string literals with return\n inside" means, but I found lots of '*\n' tokens in vocabulary.pkl.
for example,
'EMPTY\n': 11459,
'
Lots of the 'token' tokens are mixed with '\n'. I assume the vocabulary parser is reading the end of each line together with the last token.
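One way to check how widespread this is would be to list every vocabulary entry that ends with a newline. The sketch below is only an illustration: it assumes the token vocabulary can be treated as a plain dict mapping token to count, which may not match the real layout of vocabulary.pkl, and the helper names and toy counts are made up for the example.

```python
from collections import Counter


def find_newline_tokens(vocab):
    """Return the tokens whose text ends with a newline, with their counts."""
    return {tok: n for tok, n in vocab.items() if tok.endswith("\n")}


def merge_newline_tokens(vocab):
    """Fold the count of each 'tok\\n' entry into the count of 'tok'."""
    merged = Counter()
    for tok, n in vocab.items():
        merged[tok.rstrip("\n")] += n
    return merged


# Toy counts for illustration only, not real dataset statistics.
toy_vocab = {"return": 10, "return\n": 2, "EMPTY\n": 3, "value": 5}

print(find_newline_tokens(toy_vocab))
print(merge_newline_tokens(toy_vocab))
```

If the real vocabulary has a different structure (for example several sub-vocabularies in one pickle), the same check would need to be applied to whichever mapping holds the 'token' counts.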
Yeah, that seems strange. I will investigate why the parser extracted tokens with newline characters at the end.
It seems that the counter in the vocabulary is counting 'token' tokens with a trailing newline character as separate entries. For example, in vocabulary.pkl for the java-small dataset, I can find 'return': 6020684 and 'return\n': 33290 separately.
I personally fixed this problem by stripping path_context in Vocabulary._process_raw_sample, but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
thank you!
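For anyone hitting the same thing, here is a minimal sketch of why stripping helps. It is not the actual Vocabulary._process_raw_sample code. It assumes each raw sample is one text line of the form "label start,path,end start,path,end ..." terminated by '\n', and parse_sample is a hypothetical helper written only for this example. Under that assumption, only the end token of the last path context on each line picks up the newline, which would also be consistent with 'return\n' being much rarer than 'return'.

```python
def parse_sample(raw_line, strip=True):
    """Split one raw line into a label and a list of (start, path, end) triples.

    Assumed format (not confirmed by the project):
    'label start,path,end start,path,end ...\\n'
    """
    if strip:
        # Remove the trailing newline before splitting on spaces.
        raw_line = raw_line.strip()
    label, *contexts = raw_line.split(" ")
    triples = [tuple(ctx.split(",")) for ctx in contexts]
    return label, triples


# Made-up sample line; P1 and P2 stand in for real path strings.
line = "get|value this,P1,value value,P2,return\n"

_, unstripped = parse_sample(line, strip=False)
_, stripped = parse_sample(line, strip=True)

print(repr(unstripped[-1][2]))  # 'return\n'
print(repr(stripped[-1][2]))    # 'return'
```

Whether the newline-suffixed tokens are intended is still a question for the maintainers. The sketch only shows how they can appear if lines are read without stripping.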