JetBrains-Research / code2seq

PyTorch implementation of the code2seq model.
MIT License

'\n' mixed in Vocabulary['token'] #111

Open hehehwang opened 2 years ago

hehehwang commented 2 years ago

It seems that the vocabulary counter is counting 'token' tokens that end with a newline character. For example, in vocabulary.pkl for the java-small dataset, I can find 'return': 6020684 and 'return\n': 33290 as separate entries.
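For reference, a minimal sketch of how this can be checked; it assumes vocabulary.pkl unpickles to a dict holding a Counter under a "token" key, which may not match the repo's exact layout:

```python
# Hypothetical inspection script: list vocabulary tokens ending with "\n"
# alongside the count of their clean counterpart.
import pickle
from collections import Counter

with open("vocabulary.pkl", "rb") as f:
    vocab = pickle.load(f)  # assumed: dict with a Counter under "token"

token_counter: Counter = vocab["token"]

for token, count in token_counter.items():
    if token.endswith("\n"):
        clean = token.rstrip("\n")
        print(repr(token), count, "| clean form count:", token_counter.get(clean, 0))
```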

I personally fixed this problem by stripping the path context in Vocabulary._process_raw_sample, but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
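A sketch of that workaround might look like the following; the real signature of Vocabulary._process_raw_sample and the counter layout in the repo may differ, so treat this as illustrative only:

```python
# Hypothetical sketch: strip the line terminator before splitting a raw
# sample into its label and path contexts, so the last token of the last
# context no longer carries "\n".
from collections import Counter

def _process_raw_sample(raw_sample: str, token_counter: Counter) -> None:
    # A .c2s line looks roughly like: "<label> <from,path,to> <from,path,to> ..."
    label, *path_contexts = raw_sample.rstrip("\n").split(" ")
    for path_context in path_contexts:
        from_token, _path, to_token = path_context.split(",")
        token_counter.update([from_token, to_token])
```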

thank you!

SpirinEgor commented 2 years ago

That's interesting, but I'm not sure this is the same return. The code was tokenized by a parser, so it should handle different indentation. My guess is that there are different sorts of string literals with return\n inside.

hehehwang commented 2 years ago

I don't understand what "different sorts of string literals with return\n inside" means, but I could find lots of '*\n' tokens in vocabulary.pkl.

for example, 'EMPTY\n': 11459, '\n': 11416, 'if\n': 6900, 'exception\n': 6624, ...

Lots of the 'token' tokens are mixed with '\n', which makes me think the vocabulary parser is picking up the end of each line.
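That hypothesis is easy to demonstrate in isolation: if a raw line is split on spaces without stripping its terminator, the final path context keeps the "\n", and its last token enters the vocabulary with the newline attached. The sample line below is made up for illustration:

```python
# A made-up .c2s-style line; note the trailing "\n" left by readline().
line = "label foo,PATH,return bar,PATH,if\n"

*_, last_context = line.split(" ")   # the last context keeps the terminator
print(last_context.split(","))       # ['bar', 'PATH', 'if\n']  <- leaked "\n"
```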

SpirinEgor commented 2 years ago

Yeah, that seems strange. I will investigate why the parser extracted tokens with newline characters at the end.