ChangwenXu98 / TransPolymer

Implementation of "TransPolymer: a Transformer-based language model for polymer property predictions" in PyTorch
MIT License
53 stars 19 forks

Question about Tokenizer #14

Closed: himanshisyadav closed this issue 5 months ago

himanshisyadav commented 8 months ago

Why does the tokenizer tokenize Cl as just C?

sequence = "[O-][Cl+3]([O-])([O-])[O-]"
print(tokenizer.tokenize(sequence))
['[', 'O', '-', ']', '[', 'C', '+', '3', ']', '(', '[', 'O', '-', ']', ')', '(', '[', 'O', '-', ']', ')', '[', 'O', '-', ']']

However, the tokenizer behaves differently for Li and Br: each two-letter symbol is kept as a single token.

sequence = "[Li+].[Br-]"
print(tokenizer.tokenize(sequence))
['[', 'Li', '+', ']', '.', '[', 'Br', '-', ']']

Thank you for your help!

ChangwenXu98 commented 8 months ago

Hi @himanshisyadav, I tried your example and Cl is tokenized as Cl, not C.

sequence = "[O-][Cl+3]([O-])([O-])[O-]"
print(tokenizer.tokenize(sequence))
['[', 'O', '-', ']', '[', 'Cl', '+', '3', ']', '(', '[', 'O', '-', ']', ')', '(', '[', 'O', '-', ']', ')', '[', 'O', '-', ']']

Maybe you can check the code where your tokenizer is imported and defined. Let me know if you still have any questions.
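
One way to check where a tokenizer object is imported and defined is to ask Python which class and source file it comes from. The helper below is a minimal sketch: describe_origin is a hypothetical name, not part of TransPolymer, and json.JSONDecoder only stands in for the tokenizer so the snippet runs on its own.

```python
import inspect
import json


def describe_origin(obj):
    """Report the class, module, and source file that an object comes from."""
    cls = type(obj)
    try:
        source = inspect.getsourcefile(cls)
    except (TypeError, OSError):
        # Built-in or compiled classes have no Python source file.
        source = None
    return {"class": cls.__qualname__, "module": cls.__module__, "source": source}


# Stand-in object; in the setting of this thread you would pass `tokenizer`.
info = describe_origin(json.JSONDecoder())
print(info["class"], info["module"], info["source"])
```

If the reported module or file is not the one you expected, a different tokenizer class is probably being picked up.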

himanshisyadav commented 8 months ago

@ChangwenXu98 Thank you for your quick response! I'll check my code.
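
For readers who hit the same symptom, the sketch below shows one way a regex-based tokenizer can produce both outputs seen in this thread. The two patterns are hypothetical and written only for illustration; they are not TransPolymer's actual tokenizer pattern, and the thread does not confirm the cause of the original discrepancy. The point is that re.findall silently skips characters no alternative matches, so a pattern without "Cl" (and without a lowercase fallback) emits "C" and drops the "l".

```python
import re

# Hypothetical patterns, for illustration only.
# Two-letter symbols are listed before single letters so they match first.
PATTERN_WITH_CL = re.compile(r"Cl|Br|Li|[A-Z]|[a-z]|\d|[\[\]()+\-.=#]")
# Same idea, but "Cl" is missing and there is no lowercase fallback.
PATTERN_WITHOUT_CL = re.compile(r"Br|Li|[A-Z]|\d|[\[\]()+\-.=#]")


def tokenize(smiles, pattern):
    """Split a SMILES string into tokens; unmatched characters are skipped."""
    return pattern.findall(smiles)


sequence = "[O-][Cl+3]([O-])([O-])[O-]"
with_cl = tokenize(sequence, PATTERN_WITH_CL)
without_cl = tokenize(sequence, PATTERN_WITHOUT_CL)
print(with_cl)     # contains 'Cl' as one token
print(without_cl)  # contains 'C'; the 'l' is dropped
print(tokenize("[Li+].[Br-]", PATTERN_WITHOUT_CL))
```

A quick sanity check for any tokenizer of this kind is to compare "".join(tokens) with the input string: if they differ, some characters were dropped.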