blender-nlp / MolT5

Associated Repository for "Translation between Molecules and Natural Language"
BSD 3-Clause "New" or "Revised" License

About the molecule encoding #8

Closed QizhiPei closed 1 year ago

QizhiPei commented 1 year ago

Hi,

I really appreciate your nice work.

My questions are about the encoding method/dictionary/tokenizer used for molecules.

After checking the HuggingFace tokenizer for MolT5, it seems that MolT5 shares the same tokenizer as the original T5. This means that the carbon atom "C" in a molecule shares the same embedding as the capital English letter "C" in text, and that SMILES strings are tokenized with a T5 text tokenizer.
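
For reference, a quick way to compare the two tokenizers (the `laituan245/molt5-base` model ID below is one of the released checkpoints; this is a minimal sketch of the check, not code from the repo):

```python
from transformers import AutoTokenizer

# Load both tokenizers from the Hugging Face Hub.
t5_tok = AutoTokenizer.from_pretrained("t5-base")
molt5_tok = AutoTokenizer.from_pretrained("laituan245/molt5-base")

# If the vocabularies are identical, every SMILES character maps to the
# same token id (and hence the same embedding slot) as in plain T5 text.
print(t5_tok.get_vocab() == molt5_tok.get_vocab())
```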

So what is the insight behind sharing embeddings between molecule atoms and text? Are there many <unk> tokens under such an encoding method?

Thanks a lot!

cnedwards commented 1 year ago

Hi,

Thanks for your question!

Yes, we used the original T5 tokenizer. We chose to do this because we wanted to take advantage of the pretraining that had gone into T5. An intuition here is that molecules can benefit from natural language pretraining (e.g. nonsense corpus pretraining). Additionally, transformer architectures are capable of handling polysemous tokens because of contextualized representations. However, in future work, using a hybrid tokenizer (different tokens for language and molecules) may (and likely will) be beneficial.

Since the default tokenizer is based on SentencePiece and SMILES strings use fairly common characters, I believe that <unk> tokens are fairly uncommon, although I haven't quantitatively evaluated the distribution.

For example, C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O uses no <unk> tokens. It is tokenized as: ['C', '1', '=', 'CC', '2', '=', 'C', '(', 'C', '(', '=', 'C', '1)', '[', 'O', '-', ']', ')', 'NC', '(', '=', 'CC', '2', '=', 'O', ')', 'C', '(', '=', 'O', ')', 'O']
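
A minimal sketch for reproducing this check (assuming the `laituan245/molt5-base` checkpoint; any T5 tokenizer should yield the same pieces, modulo the SentencePiece "▁" prefix on the first piece):

```python
from transformers import AutoTokenizer

# The MolT5 tokenizer is the original T5 SentencePiece tokenizer.
tokenizer = AutoTokenizer.from_pretrained("laituan245/molt5-base")

smiles = "C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O"
tokens = tokenizer.tokenize(smiles)
print(tokens)

# Count how many pieces fall back to the <unk> id.
ids = tokenizer.convert_tokens_to_ids(tokens)
n_unk = sum(1 for i in ids if i == tokenizer.unk_token_id)
print(f"<unk> tokens: {n_unk}")  # expected: 0 for this SMILES
```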

QizhiPei commented 1 year ago

Hi,

Thanks again for your quick reply!