SMILES tokenizer ids - Githubissues

XinhaoLi74 / SmilesPE

SMILES Pair Encoding: A data-driven substructure representation of chemicals

https://xinhaoli74.github.io/SmilesPE/

Apache License 2.0

175 stars, 30 forks

SMILES tokenizer ids #2

Open: elemets opened 3 years ago

elemets commented 3 years ago

Hi,

Thanks for this project; it looks like it could be really helpful. Sorry if this is a stupid question, but once I've tokenized a set of SMILES using the pre-trained SMILES model, how would I get the token IDs?

Thanks,
A

lianghsun commented 3 years ago

You can get the ID of each token in a tokenized SMILES string from the tokenizer's bpe_codes and bpe_codes_reverse mappings:

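test_spe_word = spe.tokenize(...)  # space-separated string of tokens
for word in test_spe_word.split(' '):
    # bpe_codes_reverse: merged token -> pair; bpe_codes: pair -> index, used here as the ID
    print(spe.bpe_codes[spe.bpe_codes_reverse[word]])  # output ID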
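
If you would rather not depend on the tokenizer's internal attributes, one alternative is to build your own token-to-ID vocabulary from the tokenized output. The sketch below is only an illustration, not part of SmilesPE: build_vocab, encode, the special tokens and the sample strings are all hypothetical, and the only assumption it makes is that each tokenized SMILES is a space-separated string, as in the loop above.

```python
from typing import Dict, Iterable, List, Sequence


def build_vocab(tokenized: Iterable[str],
                specials: Sequence[str] = ("<pad>", "<unk>")) -> Dict[str, int]:
    """Assign an integer ID to every distinct token, in first-seen order."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for line in tokenized:
        for tok in line.split(" "):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(line: str, vocab: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Map a space-separated token string to IDs; unseen tokens get the <unk> ID."""
    return [vocab.get(tok, vocab[unk]) for tok in line.split(" ")]


# Hypothetical tokenizer output, made up for illustration only.
tokenized_smiles = ["CC (=O) O", "CC O"]
vocab = build_vocab(tokenized_smiles)
ids = [encode(s, vocab) for s in tokenized_smiles]
print(ids)
```

IDs assigned this way depend on the order in which tokens are first seen, so build the vocabulary once and save it if the IDs need to stay stable across runs.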