IBM / regression-transformer

Regression Transformer (2023; Nature Machine Intelligence)
https://www.nature.com/articles/s42256-023-00639-z
MIT License

Tokenizing example error #7

Closed: pjuangph closed this issue 1 year ago

pjuangph commented 1 year ago

I get an error about 'examples' not being defined when I try to run your code from the scripts folder.

Steps:

  1. cd scripts
  2. Add the code below to a file
  3. Run it from the scripts folder

It does work when I replace "examples" with "bert-base-uncased", but I don't get the same token indices.

Do I need to install terminator beforehand? Does "examples" correspond to the examples folder?

# I added these 3 lines 
import sys
sys.path.insert(0,'terminator')
from tokenization import ExpressionBertTokenizer

# This is your code 
from terminator.tokenization import ExpressionBertTokenizer
tokenizer = ExpressionBertTokenizer.from_pretrained('examples') # Error is happening here
text = '<qed>0.3936|CBr'
tokens = tokenizer.tokenize(text)
print(tokens)
# ['<qed>', '_0_0_', '_._', '_3_-1_', '_9_-2_', '_3_-3_', '_6_-4_', '|', 'C', 'Br']
token_indexes = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
print(token_indexes)
# [16, 17, 18, 28, 45, 34, 35, 19, 15, 63]
print(tokenizer.build_inputs_with_special_tokens(token_indexes))
# [12, 16, 17, 18, 28, 45, 34, 35, 19, 15, 63, 13]
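
For reference, a minimal sketch of how the 'examples' argument is typically resolved, assuming ExpressionBertTokenizer follows the Hugging Face from_pretrained() convention of loading a vocabulary file from a local directory. The directory path and the vocab.txt filename below are assumptions for illustration, not the repo's documented layout.

# Minimal sketch, assuming from_pretrained() accepts a local directory that
# contains a vocabulary file. The path 'examples' and the filename 'vocab.txt'
# are assumptions used for illustration.
import os
import sys

sys.path.insert(0, 'terminator')
from tokenization import ExpressionBertTokenizer

vocab_dir = 'examples'  # hypothetical directory holding the pretrained vocab
if os.path.isfile(os.path.join(vocab_dir, 'vocab.txt')):
    tokenizer = ExpressionBertTokenizer.from_pretrained(vocab_dir)
    print(tokenizer.tokenize('<qed>0.3936|CBr'))
else:
    print(f"No vocab.txt in {vocab_dir!r}; from_pretrained() needs a "
          "directory (or model name) that provides a vocabulary file.")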

I also get an error in ExpressionBertTokenizer. It seems the superclass doesn't accept do_lower_case=False being passed.
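
A rough way to check that, as a hedged diagnostic (the actual signature of ExpressionBertTokenizer.__init__ lives in the terminator package and is not reproduced here), is to inspect which keyword arguments the constructor declares before passing do_lower_case:

# Sketch: only pass do_lower_case if the constructor actually declares it
# (or takes **kwargs); otherwise drop it to avoid an unexpected-keyword error.
import inspect
import sys

sys.path.insert(0, 'terminator')
from tokenization import ExpressionBertTokenizer

params = inspect.signature(ExpressionBertTokenizer.__init__).parameters
accepts_it = 'do_lower_case' in params or any(
    p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
)
kwargs = {'do_lower_case': False} if accepts_it else {}
tokenizer = ExpressionBertTokenizer.from_pretrained('examples', **kwargs)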

pjuangph commented 1 year ago

I didn't create the vocab file.
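
If the vocab file is missing entirely: a BERT-style vocabulary is just a text file with one token per line, where the line index becomes the token id. The sketch below is purely illustrative; the real vocabulary ships with the pretrained model directories, and the token list here is only a guess drawn from the output above.

# Illustrative only: write a toy BERT-style vocab.txt (one token per line,
# line number = token id). The real vocabulary comes with the pretrained
# model; this token list is an assumption based on the tokens printed above.
tokens = [
    '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]',
    '<qed>', '_0_0_', '_._', '_3_-1_', '_9_-2_', '_3_-3_', '_6_-4_',
    '|', 'C', 'Br',
]
with open('vocab.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(tokens) + '\n')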