MolecularAI / Chemformer

Apache License 2.0

Question on representation #7

Closed: chao1224 closed this issue 2 years ago

chao1224 commented 2 years ago

Hi there, I found that you are using the second item in the sequence as the sentence-level representation, as shown here. I'm wondering why you don't take the first token (the CLS token) instead?

EBjerrum commented 2 years ago

The comment says: #in the 2rd element is the gene_symbol i.e. 1st in python, so depending on the gene symbol, it can create a different representation. So in our tokenization scheme, the second element (index 1 in Python) is the class. The first token, '^', is a start token that can be used for autoregressive training.
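
To make the indexing concrete, here is a minimal sketch in plain Python, not the actual Chemformer code. The helper names `build_tokens` and `sequence_representation`, the `<CLASS_A>` token, and the `[seq_len][batch][d_model]` layout of the hidden states are all assumptions made for illustration; only the '^' start token and the "second element" convention come from the reply above.

```python
START_TOKEN = "^"  # start token mentioned in the reply above


def build_tokens(class_token, body_tokens):
    """Prepend the start token and a class token to a tokenized sequence.

    Hypothetical helper: position 0 holds '^', position 1 holds the class.
    """
    return [START_TOKEN, class_token] + list(body_tokens)


def sequence_representation(hidden_states, position=1):
    """Return the hidden vectors at one sequence position for every batch item.

    hidden_states is a nested list assumed to be shaped
    [seq_len][batch][d_model]; position=1 selects the second element.
    """
    return hidden_states[position]


# "<CLASS_A>" is a made-up class token used only for this example.
tokens = build_tokens("<CLASS_A>", ["C", "C", "O"])

# Toy hidden states: one 2-dimensional vector per token, batch size 1.
hidden = [[[float(i), float(i) * 10.0]] for i in range(len(tokens))]

# Index 0 would always correspond to '^', so index 1 is taken instead.
rep = sequence_representation(hidden)
```

Because position 0 always holds the same start token, selecting position 1 picks the slot whose input actually varies with the class.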