Closed chao1224 closed 2 years ago
The comment says: #in the 2rd element is the gene_symbol i.e. 1st in python, so depending on the gene symbol, it can create a different representation. So for our tokenization scheme, the second element is the class. The first token '^' is a start token, that can be used for autoregressive training.
Hi there, I found that you are using the second item in the sequence for sentence-level representation, as shown here. I'm wondering why not taking the first token (the CLS token)?