yangzhao1230 closed this issue 3 months ago.
You can look at lines 131-132 of https://github.com/MAGICS-LAB/DNABERT_2/blob/main/finetune/train.py for more details. As with BERT, we explicitly tell the tokenizer that the current input contains two sequences; the tokenizer then concatenates the input_ids of both sequences and assigns distinct token_type_ids to them.
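To make the mechanism concrete, here is a minimal, self-contained sketch of what a BERT-style tokenizer produces when given a sequence pair (in Hugging Face `transformers` this is what calling `tokenizer(seq_a, seq_b)` does internally). The character-level vocabulary below is purely hypothetical for illustration; DNABERT-2 uses its own learned BPE tokenizer, so the actual token boundaries will differ.

```python
# Sketch of BERT-style paired-sequence encoding:
# [CLS] seq_a [SEP] seq_b [SEP], with token_type_ids 0 for the first
# segment and 1 for the second. Toy char-level vocab, for illustration only.

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def build_vocab(seqs):
    # Hypothetical vocab: special tokens first, then each character seen.
    vocab = {CLS: 0, SEP: 1, PAD: 2}
    for s in seqs:
        for ch in s:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode_pair(seq_a, seq_b, vocab):
    tokens = [CLS] + list(seq_a) + [SEP] + list(seq_b) + [SEP]
    input_ids = [vocab[t] for t in tokens]
    # Segment 0 covers [CLS] seq_a [SEP]; segment 1 covers seq_b [SEP].
    token_type_ids = [0] * (len(seq_a) + 2) + [1] * (len(seq_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

vocab = build_vocab(["ACGT", "TTAG"])
enc = encode_pair("ACGT", "TTAG", vocab)
print(enc["token_type_ids"])  # six 0s (first segment) then five 1s (second)
```

With a real tokenizer you would simply pass both sequences to one call, e.g. `tokenizer(dna_a, dna_b, truncation=True)`, and the model's pooled `[CLS]` representation is then fed to the classification head, so the two sequences are not naively concatenated as raw text: the `[SEP]` token and the token_type_ids let the model distinguish them.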
This task requires two sequences as input, but neither the paper nor the code provides a detailed implementation. Could you please provide a simple example of how to do this? For instance, simply concatenating the two sequences and feeding the result into the model to predict the labels seems unreasonable. I would like to know the specific approach.